
Consistent Spectral Clustering in Hyperbolic Spaces

Sagar Ghosh Computer and Communication Sciences Division, Indian Statistical Institute, Kolkata Swagatam Das Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata
Abstract

Clustering, as an unsupervised technique, plays a pivotal role in various data analysis applications. Among clustering algorithms, Spectral Clustering on Euclidean spaces has been extensively studied. However, with the rapid evolution of data complexity, Euclidean space is proving inefficient for representing such data and for the learning algorithms that operate on it. Although Deep Neural Networks on hyperbolic spaces have gained recent traction, clustering algorithms and other non-deep machine learning models on non-Euclidean spaces remain underexplored. In this paper, we propose a spectral clustering algorithm on hyperbolic spaces to address this gap. Hyperbolic spaces offer advantages in representing complex data such as hierarchical and tree-like structures, which cannot be embedded efficiently in Euclidean spaces. Our proposed algorithm replaces the Euclidean similarity matrix with an appropriate hyperbolic similarity matrix, demonstrating improved efficiency compared to clustering in Euclidean spaces. Our contributions include the development of the spectral clustering algorithm on hyperbolic spaces and the proof of its weak consistency. We show that our algorithm converges at least as fast as Spectral Clustering on Euclidean spaces. To illustrate the efficacy of our approach, we present experimental results on the Wisconsin Breast Cancer Dataset, highlighting the superior performance of Hyperbolic Spectral Clustering over its Euclidean counterpart. This work opens up avenues for utilizing non-Euclidean spaces in clustering algorithms, offering new perspectives for handling complex data structures and improving clustering efficiency.

Keywords— Hyperbolic Space, Spectral Clustering, Hierarchical Structure, Consistency

1 Introduction

In the realm of machine learning, the pivotal process of categorizing data points into cohesive groups remains vital for uncovering patterns, extracting insights, and facilitating various applications such as customer segmentation, anomaly detection, and image understanding [1]. Among the paradigms of clustering algorithms [2], alongside Partitional, Hierarchical, and Density-based techniques, Spectral Clustering on Euclidean spaces has garnered extensive research attention [3]. Spectral clustering operates in the spectral domain, utilizing the eigenvalues and eigenvectors of the Laplacian of a similarity graph constructed from the data. This algorithm initially constructs a similarity graph, where nodes represent data points and edges indicate pairwise similarities or affinities among the data points. It then computes the graph Laplacian matrix, capturing the graph’s structural properties and encoding relationships among the data points. Since its inception [see [4] and [5]], the Euclidean version of Spectral Clustering has significantly evolved. For simple clustering with two labels, this method considers the eigenvector corresponding to the second smallest eigenvalue of a specific graph Laplacian constructed from the affinity matrix based on the spatial position of the sample data. It then performs 2-means clustering on the rows of the eigenmatrix [whose columns consist of the eigenvectors corresponding to the smallest two eigenvalues], treating the rows as sample data points, and finally returns the cluster labels to the original dataset. This particular form of spectral clustering finds applications [6] in Speech Separation [7], Image Segmentation [8], Text Mining [9], VLSI design [10], and more. A comprehensive tutorial is also available at [3]. Here, we will briefly review the most commonly used Euclidean Spectral Clustering Algorithm.

When connectedness is a crucial criterion for the clustering algorithms, the conventional form of Spectral Clustering emerges as a highly effective approach. It transforms the standard data clustering problem in a given Euclidean space into a graph partitioning problem by representing each data point as a node in the graph. Subsequently, it determines the dataset labels by discerning the spectrum of the graph. Beginning with a set of data points $X:=\{x_{1},x_{2},...,x_{N}\}\subset\mathbb{R}^{l}$, a symmetric similarity function (also known as the kernel function) $k_{i,j}:=k(x_{i},x_{j})$, and the number of clusters $k$, we construct the similarity matrix $W:=[w(i,j)]=[k_{ij}]$. In its simplest form, Spectral Clustering treats the similarity matrix $W$ as a graph and aims to bipartition the graph so as to minimize the sum of weights across the edges between the two partitions. Mathematically, we try to solve an optimization problem [3] of the following form:

\displaystyle\min_{U\in\mathbb{R}^{N\times l}}C:=\min_{U\in\mathbb{R}^{N\times l}}\mathrm{Tr}(U^{t}L^{\prime}U)\quad\text{s.t. }U^{t}U=I, \qquad (1.1)

where $L:=D-W$ is the graph Laplacian, the degree matrix is $D:=diag(d_{1},d_{2},...,d_{N})$ with $d_{i}:=\sum_{j=1}^{N}w(i,j)$, and $L^{\prime}:=D^{-1/2}LD^{-1/2}=I-D^{-1/2}WD^{-1/2}$. Here $U$ is a label feature matrix, $l$ being the number of label features. Then, the simplest form of spectral clustering minimizes the trace over the feature matrix $U$ by taking the first $l$ eigenvectors of $L^{\prime}$ as its columns.

In order to extend the algorithm to form $k(>2)$ clusters, we consider the $k$ eigenvectors $u_{1},u_{2},...,u_{k}\in\mathbb{R}^{N}$ corresponding to the $k$ smallest eigenvalues of $L^{\prime}$. Then, treating the $N$ rows of the eigenmatrix $U:=[u_{1},u_{2},...,u_{k}]$ as sample data points in $\mathbb{R}^{k}$, we perform $k$-means clustering and label them. Finally, we return the labels to the original data points in the order they were taken to construct the similarity matrix $W$. Some of the variants of this algorithm can be found in [3].
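To make the pipeline concrete, the following minimal NumPy/scikit-learn sketch carries out the steps just described (Gaussian affinity, normalized Laplacian, the $k$ smallest eigenvectors, and $k$-means on the rows of the eigenmatrix); the bandwidth $\sigma$, the dense affinity matrix, and the function name are illustrative choices, not taken from any of the cited implementations.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def euclidean_spectral_clustering(X, k, sigma=1.0):
    # Pairwise squared Euclidean distances and Gaussian affinity matrix W
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / sigma**2)
    np.fill_diagonal(W, 0.0)

    # Degree matrix and normalized Laplacian L' = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # Eigenvectors of the k smallest eigenvalues form the columns of U
    _, U = eigh(L_sym, subset_by_index=[0, k - 1])

    # k-means on the rows of U gives the cluster labels
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# toy usage
labels = euclidean_spectral_clustering(np.random.rand(100, 2), k=2)
```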

Figure 1: (a) Dataset WBCD; (b) WBCD-ESCA clusters; (c) WBCD-HSCA clusters. Consider the Wisconsin Breast Cancer Dataset taken from the UCI Machine Learning Repository. The leftmost panel shows a t-SNE visualization of the clusters of two types of tumors: malignant (yellow dots) and benign (brown dots). Due to the presence of one predominant connected component, Euclidean spectral clustering forms only one cluster (with only one isolated point in the other cluster), whereas hyperbolic spectral clustering forms the clusters after separating the two hierarchies present in the data. Clearly, the hyperbolic clusters provide a much more accurate description of the dataset than its Euclidean counterpart.

This form of Euclidean Spectral Clustering works well on Euclidean datasets that are relatively simple and exhibit some distinct connectedness. But as data evolve rapidly and take extremely complex shapes, Euclidean space becomes an inefficient model space for capturing the intricate patterns in the data and thereby supporting learning algorithms in that set-up. Thus, visualizing data from a more complex topological perspective and developing unsupervised learning techniques on these complex shapes can have a consequential impact on revealing the true nature of the data. Although Deep Neural Networks on hyperbolic spaces have recently gained some attention and found applications in various computer vision problems [see [11], [12], [13], [14]], clustering algorithms and other non-deep machine learning models on non-Euclidean spaces remain one of the least explored domains. In this paper, we scrutinize this domain by proposing a spectral clustering algorithm on hyperbolic spaces. Euclidean space lacks the representational power to embed arbitrary tree/graph-like structures or hierarchical data in arbitrarily low dimensions, and it does not preserve the associated similarity/dissimilarity measures. Hyperbolic spaces, in contrast, can represent such structures efficiently [see [15], [16]]: even hyperbolic spaces of shallow dimension can be very powerful for representing these data, and clustering in these spaces can be far more effective than clustering in Euclidean spaces, as illustrated in Figure 1. Our main contributions in this paper are:

  • We propose the spectral clustering algorithm on hyperbolic spaces in which an appropriate hyperbolic similarity matrix replaces the Euclidean similarity matrix.

  • We also provide a theoretical analysis concerning the weak consistency of the algorithm and prove that it converges (in the sense of distribution) at least as fast as the spectral clustering on Euclidean spaces.

  • Advancing a step further, we present the hyperbolic versions of some of the well-known variants of Euclidean spectral clustering, such as Fast Spectral Clustering with approximate eigenvectors (FastESC) [18] and the Approximate Spectral Clustering with k-means-based Landmark Selection [39].

Having said that, we organize the rest of our paper in the following way. In Section 2, we give a brief overview of the works related to Euclidean Spectral Clustering and its variants. We also discuss why we need to consider a hyperbolic version of Euclidean Spectral Clustering and some of its variants. Section 3 lays out the mathematical background pertaining to our proposed algorithm, where we discuss several results that enable us to formulate the algorithm in a rigorous manner. We give the details of our proposed algorithm in Section 4 and discuss the motivation behind its steps. Section 5 is dedicated to the proof of the weak consistency of the proposed algorithm. Section 6 presents and discusses the experimental results. Finally, conclusions are drawn in Section 7.

2 Related Works

We commence with a brief overview of prominent variants of Euclidean spectral clustering:

  1. 1.

    Bipartite Spectral Clustering on Graphs (ESCG): Introduced by Liu et al. [17] in 2013, this algorithm primarily aims to reduce the time complexity of the spectral decomposition of the affinity matrix by appropriately transforming the input similarity matrix of a graph dataset. The method involves randomly selecting $d(\ll n)$ seeds from a given graph of input size $n$, followed by generating $d$ supernodes using Dijkstra's algorithm to find the shortest distance from the graph nodes to the seeds. This process reduces the size of the similarity matrix to $\tilde{W}:=RW$, where $R$ is the indicator matrix of size $d\times n$ and $W$ is the original affinity matrix. Subsequently, it proceeds with the spectral decomposition of the normalized $Z:=D_{2}^{-1/2}\tilde{W}D_{1}^{-1/2}$, where $D_{1}$ and $D_{2}$ are diagonal matrices containing the column and row sums of $\tilde{W}$, respectively. The algorithm computes the $k$ largest eigenvectors of $ZZ^{t}$ and generates $k$ clusters based on the $k$-means algorithm on the matrix $U:=D_{1}^{-1/2}X$, where $X$ is the right singular matrix in the singular value decomposition of $Z$.

  2. 2.

    Fast Spectral Clustering with approximate eigenvectors (FastESC): Developed by He et al. [18] in 2019, this algorithm initially performs $k$-means clustering on the dataset with a number of clusters greater than the true number of clusters and then conducts spectral clustering on the centroids obtained from the $k$-means. Similar to ESCG, this algorithm also focuses on reducing the size of the input similarity matrix for spectral clustering.

  3. 3.

    Low Rank Representation Clustering (LRR): Assuming a lower-rank representation of the dataset $X:=[x_{1},x_{2},...,x_{n}]$, where each $x_{i}$ is the $i$-th data vector in $\mathbb{R}^{D}$, this algorithm aims to solve an optimization problem to minimize the rank of a matrix $Z$ subject to $X=AZ$. Here, $A=[a_{1},a_{2},...,a_{m}]$ is a dictionary, and $Z:=[z_{1},z_{2},...,z_{n}]$ is the coefficient matrix representing $x_{i}$ in a lower-dimensional subspace. The algorithm iteratively updates $Z$ and an error matrix $E$, as proposed by Liu et al. in [19].

Among other variants of Spectral Clustering, such as the Ultra-Scalable Spectral Clustering Algorithm (U-SPEC) [20] or Constrained Laplacian Rank Clustering (CLR) [21], the primary objective remains consistent: to enhance efficiency by reducing the burden of the spectral decomposition step through minimizing the size of the input similarity matrix. However, there has been minimal exploration regarding the translation of these algorithms into a hyperbolic setup. In this context, our work marks an initial attempt to lift non-deep machine learning algorithms beyond Euclidean spaces.

Consider a dataset with a hierarchical structure, represented as a top-to-bottom tree. As we traverse nodes away from the root, the distance among nodes at the same depth but with different parents grows exponentially with respect to their heights. This exponential growth renders Euclidean space unsuitable as a model space for representing the hierarchy. Hyperbolic spaces are better suited for this purpose, since distances there grow exponentially as we move away from the origin/root node. Some of these issues have been addressed in a series of papers [11], [12], [14] in the context of Deep Neural Networks. Recently, in the context of downstream self-supervised tasks, the authors in [22] presented scalable Hyperbolic Hierarchical Clustering (sHHC), which learns continuous hierarchies in hyperbolic space. Leveraging these hierarchies from sound and vision data, hierarchical pseudo-labels are constructed, leading to competitive performance in activity recognition compared to recent self-supervised learning models. In this work, we propose a general-purpose spectral clustering algorithm on a chosen hyperbolic space after embedding the original Euclidean dataset into the space in a particular way that minimally perturbs the inherent hierarchy.

3 Mathematical Preliminaries

For detailed exposure to the theoretical treatment, refer to [23], [24], and [25].

3.1 Smooth Manifold

Manifolds:

A topological space $M$ is locally Euclidean of dimension $n$ if for every $p\in M$, there exists a neighbourhood $U$ and a map $\phi:U\to\phi(U)$, with $\phi(U)$ open in $\mathbb{R}^{n}$, which is a homeomorphism. We call such a pair $(U,\phi:U\to\phi(U))$ a chart or coordinate system on $U$. An $n$-dimensional topological manifold is a topological space which is Hausdorff, second countable, and locally Euclidean of dimension $n$ [see [24]].

Charts and Atlas:

We say two charts $(U,\phi)$ and $(V,\psi)$ are $C^{\infty}$ compatible if and only if $\phi\circ{\psi}^{-1}:\psi(U\cap V)\to\phi(U\cap V)$ and $\psi\circ{\phi}^{-1}:\phi(U\cap V)\to\psi(U\cap V)$ are $C^{\infty}$ maps between Euclidean spaces. A $C^{\infty}$ atlas, or simply an atlas, on a topological manifold $M$ is a collection $\mathcal{A}:=\{(U_{\alpha},\phi_{\alpha})\}_{\alpha\in I}$ such that the charts are pairwise $C^{\infty}$ compatible and $M=\cup_{\alpha\in I}U_{\alpha}$. It is worth noting that any atlas is contained in a unique maximal atlas.

Smooth Manifold:

A smooth or $C^{\infty}$ manifold $M$ is a topological manifold equipped with a maximal atlas. Throughout this article, we will exclusively refer to a smooth manifold when discussing manifolds.

Consider a point $p\in\mathbb{R}^{n}$. If two functions coincide on certain neighborhoods around $p$, their directional derivatives will also be identical. This allows us to establish an equivalence relation among all $C^{\infty}$ functions defined on some neighborhood of $p$. Let $U$ be a neighborhood of $p$, and consider all $f:U\to\mathbb{R}$ where $f$ is $C^{\infty}$. We define two functions $(f,U)$ and $(g,V)$ as equivalent if there exists a neighborhood $W\subseteq U\cap V$ such that $f=g$ when restricted to $W$. This equivalence relation on all $C^{\infty}$ functions at $p$ forms what is known as the set of germs of $C^{\infty}$ functions on $\mathbb{R}^{n}$ at $p$, denoted as $C_{p}^{\infty}(\mathbb{R}^{n})$. Similarly, this notion extends to manifolds through charts, where $C_{p}^{\infty}(M)$ represents the germs of all $C^{\infty}$ functions on $M$ at $p$.

Derivation and Tangent Space:

A derivation $\mathbb{D}$ at a point $p\in M$ is a linear map

\displaystyle\mathbb{D}:C_{p}^{\infty}(M)\to\mathbb{R},

satisfying Leibniz's rule:

\displaystyle\mathbb{D}(fg)=\mathbb{D}(f)g(p)+f(p)\mathbb{D}(g).

A tangent vector at $p$ is a derivation at $p$. The vector space of all tangent vectors at $p$ is denoted by $T_{p}(M)$.

Let $(U,\phi)$ be a chart around $p$ and write $\phi=(x^{1},x^{2},...,x^{n})$. Let $r^{1},r^{2},...,r^{n}$ be the standard coordinates on $\mathbb{R}^{n}$, so that $x^{i}=r^{i}\circ\phi:U\to\mathbb{R}$. If $f\in C_{p}^{\infty}(M)$, then $\frac{\partial}{\partial x^{i}}\big|_{p}(f):=\frac{\partial}{\partial r^{i}}\big|_{\phi(p)}(f\circ\phi^{-1})$. Every $v\in T_{p}(M)$ can be written as $v=\sum_{i=1}^{n}a_{i}\frac{\partial}{\partial x^{i}}\big|_{p}$.

Vector Fields:

A vector field on an open subset $U$ of $M$ is a function that assigns to each point $p\in U$ a tangent vector $X_{p}\in T_{p}(M)$. Since $\{\frac{\partial}{\partial x^{i}}\big|_{p}\}_{1\leq i\leq n}$ forms a local basis of $T_{p}(M)$, we can write $X_{p}=\sum_{i=1}^{n}a_{i}(p)\frac{\partial}{\partial x^{i}}\big|_{p}$.

Cotangent Space and Differential Forms:

Let us also talk briefly about the dual of the tangent space, also known as the cotangent space. It consists of all covectors, or linear functionals, on the tangent space $T_{p}(M)$ and is denoted by $T^{*}_{p}(M)$. Similarly, we can define covector fields, or differential 1-forms, on an open set $U$ of $M$ that assign to each point $p$ a covector $\omega_{p}\in T^{*}_{p}(M)$. Staying consistent with the notation, $\{dx^{i}|_{p}\}_{1\leq i\leq n}$ forms a local basis of the cotangent space dual to $\{\frac{\partial}{\partial x^{i}}\big|_{p}\}_{1\leq i\leq n}$, i.e.

\displaystyle dx^{i}|_{p}\left(\frac{\partial}{\partial x^{j}}\Big|_{p}\right)=\delta^{i}_{j},

where $\delta^{i}_{j}=1$ when $i=j$ and is 0 otherwise.

Tensor Product:

If $f$ is a $k$-linear function and $g$ is an $l$-linear function on a vector space $V$, their tensor product is defined to be the $(k+l)$-linear function $f\otimes g$ given by

\displaystyle f\otimes g(v_{1},v_{2},...,v_{k+l})=f(v_{1},...,v_{k})g(v_{k+1},...,v_{k+l}),

where $v_{1},...,v_{k+l}$ are arbitrary vectors in $V$.

3.2 Riemannian Manifold

Riemannian Manifolds:

A Riemannian manifold $M$ is a smooth manifold equipped with a smooth Riemannian metric [23], i.e. a $C^{\infty}$ map $g:M\to T^{*}M^{\otimes 2}$ such that for every $p\in M$, $g_{p}\in T^{*}_{p}M^{\otimes 2}$ is a (symmetric, positive definite) bilinear functional on $T_{p}(M)$, i.e. $g_{p}:T_{p}(M)\times T_{p}(M)\to\mathbb{R}$. If $\phi=(x^{1},x^{2},...,x^{n})$ is a chart on $U\subseteq M$, taking $e_{i}=\frac{\partial}{\partial x^{i}}$ and $e^{*}_{i}=dx^{i}$,

\displaystyle g=\sum_{i,j}g(e_{i},e_{j})e^{*}_{i}\otimes e^{*}_{j}.
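As a concrete example that is central to this paper, the Poincaré disc model of $\mathbb{H}^{n}$ carries the Riemannian metric

\displaystyle g_{x}=\frac{4}{\left(1-\|x\|^{2}\right)^{2}}\sum_{i=1}^{n}dx^{i}\otimes dx^{i},\qquad x\in\mathbb{D}^{n},

which is conformal to the Euclidean metric and has constant sectional curvature $-1$.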

Now we state an elementary result from the Riemannian Geometry without proof.

Proposition 3.1:

Every smooth manifold admits a Riemannian Metric.

Corollary 1:

The Usual Spectral Clustering Algorithm on Euclidean Spaces can be applied to Riemannian Manifolds with the Euclidean Metric replaced by Riemannian Metric, at least locally. In particular it can be applied to compact subsets of model hyperbolic spaces with the corresponding hyperbolic metric.

Length and Distance:

Let $\gamma:[a,b]\to(M,g)$ be a piecewise smooth curve. The length of $\gamma$ is defined as $L(\gamma):=\int_{a}^{b}g_{\gamma(t)}(\gamma^{\prime}(t),\gamma^{\prime}(t))^{1/2}\,dt$. Given two points $p,q\in M$, we define the distance between $p$ and $q$ as $d_{g}(p,q):=\inf\{L(\gamma)$ such that $\gamma$ is a piecewise smooth curve with $\gamma(a)=p,\gamma(b)=q\}$. This distance is also known as the geodesic distance between $p$ and $q$ on $M$.

Proposition 3.2: Let $M$ be a Riemannian manifold with Riemannian metric $g$. Then $M$ is a metric space with the distance function $d_{g}$ defined as above. The metric topology on $M$ coincides with the manifold topology.

Let $\chi(M)$ denote the space of all $C^{\infty}$ vector fields on $M$, and let $X\in\chi(M)$. Viewing $M\subseteq\mathbb{R}^{n}$, we have $X:M\to TM\hookrightarrow\mathbb{R}^{n}\times\mathbb{R}^{n}$ with $X(p)=(p,v)$. Considering only the second component of the vector field, $X:M\subseteq\mathbb{R}^{n}\to\mathbb{R}^{n}$, and hence $dX_{p}:T_{p}M\to\mathbb{R}^{n}$. Given $v\in T_{p}M$, we define the covariant derivative of $X$ in the direction of $v$ by $\nabla_{v}X=dX_{p}(v)^{t}\in T_{p}M$. For $Y\in\chi(M)$, $\nabla_{Y}X$ is defined as $(\nabla_{Y}X)(p)=\nabla_{Y(p)}X$, $p\in M$. This defines a map $\nabla:\chi(M)\times\chi(M)\to\chi(M)$ satisfying

  1. (i)

    $\nabla_{fX+gY}Z=f\nabla_{X}Z+g\nabla_{Y}Z,\ \forall X,Y,Z\in\chi(M)$ and $f,g\in C^{\infty}(M)$.

  2. (ii)

    $\nabla_{X}(\alpha Y+\beta Z)=\alpha\nabla_{X}Y+\beta\nabla_{X}Z,\ \forall\alpha,\beta\in\mathbb{R}$ and $X,Y,Z\in\chi(M)$.

  3. (iii)

    $\nabla_{X}(fY)=f\nabla_{X}Y+(Xf)Y,\ \forall X,Y\in\chi(M)$ and $f\in C^{\infty}(M)$.

This $\nabla$ is called an induced connection on $M$ and it satisfies two properties:

  1. 1.

    (Symmetry) $\nabla_{X}Y-\nabla_{Y}X=[X,Y]=XY-YX$.

  2. 2.

    (Metric Compatibility) $X\langle Y,Z\rangle=\langle\nabla_{X}Y,Z\rangle+\langle Y,\nabla_{X}Z\rangle,\ \forall X,Y,Z\in\chi(M)$.

This $\nabla$ on $(M,g)$ is called the Levi-Civita connection or the Riemannian connection. It can be proved that there exists a unique such connection on a given Riemannian manifold $(M,g)$ satisfying symmetry and metric compatibility.

Curvature Operator:

The curvature operator on $M$ is defined as the $\mathbb{R}$-trilinear map
$R:\chi(M)\times\chi(M)\times\chi(M)\to\chi(M)$, $(X,Y,Z)\mapsto R(X,Y)Z:=\nabla_{Y}\nabla_{X}Z-\nabla_{X}\nabla_{Y}Z+\nabla_{[X,Y]}Z$.

Sectional Curvature:

Let $p\in M$ and let $\sigma\subset T_{p}M$ be a two-dimensional subspace spanned by $x,y$. The curvature of $M$ in the direction of $\sigma$ is defined as

\displaystyle\kappa(p,\sigma):=\frac{\langle R(x,y)x,y\rangle}{|x\wedge y|^{2}}, \qquad (3.1)

where

\displaystyle|x\wedge y|^{2}=\|x\|^{2}\|y\|^{2}-|\langle x,y\rangle|^{2}. \qquad (3.2)

Hyperbolic Spaces:

An $n$-dimensional hyperbolic space, denoted $\mathbb{H}^{n}$, is the unique (up to isometry) simply connected, $n$-dimensional complete Riemannian manifold with constant sectional curvature $-1$. The renowned Killing-Hopf Theorem [26] affirms that any two such Riemannian manifolds are isometrically equivalent. While numerous isometric model spaces exist, we will only touch upon four of them briefly. The geodesics in these spaces are illustrated in Figure 2.

  1. 1.

    Poincaré Half Space Model: This is the upper half space $H:=\{(x_{1},x_{2},...,x_{n})\,|\,x_{n}>0,\ x_{i}\in\mathbb{R}\ \forall i\in\{1,2,...,n\}\}$. The metric on this space is defined as follows: suppose $p_{1}:=(x_{1},x_{2},...,x_{n})$ and $p_{2}:=(y_{1},y_{2},...,y_{n})$ are two points whose distance we want to compute, and let $\tilde{p_{1}}:=(x_{1},x_{2},...,x_{n-1},-x_{n})$ be the reflection of $p_{1}$ with respect to the plane $x_{n}=0$. Then

    \displaystyle d(p_{1},p_{2}):=2\sinh^{-1}\frac{\|p_{2}-p_{1}\|}{2\sqrt{x_{n}y_{n}}} \qquad (3.3)
    \displaystyle\phantom{d(p_{1},p_{2}):}=2\log\frac{\|p_{2}-p_{1}\|+\|p_{2}-\tilde{p_{1}}\|}{2\sqrt{x_{n}y_{n}}} \qquad (3.4)
  2. 2.

    Poincaré Disc Model: This is a model hyperbolic space in which all points lie inside the unit ball in $\mathbb{R}^{n}$ and geodesics are either diameters or circular arcs perpendicular to the unit sphere. The metric between two points $p_{1}$ and $p_{2}$ (with $\|p_{1}\|,\|p_{2}\|<1$) is defined as

    \displaystyle d(p_{1},p_{2}):=\cosh^{-1}\left(1+\frac{2\|p_{2}-p_{1}\|^{2}}{(1-\|p_{1}\|^{2})(1-\|p_{2}\|^{2})}\right).
  3. 3.

    Beltrami-Klein Model: This is a model hyperbolic space in which points are represented by points in the open unit ball in $\mathbb{R}^{n}$ and lines are represented by chords, i.e. straight line segments with ideal end points on the unit sphere. For two points $p,q\in B$, the open unit ball, the distance in this model is given as

    \displaystyle d(p,q):=\frac{1}{2}\log\left(\frac{\|aq\|\|pb\|}{\|ap\|\|qb\|}\right), \qquad (3.5)

    where $a,b$ are the ideal points on the boundary sphere of the line through $p$ and $q$, with $\|aq\|>\|ap\|$ and $\|pb\|>\|bq\|$.

  4. 4.

    Hyperboloid Model: This model hyperbolic space, also known as the Minkowski Model, is the forward sheet $S^{+}$ of the two-sheeted hyperboloid embedded in the $(n+1)$-dimensional Minkowski space. The hyperbolic distance between two points $u=(x_{0},x_{1},...,x_{n})$ and $v=(y_{0},y_{1},...,y_{n})$ is given as

    \displaystyle d(u,v):=\cosh^{-1}(-B(u,v)), \qquad (3.6)

    where

    \displaystyle B((x_{0},x_{1},...,x_{n}),(y_{0},y_{1},...,y_{n}))=-x_{0}y_{0}+\sum_{i=1}^{n}x_{i}y_{i}.

    Note that $B$ is simply the Minkowski dot product.
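To make these formulas concrete, the short NumPy sketch below evaluates the Poincaré disc distance and the hyperboloid (Minkowski) distance; the helper names are ours, and the clamping guards against floating-point round-off rather than being part of the model definitions.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-12):
    """Geodesic distance between points u, v inside the unit ball (Poincare disc model)."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / np.maximum(denom, eps))

def hyperboloid_distance(u, v):
    """Distance on the forward hyperboloid sheet using the Minkowski bilinear form B."""
    B = -u[0] * v[0] + np.dot(u[1:], v[1:])
    return np.arccosh(np.maximum(-B, 1.0))  # -B >= 1 on the hyperboloid; clamp round-off

# toy usage: two points in the 2-dimensional Poincare disc
x, y = np.array([0.1, 0.2]), np.array([-0.3, 0.4])
print(poincare_distance(x, y))
```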

Gyrovector Space:

The concept of Gyrovector Space, introduced by Abraham A. Ungar [see [27]], serves as a framework for studying vector space structures within Hyperbolic Space. This abstraction allows for the definition of special addition and scalar multiplications based on weakly associative gyrogroups. For a detailed geometric formalism of these operations, Vermeer’s work [28] provides an in-depth exploration.

In this context, we will briefly discuss Möbius gyrovector addition and Möbius scalar multiplication on the Poincaré Disc. Due to isometric transformations between hyperbolic spaces of different dimensions, the same additive and multiplicative structures can be obtained for other model hyperbolic spaces [see [29]]. Utilizing Möbius addition and multiplication is essential when evaluating intrinsic metrics like the Davies-Bouldin Score or Calinski-Harabasz Index to assess the performance of our proposed algorithm.

  1. 1.

    Möbius Addition: For two points $u$ and $v$ in the Poincaré Disc, the Möbius addition is defined as: $u\bigoplus_{K}v:=\frac{(1+2K\langle u,v\rangle+K\|v\|^{2})u+(1-K\|u\|^{2})v}{1+2K\langle u,v\rangle+K^{2}\|u\|^{2}\|v\|^{2}}$, where $K$ is the curvature and for hyperbolic spaces, $K=-1$.

  2. 2.

    Möbius Scalar Multiplication: For $r\in\mathbb{R}$, $c>0$ and $u$ in the Poincaré Disc, the scalar multiplication is defined as: $r\bigotimes_{c}u:=\frac{1}{\sqrt{c}}\tanh\left(r\tanh^{-1}(\sqrt{c}\|u\|)\right)\frac{u}{\|u\|}$. This addition and scalar multiplication satisfy the axioms pertaining to the gyrovector group [see [27]].
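A direct NumPy transcription of these two gyrovector operations is sketched below. We follow the common convention of a positive parameter c (the ball of curvature -c), so that c = 1 corresponds to the unit Poincaré disc; the helper names and the numerical safeguard are ours.

```python
import numpy as np

def mobius_add(u, v, c=1.0):
    """Mobius addition on the Poincare ball of curvature -c, c > 0."""
    uv = np.dot(u, v)
    u2, v2 = np.dot(u, u), np.dot(v, v)
    num = (1 + 2 * c * uv + c * v2) * u + (1 - c * u2) * v
    den = 1 + 2 * c * uv + c**2 * u2 * v2
    return num / den

def mobius_scalar(r, u, c=1.0, eps=1e-12):
    """Mobius scalar multiplication r (x)_c u."""
    norm = np.linalg.norm(u)
    if norm < eps:
        return np.zeros_like(u)
    return np.tanh(r * np.arctanh(np.sqrt(c) * norm)) * u / (np.sqrt(c) * norm)

# toy usage
u, v = np.array([0.1, 0.2]), np.array([-0.3, 0.1])
print(mobius_add(u, v), mobius_scalar(2.0, u))
```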

Figure 2: Geodesics in different model hyperbolic spaces: (a) Poincaré half space model; (b) Poincaré disc model; (c) Klein-Beltrami model; (d) Minkowski hyperboloid model.

4 Proposed Algorithm

Before discussing the proposed spectral clustering algorithm, we briefly recall an equivalent formulation of the Euclidean Spectral Clustering algorithm. In the Introduction (Section 1), we described the spectral clustering problem as a trace minimization problem. In its simplest form, if the two clusters are $A$ and $B$, then solving (1.1) is equivalent to minimizing the normalized cut (N-cut) objective

\displaystyle Ncut(A,B):=\frac{Cut(A,B)}{Vol(A)}+\frac{Cut(A,B)}{Vol(B)}, \qquad (4.1)

where $Cut(A,B):=\sum_{i\in A,j\in B}W(i,j)$ and $Vol(A):=\sum_{i\in A,j\in V}W(i,j)$, $Vol(B):=\sum_{i\in B,j\in V}W(i,j)$. But (4.1) is an NP-hard problem in general. To ease this problem, we instead solve another equivalent minimization problem

\displaystyle\min_{y:\,y_{i}\in\{\frac{1}{Vol(A)},\frac{-1}{Vol(B)}\}}J:=\frac{y^{t}Ly}{y^{t}Dy}\quad\text{subject to}\quad y^{t}D\mathbf{1}=0. \qquad (4.2)

Here we choose $y\in\mathbb{R}^{N}$ such that $y_{i}:=\begin{cases}\frac{1}{Vol(A)},&\text{if }i\in A\\ \frac{-1}{Vol(B)},&\text{if }i\in B.\end{cases}$

In order to solve (4.2), we construct the normalized graph Laplacians as

\displaystyle L^{\prime}:=D^{-1/2}LD^{-1/2}=I-D^{-1/2}WD^{-1/2}

or

\displaystyle L^{\prime\prime}:=D^{-1}L=I-D^{-1}W

and then convert (4.2) to

\displaystyle\min_{z:\,z^{t}D^{1/2}\mathbf{1}=0}J:=\frac{z^{t}\tilde{L}z}{z^{t}z}, \qquad (4.3)

where $\tilde{L}$ is either $L^{\prime}$ or $L^{\prime\prime}$. Thus the generalized eigenvalue problem $Lf=\lambda Df$ associated with (4.2) is transformed into an eigenvalue problem of the form $\tilde{L}f=\lambda f$. Since the smallest eigenvalue of $\tilde{L}$ is 0, the Rayleigh quotient (4.3) is minimized at the eigenvector corresponding to the next smallest eigenvalue. More generally, if the number of clusters is $k$, we have to choose the eigenvectors corresponding to the $k$ smallest eigenvalues.
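Explicitly, the substitution $z:=D^{1/2}y$ converts the Rayleigh quotient of (4.2) into that of (4.3):

\displaystyle\frac{y^{t}Ly}{y^{t}Dy}=\frac{z^{t}D^{-1/2}LD^{-1/2}z}{z^{t}z}=\frac{z^{t}L^{\prime}z}{z^{t}z},\qquad y^{t}D\mathbf{1}=0\iff z^{t}D^{1/2}\mathbf{1}=0.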

Having all these preliminaries and the equivalent formulations in place, we are now ready to discuss our proposed algorithm. At first, we need to carefully identify the appropriate model space on which we will consider our data points. Since the hyperbolic space of dimension $n$ is not bounded, if we consider data points spreading across the entire space, we might run into problems when verifying consistency. To remove this shortcoming, we will consider only a carefully chosen compact subset of our model space.

Since the boundary of the model hyperbolic space $\mathbb{H}^{n}$ is at infinite distance from the origin (for example, the plane $x_{n}=0$ in the Poincaré half space model or the unit sphere in the Poincaré disc model), we have to consider only a compact subset of $\mathbb{H}^{n}$ in order to talk about the general notion of convergence. More specifically, we will fix some $\delta>0$ and consider the following subsets of $\mathbb{H}^{n}$ for either the Poincaré half space model or the disc model:

  • Poincaré Half Space Model: Consider $H^{\prime}:=\{(x_{1},x_{2},...,x_{n})\,|\,x_{n}\geq\delta\}$.

  • Poincaré Disc Model: Consider $H^{\prime\prime}:=\{x\in\mathbb{D}^{n}\ \text{such that}\ \|x\|\leq(1-\delta)\}$.

Note that if we choose $\delta>0$ sufficiently small, then the clusters will not be affected by the choice of $\delta$. Since the formation of clusters depends only on the relative spatial positions of the data points, choosing a bigger $\delta$ only has a scaling effect. However, in order to utilize the hyperbolic nature of $\mathbb{H}^{n}$, we will choose $\delta$ to be very small, say of the order of $10^{-4}$.

Since all $n$-dimensional model hyperbolic spaces are isometric to each other by the Killing-Hopf Theorem, for the purpose of convergence it is reasonable to work with a compact subset of only one of the model hyperbolic spaces. For our purpose, we will only consider the above-mentioned compact subset of the Poincaré Disc Model and denote it by $H$; more explicitly, $H:=\{x\in\mathbb{D}^{n}\ \text{with}\ \|x\|\leq(1-\delta)\}$ equipped with the Poincaré metric

\displaystyle d(x_{1},x_{2}):=\cosh^{-1}\left(1+\frac{2\|x_{2}-x_{1}\|^{2}}{(1-\|x_{1}\|^{2})(1-\|x_{2}\|^{2})}\right).
Figure 3: Embedding of the dataset from Euclidean space into the Poincaré Disc; the left panel shows how a natural hierarchy looks in Euclidean space, and the right panel shows how the embedded dataset looks on the Poincaré Disc.

4.1 The Hyperbolic Spectral Clustering Algorithm (HSCA)

Let $X=\{x_{1},x_{2},...,x_{N}\}\subset\mathbb{R}^{l}$ for some $l\in\mathbb{N}$. We will follow the graph-based clustering methods, where the edge of a graph (in the Euclidean space) will be replaced by the corresponding geodesic in $H$. Below we elaborate on the steps of HSCA. For clarity, a pseudo-code of the algorithm is provided in Algorithm 1.

  1. 1.

    Embed the original dataset into the Poincaré Disc of unit radius: obtain $\mathcal{X^{\prime}}:=\{x_{1}^{\prime},x_{2}^{\prime},...,x_{N}^{\prime}\}\subset H$ with $x_{i}^{\prime}:=\frac{x_{i}}{\|x_{i}\|+\delta}$, where $\delta>0$ can be any small number, say $\delta=10^{-2}$, as illustrated in Figure 3.

  2. 2.

    Construct the graph $G(V,E)$ where $V=\{x_{1}^{\prime},x_{2}^{\prime},...,x_{N}^{\prime}\}$ and the edge set is $E:=\{e_{ij}\}$, where $e_{ij}$ is the geodesic from $x_{i}^{\prime}$ to $x_{j}^{\prime}$ on $H$.

  3. 3.

    For each geodesic edge $e_{ij}$ (from now on we will simply say edge), attach a weight $w_{ij}$ which measures the similarity between the respective nodes $x_{i}^{\prime}$ and $x_{j}^{\prime}$. The set of weights defines the proximity (or affinity) matrix, the $N\times N$ matrix $W$ such that

    \displaystyle W\equiv[W(i,j)]=[w_{ij}],\quad i,j\in\{1,2,...,N\}.

    The proximity matrix $W$ is symmetric. A common choice is the Gaussian kernel or the Poisson kernel (the same Gaussian or Poisson kernel as on Euclidean space, with the metric replaced by the geodesic distance on $H$). $W$ can be represented as

    \displaystyle W(i,j)=\begin{cases}\exp\left(-\frac{d_{g}(x_{i}^{\prime},x_{j}^{\prime})^{2}}{\sigma^{2}}\right),&\text{if }d_{g}(x_{i}^{\prime},x_{j}^{\prime})\leq\epsilon,\\ 0,&\text{otherwise,}\end{cases}

    for the Gaussian kernel, or

    \displaystyle W(i,j)=\begin{cases}\exp\left(-\frac{d_{g}(x_{i}^{\prime},x_{j}^{\prime})}{2\sigma}\right),&\text{if }d_{g}(x_{i}^{\prime},x_{j}^{\prime})\leq\epsilon,\\ 0,&\text{otherwise,}\end{cases}

    for the Poisson kernel, where $\epsilon>0$ is a pre-defined cut-off distance set by the user.

  4. 4.

    Now treat $W$ as the new data matrix, i.e. treat each row (or column, since $W$ is symmetric) of $W$ as a new data point lying in $\mathbb{R}^{N}$; call the $i$-th row of $W$ by $w_{i}$, $i\in\{1,2,...,N\}$, where each $w_{i}$ is a representative of the original data point $x_{i}$. Compute the modified affinity matrix $W^{\prime}$ with entries $W^{\prime}(i,j):=\exp\left(-\frac{\|w_{i}-w_{j}\|^{2}}{\sigma^{2}}\right)$.

  5. 5.

    Next compute the degree matrix $D:=diag(d_{1},d_{2},...,d_{N})$, where $d_{i}:=\sum_{j=1}^{N}W^{\prime}(i,j)$. Construct the graph Laplacian $L:=D-W^{\prime}$ and then the normalized graph Laplacian $\tilde{L}:=D^{-1/2}LD^{-1/2}=I-D^{-1/2}W^{\prime}D^{-1/2}$.

  6. 6.

    Compute the first $k$ eigenvectors $u_{1},u_{2},...,u_{k}$ of the normalized graph Laplacian $\tilde{L}$ corresponding to its first $k$ eigenvalues $0=\lambda_{1}\leq\lambda_{2}\leq...\leq\lambda_{k}$. Call this matrix $U:=[u_{1}...u_{k}]\in\mathbb{R}^{N\times k}$. Normalize the rows of $U$ to construct $T\in\mathbb{R}^{N\times k}$ such that

    \displaystyle T(i,j):=\frac{U(i,j)}{\sqrt{\sum_{l=1}^{k}U(i,l)^{2}}}.
  7. 7.

    For $i=1,2,...,N$, let $y_{i}\in\mathbb{R}^{k}$ be the $i$-th row of $T$.

  8. 8.

    Form clusters $C_{1},C_{2},...,C_{k}$ from the points $\{y_{i}\}_{i=1}^{N}$ using the $k$-means algorithm. The clusters are $C_{j}:=\{i\,|\,y_{i}\in C_{j}\}$.

4.1.1 Explanation:

Step 1 is clear: since our model space is a subset of the Poincaré Disc, we have to embed each data point into that space. While doing so, we divide each point by its Euclidean norm plus a small positive number. This ensures that data points in the same hierarchy remain close after the embedding. Next, we compute the affinity matrix $W$ by treating each edge between points as a geodesic edge and absorbing the Poincaré metric into the corresponding kernel. We then construct the modified similarity matrix $W^{\prime}$, guided by the idea that if the points $x_{i}^{\prime}$ and $x_{j}^{\prime}$ in the embedded dataset are close to the same set of points, then they should ideally belong to the same cluster. The spectral decomposition of the modified similarity matrix $W^{\prime}$ ensures that the representative points given by the rows of $W$ form clusters close to the ground truth. The rest of the steps are taken from the well-known normalized Euclidean Spectral Clustering Algorithm proposed by Ng, Jordan and Weiss [30].

Algorithm 1 Hyperbolic Spectral Clustering Algorithm (HSCA)

Input: Dataset $\mathcal{X}:=\{x_{1},x_{2},...,x_{N}\}\subset\mathbb{R}^{n}$, number of clusters $k$, hyperparameter $\sigma$, cut-off length $\epsilon$.

Output: Cluster labels $\mathcal{C}:=\{C_{1},C_{2},...,C_{k}\}$ where $C_{i}:=\{j\,|\,x_{j}\in C_{i}\}$.

1: Transformation: Obtain $\mathcal{X^{\prime}}:=\{x_{1}^{\prime},x_{2}^{\prime},...,x_{N}^{\prime}\}\subset H$ such that $x_{i}^{\prime}:=\frac{x_{i}}{\|x_{i}\|+\delta}$, where $\delta>0$ is small, say $\delta=10^{-2}$.
2: Constructing the Similarity Matrix: Construct $W\in\mathbb{R}^{N\times N}$ with $W(i,j):=\begin{cases}\exp\left(-\frac{d_{g}(x_{i}^{\prime},x_{j}^{\prime})^{2}}{\sigma^{2}}\right),&\text{if }d_{g}(x_{i}^{\prime},x_{j}^{\prime})\leq\epsilon\\ 0,&\text{otherwise}\end{cases}$ or $W(i,j):=\begin{cases}\exp\left(-\frac{d_{g}(x_{i}^{\prime},x_{j}^{\prime})}{2\sigma}\right),&\text{if }d_{g}(x_{i}^{\prime},x_{j}^{\prime})\leq\epsilon\\ 0,&\text{otherwise}\end{cases}$
3: Constructing the New Dataset and Computing the Modified Affinity Matrix: Treat each row $w_{i}$ of $W$ as a new data point and compute $W^{\prime}(i,j):=\exp\left(-\frac{\|w_{i}-w_{j}\|^{2}}{\sigma^{2}}\right)$.
4: Constructing the Degree Matrix: Construct the diagonal degree matrix $D\in\mathbb{R}^{N\times N}$ with $D(i,j):=\begin{cases}\sum_{j=1}^{N}W^{\prime}(i,j),&\text{if }i=j\\ 0,&\text{otherwise}.\end{cases}$
5: Constructing the Normalized Graph Laplacian: Obtain $L:=D-W^{\prime}\in\mathbb{R}^{N\times N}$, then construct $\tilde{L}:=D^{-1/2}LD^{-1/2}\in\mathbb{R}^{N\times N}$.
6: Spectral Decomposition of the Normalized Graph Laplacian: Obtain the first $k$ eigenvalues of $\tilde{L}$, $0=\lambda_{1}\leq\lambda_{2}\leq...\leq\lambda_{k}$, and the corresponding eigenvectors $u_{1},...,u_{k}\in\mathbb{R}^{N}$. Let $U:=[u_{1},u_{2},...u_{k}]\in\mathbb{R}^{N\times k}$.
7: Normalizing the Eigenmatrix: Normalize the rows of $U$ to obtain $T\in\mathbb{R}^{N\times k}$ with $T(i,j):=\frac{U(i,j)}{\sqrt{\sum_{l=1}^{k}U(i,l)^{2}}}$.
8: Representative Points in $\mathbb{R}^{k}$: Let $\mathcal{Y}:=\{y_{1},y_{2},...y_{N}\}\subset\mathbb{R}^{k}$, where $y_{i}$ represents $x_{i}$ for $i\in\{1,2,...,N\}$ and $y_{i}^{j}=T(i,j)$.
9: Cluster Formation: Obtain the clusters $C_{1},C_{2},...,C_{k}$ by performing $k$-means clustering on $\mathcal{Y}$, where $C_{i}:=\{j\,|\,y_{j}\in C_{i}\}$.
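The following NumPy/scikit-learn sketch mirrors Algorithm 1 with the Gaussian kernel; the default values of $\delta$ and $\sigma$, the cut-off $\epsilon$ (here defaulting to no cut-off), and the dense double loop over pairs are illustrative simplifications rather than a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def poincare_dist(u, v):
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u**2)) * (1.0 - np.sum(v**2))
    return np.arccosh(1.0 + num / den)

def hsca(X, k, sigma=1.0, eps=np.inf, delta=1e-2):
    # Step 1: embed into the Poincare disc, x_i' = x_i / (||x_i|| + delta)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xp = X / (norms + delta)

    # Steps 2-3: geodesic Gaussian affinity with cut-off eps
    N = len(Xp)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            d = poincare_dist(Xp[i], Xp[j])
            if d <= eps:
                W[i, j] = W[j, i] = np.exp(-d**2 / sigma**2)

    # Step 4: modified affinity on the rows of W
    sq = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=-1)
    Wp = np.exp(-sq / sigma**2)

    # Step 5: normalized graph Laplacian
    deg = Wp.sum(axis=1)
    Ds = np.diag(1.0 / np.sqrt(deg))
    L_tilde = np.eye(N) - Ds @ Wp @ Ds

    # Steps 6-9: k smallest eigenvectors, row-normalize, k-means
    _, U = eigh(L_tilde, subset_by_index=[0, k - 1])
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(T)
```

Replacing the Gaussian weight by np.exp(-d / (2 * sigma)) gives the Poisson-kernel variant of step 2.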

Time Complexity:

The most computationally intensive step is the spectral decomposition of the similarity matrix, which takes $\mathcal{O}(n^{3})$ time [31]. Calculating the pairwise Poincaré distances and the similarity matrix each takes $\mathcal{O}(n^{2})$ time. Finally, applying the $k$-means step after the eigenvalue decomposition takes $\mathcal{O}(nldk)$ time, where $n$ is the number of data points, $k$ is the number of clusters, $l$ is the number of $k$-means iterations and $d$ is the dimension of the hyperbolic space $H$.

4.2 Approximated Hyperbolic Spectral Clustering with Poincaré k-Means Based Landmark Selection (HLSC-K)

Here we briefly describe another variant of HSCA, proposed in analogy with the Approximated Spectral Clustering with k-Means based Landmark Selection, a variant of ESCA. We assume the input dataset has the form $\mathcal{X}:=\{x_{1},x_{2},...,x_{N}\}\subset\mathbb{R}^{n}$. We also assume the number of landmark points is $m\in\mathbb{N}$, where $m$ has to be chosen carefully such that $m\gg k$, $k$ being the number of original clusters, while $m$ should not be very small compared to $N$. [For example, if the number of samples is 5000 and $k=10$, we can choose $m=100$ or $200$.] The steps of the algorithm are as follows [for details we refer to the supplementary]:

  1. 1.

    We obtain $\mathcal{X^{\prime}}:=\{x_{1}^{\prime},x_{2}^{\prime},...,x_{N}^{\prime}\}\subset H$ such that $x_{i}^{\prime}:=\frac{x_{i}}{\|x_{i}\|+\delta}$, where $\delta>0$ is small, say $\delta=10^{-2}$.

  2. 2.

    Next we perform an initial Poincaré $k$-means [this is $k$-means clustering with the Euclidean distance replaced by the Poincaré distance] to form $m$ clusters, namely $I_{1},I_{2},...,I_{m}$, with the centroids of these clusters being $y_{1},y_{2},...,y_{m}$ respectively.

  3. 3.

    Now we construct $V\in\mathbb{R}^{m\times N}$ such that $V(i,j):=\begin{cases}\exp\left(-\frac{d_{g}(y_{i},x_{j}^{\prime})^{2}}{\sigma^{2}}\right),&\text{if }d_{g}(y_{i},x_{j}^{\prime})\leq\epsilon\\ 0,&\text{otherwise}\end{cases}$ or $V(i,j):=\begin{cases}\exp\left(-\frac{d_{g}(y_{i},x_{j}^{\prime})}{2\sigma}\right),&\text{if }d_{g}(y_{i},x_{j}^{\prime})\leq\epsilon\\ 0,&\text{otherwise}\end{cases}$.

  4. 4.

    Then we normalize the columns of $V$, i.e., form the matrix $E\in\mathbb{R}^{m\times N}$ with $E(i,j):=\frac{V(i,j)}{\sum_{i=1}^{m}V(i,j)}$. Construct the diagonal matrix $D_{E}=diag(d_{1},d_{2},...,d_{m})\in\mathbb{R}^{m\times m}$ with $d_{i}:=\left(\sum_{j=1}^{N}E(i,j)\right)^{-1/2}$. Construct $Z:=D_{E}E\in\mathbb{R}^{m\times N}$ and form the final matrix $F:=Z^{t}Z$.

  5. 5.

    Finally, we follow steps 3-9 of HSCA (Algorithm 1) to form the clusters.
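Under the same conventions (and reusing the poincare_dist helper from the previous sketch), the landmark-based variant can be prototyped as below. Two assumptions on our part: the Poincaré k-means step is approximated by standard k-means on the embedded points, and the matrix F is fed to the remaining steps of Algorithm 1 as the affinity matrix; both are illustrative readings of the steps above rather than the exact procedure in the supplementary.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def hlsc_k(X, k, m=100, sigma=1.0, delta=1e-2):
    # Step 1: embed into the Poincare disc
    Xp = X / (np.linalg.norm(X, axis=1, keepdims=True) + delta)

    # Step 2: m landmark centroids (plain k-means as a stand-in for Poincare k-means)
    Y = KMeans(n_clusters=m, n_init=10).fit(Xp).cluster_centers_

    # Step 3: landmark-to-point Gaussian affinity V (m x N) from geodesic distances
    d = np.array([[poincare_dist(y, x) for x in Xp] for y in Y])
    V = np.exp(-d**2 / sigma**2)

    # Step 4: column-normalize, rescale rows, and form F = Z^t Z (N x N)
    E = V / V.sum(axis=0, keepdims=True)
    D_E = np.diag(E.sum(axis=1) ** -0.5)
    Z = D_E @ E
    F = Z.T @ Z

    # Remaining steps of Algorithm 1 with F as the affinity matrix
    deg = F.sum(axis=1)
    Ds = np.diag(1.0 / np.sqrt(deg))
    L_tilde = np.eye(len(Xp)) - Ds @ F @ Ds
    _, U = eigh(L_tilde, subset_by_index=[0, k - 1])
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(T)
```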

5 Verifying The Consistency of HSCA

Following our previous notation, for two points $x$ and $y$ on the Poincaré Disc the metric is given as

\displaystyle d(x,y)=2\sinh^{-1}\left(\sqrt{\frac{\delta(x,y)}{2}}\right),

where

\displaystyle\delta(x,y)=2\frac{\|x-y\|^{2}}{(1-\|x\|^{2})(1-\|y\|^{2})}.

The hyperbolic Gaussian kernel $K_{H_{G}}$ is given as

\displaystyle K_{H_{G}}(x,y)=\exp(-ad(x,y)^{2}),\quad a>0,

and the hyperbolic Poisson kernel $K_{H_{P}}$ is given as

\displaystyle K_{H_{P}}(x,y)=\exp(-ad(x,y)),\quad a>0.

At first, we will prove the consistency of the HSCA using Gaussian Kernel. The proof will be quite similar if we use the Poisson Kernel. Before that we will look at a couple of results involved in the consistency proof.

Lemma 5.1.

For the usual Euclidean Gaussian kernel given by $K(x,y)=\exp(-a\|x-y\|^{2})$, we have $K_{H_{G}}(x,y)\leq K(x,y)$ whenever $x,y\in H$.

Proof.

See Appendix A. ∎
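As a quick numerical sanity check of Lemma 5.1 (not a substitute for the proof in Appendix A), one can sample random pairs in the disc of radius $1-\delta$ and verify that the hyperbolic Gaussian kernel never exceeds the Euclidean one; the sampling scheme below is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
a, delta, n, dim = 1.0, 1e-4, 10000, 2

def poincare_dist(u, v):
    num = 2.0 * np.sum((u - v) ** 2, axis=-1)
    den = (1.0 - np.sum(u**2, axis=-1)) * (1.0 - np.sum(v**2, axis=-1))
    return np.arccosh(1.0 + num / den)

def sample(n):
    # points uniformly distributed in the ball of radius 1 - delta
    x = rng.normal(size=(n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    r = (1 - delta) * rng.random((n, 1)) ** (1 / dim)
    return x * r

x, y = sample(n), sample(n)
K_hyp = np.exp(-a * poincare_dist(x, y) ** 2)
K_euc = np.exp(-a * np.sum((x - y) ** 2, axis=-1))
print(np.all(K_hyp <= K_euc + 1e-12))   # expected: True
```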

Remark 5.1.

Lemma 5.1 also holds for the Poisson kernel. In step 3 we only need to use the function $f(x):=\sinh^{-1}(x)-\frac{x^{2}}{4}$ [which works, since $f^{\prime}(x)=\frac{1}{\sqrt{1+x^{2}}}-\frac{x}{2}\geq\frac{2-\sqrt{2}}{2\sqrt{2}}$ for $0\leq x\leq 1$] instead of $f(x):=\sinh^{-1}(x)-\frac{x}{2}$. There will be an extra constant $1/2$ in the exponent, but this does not affect the proof of Lemma 5.4 presented later.

Remark 5.2.

$K_{H_{G}}$ is radial: if we fix one variable, say $y=0$, then $\delta(x,0)=2\frac{\|x\|^{2}}{1-\|x\|^{2}}$, which is a radial function. Therefore the Poincaré metric is radial, and so is the hyperbolic Gaussian kernel.

Lemma 5.2.

The hyperbolic Gaussian kernel $K_{H_{G}}\in L^{1}(H)$, i.e. this kernel is absolutely integrable.

Proof.

See Appendix A. ∎

Terminology:

For a compact subset $\Omega\subset\mathbb{R}^{n}$ [with 0 in its interior], we call $\Omega$ symmetric if for every $x\in\Omega$ and every $M\in SO_{n}(\mathbb{R})$, we have $Mx\in\Omega$.

Lemma 5.3.

Suppose $\Omega\subset\mathbb{R}^{n}$ is symmetric, $f\in L^{1}(\Omega)$ and $f$ is radial. Then its Fourier transform is also radial.

Proof.

See Appendix A. ∎

Next we intend to use Theorem 3 of [32], and this necessitates computing the Fourier transform $\widehat{k}(w)$ of $k_{H_{G}}(x)$ and showing that $\widehat{k}$ decays exponentially.

Lemma 5.4.

There exist $C,l>0$ such that $\widehat{k}(w)\leq C\exp(-l|w|)$ for all $w\in\mathbb{R}^{n}$.

Proof.

See Appendix A. ∎

Terminology and Definitions:

Let $H$ be the compact subset of the Poincaré Disc defined above. Let $k:H\times H\to\mathbb{R}$ be the similarity function given by the Gaussian kernel equipped with the Poincaré metric, and let $h:H\times H\to\mathbb{R}$ be the normalized similarity function. Then for any continuous function $g\in\mathcal{C}(H)$, we define the following [as in Section 6 of [33]]:

\displaystyle\mathcal{K}:=\{k(x,\cdot):x\in H\},
\displaystyle\mathcal{H}:=\{h(x,\cdot):x\in H\},
\displaystyle g\cdot\mathcal{H}:=\{g(x)h(x,\cdot):x\in H\},
\displaystyle\text{and }\mathcal{H}\cdot\mathcal{H}:=\{h(x,\cdot)h(x,\cdot):x\in H\}.

We also define $\mathcal{F}:=\mathcal{K}\cup(g\cdot\mathcal{H})\cup(\mathcal{H}\cdot\mathcal{H})$. Now we re-state Theorem 19 of [33] with a slightly modified proof.

Theorem 5.1.

Let $(H,\mathcal{A},P)$ be a probability space with $\mathcal{A}$ being an arbitrary sigma algebra on $H$. Let $\mathcal{F}$ be defined as above with $\|f\|_{\infty}\leq 1$ for all $f\in\mathcal{F}$. Let $(X_{n})$ be a sequence of i.i.d. random variables drawn according to the distribution $P$, and let $P_{n}$ be the corresponding empirical distributions. Then there exists a constant $c>0$ such that for all $n\in\mathbb{N}$, with probability at least $1-\delta$,

\displaystyle\sup_{f\in\mathcal{F}}|P_{n}f-Pf|\leq\frac{c}{\sqrt{n}}\int_{0}^{\infty}\sqrt{\log\mathcal{N}(\epsilon,L^{2}(P_{n}))}\,d\epsilon+\sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)},

where $\mathcal{N}(\epsilon,L^{2}(P_{n}))$ is the covering number of the space $H$ by balls of radius $\epsilon$ with respect to the metric $L^{2}(P_{n})$. Hence the rate of convergence of the Hyperbolic Spectral Clustering is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\right)$.

Proof.

See Appendix A. ∎

Remark 5.3.

Note that in deriving the convergence rate of the hyperbolic spectral clustering, we mostly relied on results used in the proof of the consistency of spectral clustering in the Euclidean set-up. The hyperbolic metric is in general much stronger than the squared Euclidean metric, which forces the hyperbolic Gaussian/Poisson kernel to converge to 0 much faster than its Euclidean counterpart. Therefore, we believe the convergence rate of the hyperbolic spectral clustering can be improved, which requires a careful bound on the logarithmic covering number with respect to the hyperbolic metric.

6 Experiments and Results

6.1 Description of the Datasets:

To examine the clustering efficiency of our proposed algorithm, we use a total of 7 datasets: 4 real datasets (Airport, Glass, Zoo and Wisconsin) and 3 synthetic datasets (2d-20c-no0, D31, and st900). The ground cluster labels are not available for the Airport dataset. Details of the datasets are shown in Table 1. Since the Airport dataset is too large, we took random samples of sizes 1000, 2000, 5000 and 10000 to evaluate the clustering efficiency, assuming the number of ground clusters to be 5.

Table 1: Description of the datasets
Datasets No. of Samples Dimension Number of Clusters
Airport 1048575 196 -
Wisconsin 699 9 2
Glass 214 9 6
Zoo 101 16 7
2d-20c-no0 1517 2 20
st900 900 2 9
D31 3100 2 31
Table 2: Intrinsic Evaluation Metrics for Datasets
Methods Airport(1000) Airport(2000)
S. Score D.B. Score C.H. Index S. Score D.B. Score C.H. Index
ESCA(G) 0.68 0.75 78.62 0.83 0.12 189.66
HSCA(G) 0.24 0.31 1114.77 0.21 0.32 2023.64
Improvement 64.7% (\downarrow) 44%(\uparrow) 1317.9%(\uparrow) 74.7%(\downarrow) 166.7%(\downarrow) 967.0%(\uparrow)
ESCA(P) 0.70 0.40 140.02 0.82 0.40 93.31
HSCA(P) 0.21 0.32 1111.89 0.17 0.34 2024.71
Improvement 70.0% (\downarrow) 20%(\uparrow) 694.1%(\uparrow) 79.27%(\downarrow) 17.7%(\uparrow) 2069.87%(\uparrow)
Methods Airport(5000) Airport(10000)
S. Score D.B. Score C.H. Index S. Score D.B. Score C.H. Index
ESCA(G) 0.57 0.56 144.84 0.83 0.46 218.53
HSCA(G) 0.24 0.32 5299.59 0.24 0.33 10584.5
Improvement 57.9% (\downarrow) 42.9%(\uparrow) 3558.9%(\uparrow) 71.1%(\downarrow) 28.3%(\uparrow) 4755.1%(\uparrow)
ESCA(P) 0.57 0.56 144.83 0.83 0.26 218.53
HSCA(P) 0.21 0.34 5258.51 0.21 0.35 10428.3
Improvement 63.2% (\downarrow) 39.3%(\uparrow) 3531.1%(\uparrow) 74.7%(\downarrow) 34.6%(\downarrow) 4672.1%(\uparrow)
Methods Wisconsin Breast Cancer Glass
S. Score D.B. Score C.H. Index S. Score D.B. Score C.H. Index
ESCA(G) 0.41 0.49 3.44 0.53 0.28 9.94
HSCA(G) 0.01 0.91 131.54 0.42 0.10 111.66
Improvement 97.56% (\downarrow) 85.7%(\downarrow) 3723.8%(\uparrow) 20.8%(\downarrow) 64.3%(\uparrow) 1023.3%(\uparrow)
ESCA(P) 0.46 0.40 5.12 0.57 0.26 10.56
HSCA(P) 0.27 0.65 219.06 0.28 0.12 110.72
Improvement 41.3% (\downarrow) 62.5%(\downarrow) 4280.1%(\uparrow) 50.1%(\downarrow) 53.8%(\uparrow) 948.5%(\uparrow)
Methods Zoo 2d-20c-no0
S. Score D.B. Score C.H. Index S. Score D.B. Score C.H. Index
ESCA(G) 0.22 0.64 13.51 0.54 0.64 1953.87
HSCA(G) 0.32 0.16 49.62 0.47 0.16 9630.12
Improvement 45.5% (\uparrow) 75.0%(\uparrow) 267.3%(\uparrow) 12.9%(\downarrow) 75.0%(\uparrow) 393.9%(\uparrow)
ESCA(P) 0.23 0.96 5.94 0.35 0.71 2.27
HSCA(P) 0.35 0.17 50.85 0.54 0.10 8643.90
Improvement 52.2% (\uparrow) 82.3%(\uparrow) 756.1%(\uparrow) 54.3%(\uparrow) 85.9%(\uparrow) $38.1\times 10^{5}$%(\uparrow)
Methods st900 D31
S. Score D.B. Score C.H. Index S. Score D.B. Score C.H. Index
ESCA(G) -0.03 0.68 75.31 0.39 0.56 2365.22
HSCA(G) 0.47 0.86 1548.80 0.17 0.85 2102.17
Improvement 1667.7% (\uparrow) 26.5%(\downarrow) 1956.6%(\uparrow) 56.4%(\downarrow) 51.8%(\downarrow) 11.12%(\uparrow)
ESCA(P) -0.03 0.67 75.28 0.17 0.95 185.07
HSCA(P) 0.49 0.63 2388.67 0.50 0.57 69144.07
Improvement 1733.3% (\uparrow) 6.0%(\uparrow) 3073.1%(\uparrow) 194.1%(\uparrow) 40.0%(\uparrow) 37261%(\uparrow)

6.2 Performance Measure and Validity Index:

When the true cluster labels of the datasets are not known, we compare the Euclidean Spectral Clustering algorithm with the Hyperbolic Spectral Clustering Algorithm using three intrinsic evaluation metrics, namely

  • Silhouette Coefficient [S. Score] [34]

  • Davies-Bouldin score [D.B. Score] [35]

  • Calinski–Harabasz index [C.H. Index] [36]

Also, for simulated datasets, we know the true cluster labels and the number of clusters. There we will compare ESCA and HSCA by the extrinsic evaluation metrics, namely

  • Adjusted Rand Index (ARI) [37]

  • Normalized Mutual Information (NMI) [38]
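All five scores are available in scikit-learn, and a minimal sketch of how such an evaluation might be wired up is given below with placeholder data; note that, as discussed above, the intrinsic scores for the hyperbolic clusterings are computed with the hyperbolic metric and Möbius operations, whereas the sketch shows only the standard Euclidean versions.

```python
import numpy as np
from sklearn import metrics

X = np.random.rand(200, 2)               # data (or its embedding)
labels = np.random.randint(0, 3, 200)    # predicted cluster labels
truth = np.random.randint(0, 3, 200)     # ground-truth labels, if available

# Intrinsic scores (no ground truth needed)
print(metrics.silhouette_score(X, labels))         # higher is better, in [-1, 1]
print(metrics.davies_bouldin_score(X, labels))     # lower is better
print(metrics.calinski_harabasz_score(X, labels))  # higher is better

# Extrinsic scores (need ground-truth labels)
print(metrics.adjusted_rand_score(truth, labels))
print(metrics.normalized_mutual_info_score(truth, labels))
```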

Figure 4: t-SNE visualization of the Airport dataset and clusters: (a) Airport(10000); (b) A(10000)-ESCA clusters; (c) A(10000)-HSCA clusters.
Figure 5: t-SNE visualization of the Zoo dataset and clusters: (d) Dataset Zoo; (e) Zoo-ESCA clusters; (f) Zoo-HSCA clusters.
Figure 6: t-SNE visualization of the st900 dataset and clusters: (a) Dataset st900; (b) st900-ESCA clusters; (c) st900-HSCA clusters.
Table 3: Extrinsic Evaluation Metrics for Datasets for the variants of ESCA and ELSC-K
Methods Wisconsin Glass Zoo 2d-20c-no0 st900 D31
ARI NMI ARI NMI ARI NMI ARI NMI ARI NMI ARI NMI
ESCA(G) 0.01 0.01 0.01 0.06 0.26 0.37 0.67 0.86 0.16 0.35 0.56 0.84
HSCA(G) 0.77 0.66 0.23 0.36 0.53 0.70 0.76 0.87 0.72 0.76 0.22 0.60
Improvement 7600% (\uparrow) 6500%(\uparrow) 2200%(\uparrow) 500%(\uparrow) 104%(\uparrow) 89%(\uparrow) 13%(\uparrow) 1.1%(\uparrow) 350%(\uparrow) 117%(\uparrow) 60%(\downarrow) 29%(\downarrow)
ESCA(P) 0.01 0.01 0.01 0.05 0.14 0.34 0.12 0.42 0.17 0.35 0.09 0.45
HSCA(P) 0.16 0.24 0.25 0.38 0.57 0.76 0.60 0.82 0.63 0.71 0.29 0.63
Improvement 1500%(\uparrow) 2300%(\uparrow) 2400%(\uparrow) 660%(\uparrow) 307%(\uparrow) 124%(\uparrow) 400%(\uparrow) 95%(\uparrow) 354%(\uparrow) 103%(\uparrow) 222%(\uparrow) 40%(\uparrow)
ELSC-K(G) 0.82 0.72 0.23 0.40 0.70 0.78 0.54 0.79 0.72 0.78 0.94 0.96
HLSC-K(G) 0.84 0.73 0.25 0.39 0.71 0.78 0.55 0.80 0.75 0.81 0.95 0.97
Improvement 2.4%(\uparrow) 1.4%(\uparrow) 8.7%(\uparrow) 2.5%(\downarrow) 1.4%(\downarrow) 0.00%(-) 1.9%(\uparrow) 1.3%(\uparrow) 4.2%(\uparrow) 3.8%(\uparrow) 1%(\uparrow) 1%(\uparrow)
ELSC-K(P) 0.86 0.76 0.25 0.39 0.49 0.70 0.54 0.79 0.72 0.80 0.95 0.96
HLSC-K(P) 0.87 0.78 0.27 0.41 0.79 0.77 0.53 0.78 0.69 0.73 0.96 0.98
Improvement 1.1%(\uparrow) 2.6%(\uparrow) 8%(\uparrow) 2.5%(\uparrow) 5.1%(\uparrow) 10%(\uparrow) 1.9%(\downarrow) 1.3%(\downarrow) 4.2%(\downarrow) 8.75%(\downarrow) 1%(\uparrow) 1%(\uparrow)
FESC(G) 0.01 0.01 0.01 0.07 0.12 0.23 0.25 0.73 0.11 0.21 0.02 0.28
FHSC(G) 0.05 0.02 0.02 0.04 0.36 0.52 0.13 0.39 0.24 0.48 0.14 0.67
Improvement 400%(\uparrow) 100%(\uparrow) 100%(\uparrow) 43% (\uparrow) 200%(\uparrow) 126%(\uparrow) 48%(\downarrow) 47%(\downarrow) 118%(\uparrow) 129%(\uparrow) 600%(\uparrow) 139%(\uparrow)
FESC(P) 0.01 0.01 0.03 0.07 0.11 0.24 0.16 0.44 0.11 0.22 0.10 0.38
FHSC(P) 0.84 0.75 0.25 0.37 0.68 0.77 0.37 0.79 0.70 0.78 0.94 0.96
Improvement 8300%(\uparrow) 7400%(\uparrow) 733%(\uparrow) 429% (\uparrow) 518%(\uparrow) 221%(\uparrow) 131%(\uparrow) 79%(\uparrow) 536%(\uparrow) 254%(\uparrow) 840%(\uparrow) 153%(\uparrow)
LRR -0.01 0.00 -0.03 0.12 0.01 0.07 -0.00 0.04 0.00 0.04 0.00 0.03

6.3 Effectiveness of HSCA:

We now assess the performance of our proposed HSCA against the following well-known versions of the Spectral Clustering algorithms [with the Gaussian (G) and Poisson (P) kernels]:

  1. 1.

    Euclidean Spectral Clustering Algorithm [ESCA] [3]

  2. 2.

    Fast Euclidean Spectral Clustering Algorithm [18] [we have also developed a hyperbolic version of FESC, which we name the Fast Hyperbolic Spectral Clustering Algorithm (FHSC), analogously to HSCA, and included it in our comparison]

  3. 3.

    Low Rank Representation Clustering (LRR) [19]

  4. 4.

    Approximated spectral clustering with $k$-means-based landmark selection (ELSC-K) [39] [we have also compared it with the respective hyperbolic variant HLSC-K, with respect to the Gaussian (G) and Poisson (P) kernels]. See the Supplementary for a pseudo-code of the algorithm.

6.4 Discussions:

When analyzing the outcomes presented in Tables 2 and 3, we examine how the evaluation metrics change when the specified variants of HSCA are run. Table 2 reports the results of the ESCA variants alongside their corresponding hyperbolic versions. We treat the Airport dataset separately, since no ground-truth cluster labels are available for it. Furthermore, the Airport dataset is categorical and carries an inherent hierarchical structure. When a dataset contains such a hierarchy, the initial transformation of HSCA embeds points belonging to the same branch of the hierarchy close to one another, so the hyperbolic distance between points in the same branch becomes significantly smaller than the distance between points in different branches; this is precisely the situation in which Euclidean spaces perform poorly. Once the embedding is complete, the remainder of HSCA groups the points almost exactly according to their hierarchy.
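To fix ideas, the following minimal Python sketch illustrates the general pipeline described above: a hyperbolic (Poincaré) similarity matrix is built from an already-embedded dataset and handed to a standard spectral step. This is our own illustration under stated assumptions, not the authors' HSCA implementation; the helper names (poincare_distance_matrix, hyperbolic_gaussian_affinity) and the toy squashing of points into the unit ball are ours, while the distance and kernel formulas are those used in Section 5 and the Appendix ($\delta(x,y)=\frac{2\|x-y\|^{2}}{(1-\|x\|^{2})(1-\|y\|^{2})}$, $d(x,y)=2\sinh^{-1}\sqrt{\delta(x,y)/2}$, $K_{H_{G}}(x,y)=\exp(-a\,d(x,y)^{2})$).

import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_distance_matrix(X):
    # Pairwise Poincare-ball distances: d(x, y) = 2 * asinh(sqrt(delta(x, y) / 2)),
    # with delta(x, y) = 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)).
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    delta = 2.0 * sq_dists / np.outer(1.0 - sq_norms, 1.0 - sq_norms)
    return 2.0 * np.arcsinh(np.sqrt(delta / 2.0))

def hyperbolic_gaussian_affinity(X, a=1.0):
    # Hyperbolic Gaussian similarity K_{H_G}(x_i, x_j) = exp(-a * d(x_i, x_j)^2);
    # the hyperparameter a (roughly 1 / sigma^2) is tuned in Section 6.5.
    D = poincare_distance_matrix(X)
    return np.exp(-a * D ** 2)

# X is assumed to be a Poincare embedding of the data (every row has norm < 1);
# here we merely squash random points into the unit ball for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
X = X / (1.0 + np.linalg.norm(X, axis=1, keepdims=True))

W = hyperbolic_gaussian_affinity(X, a=1.0)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)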

6.4.1 Airport Dataset:


Upon scrutinizing the numerical outcomes alongside the visual representations of the clusters portrayed in the t-SNE plot in Fig. 4, a notable trend emerges within the Airport dataset: the Silhouette Score declines significantly for HSCA compared to ESCA. This is because both ESCA(G) and ESCA(P) yield one predominant cluster, whereas HSCA generates five distinct clusters. Since the Silhouette Score increases when the mean intra-cluster distance decreases and the mean distance to the nearest other cluster increases, the ESCA clusters are predisposed to higher Silhouette Scores relative to the HSCA clusters simply because HSCA forms a greater number of distinct clusters. Conversely, the Davies-Bouldin score compares the within-cluster scatter around each centroid with the distances between the centroids of distinct clusters. In contrast to the Silhouette Score, there is a notable disparity between the Euclidean and hyperbolic Davies-Bouldin Scores, resulting from the formation of more distinct clusters and from the use of the hyperbolic metric in computing the score. A desirable clustering is typically characterized by a higher Silhouette Score (closer to 1), a lower Davies-Bouldin Score (preferably below 1), and a high Calinski-Harabasz Score. Although HSCA appears to underperform ESCA on the first two metrics, the Calinski-Harabasz indices increase significantly for HSCA, again owing to the hyperbolic metric and the distributional behaviour of that index. Given that ground-truth labels are unavailable for this dataset and that it has an inherent hierarchical structure, we can infer that HSCA outperforms ESCA by yielding more distinct clusters.
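For reference, the three intrinsic scores discussed above can be computed along the following lines. This is a minimal sketch using scikit-learn rather than the authors' evaluation script; the argument hyperbolic_D is our own abstraction for a precomputed pairwise hyperbolic distance matrix (e.g., from the poincare_distance_matrix helper sketched earlier). Note that scikit-learn's Silhouette Score accepts a precomputed distance matrix, whereas its Davies-Bouldin and Calinski-Harabasz scores are computed from the feature matrix with the Euclidean metric.

import numpy as np
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

def intrinsic_scores(X, labels, hyperbolic_D=None):
    # X           : (n, d) feature / embedding matrix
    # labels      : (n,) predicted cluster labels
    # hyperbolic_D: optional precomputed (n, n) hyperbolic distance matrix;
    #               if given, the Silhouette Score is evaluated in that metric.
    if hyperbolic_D is not None:
        sil = silhouette_score(hyperbolic_D, labels, metric="precomputed")
    else:
        sil = silhouette_score(X, labels)          # Euclidean silhouette
    db = davies_bouldin_score(X, labels)           # lower is better (preferably < 1)
    ch = calinski_harabasz_score(X, labels)        # higher is better
    return sil, db, ch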

6.4.2 Other Real Datasets: Wisconsin, Glass & Zoo

Here we examine the effects of HSCA versus ESCA on the Wisconsin Breast Cancer, Glass, and Zoo datasets. In all three cases, the HSCA variants outperform their ESCA counterparts [see Table 3 and the supplementary document]. This is primarily attributed to the datasets possessing some form of hierarchical structure, so embedding them into Euclidean space leads to poor cluster formation. Notably, only the Approximated Spectral Clustering with k-means-based landmark selection (ELSC-K) exhibits superior performance compared to standard Euclidean Spectral Clustering. This can be attributed to the k-means-based landmark selection, which captures the inherent hierarchy to some extent and leads to improved spectral clusterings. However, the hyperbolic variants of ELSC-K still come out ahead, as hyperbolic space is better suited for embedding hierarchical data.
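To illustrate the landmark idea discussed above, here is a minimal sketch of approximated spectral clustering with k-means-based landmark selection. It follows the generic landmark recipe (k-means landmarks, point-to-landmark affinities, SVD of the normalized affinity, k-means on the spectral embedding) rather than the exact ELSC-K/HLSC-K pseudo-code given in the Supplementary; the distance callable dist_fn (Euclidean by default, or a hyperbolic distance as sketched earlier) is our own abstraction.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def landmark_spectral_clustering(X, n_clusters, n_landmarks=50,
                                 sigma=0.1, dist_fn=None, random_state=0):
    # Step 1: landmarks chosen as k-means centroids (the "k-means-based selection").
    km = KMeans(n_clusters=n_landmarks, n_init=10, random_state=random_state).fit(X)
    landmarks = km.cluster_centers_

    # Step 2: point-to-landmark affinities via a Gaussian kernel.
    if dist_fn is None:
        D = cdist(X, landmarks)          # Euclidean distances (ELSC-K-like variant)
    else:
        D = dist_fn(X, landmarks)        # e.g. hyperbolic distances (HLSC-K-like variant)
    Z = np.exp(-D ** 2 / (2.0 * sigma ** 2))

    # Step 3: normalize the affinity, embed via its SVD, and cluster the rows.
    Z = Z / (Z.sum(axis=1, keepdims=True) + 1e-12)      # rows sum to one
    col_scale = np.sqrt(Z.sum(axis=0)) + 1e-12
    Z_hat = Z / col_scale                                # degree-normalized columns
    U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)  # left singular vectors
    embedding = U[:, :n_clusters]

    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=random_state).fit_predict(embedding)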

6.4.3 Synthetic Datasets: 2d-20c-no0, st900 & D31

A similar clustering behaviour holds for these datasets: the Poincaré embedding ensures that points within the same hierarchy remain close. Here too, the variants of HSCA perform considerably better than the corresponding variants of ESCA, as expected. The ESCA and HSCA clusters formed on the st900 dataset are shown in Figure 6; for the other synthetic datasets, we refer to the supplementary.

6.5 Ablation Experiments

The kernel hyperparameter $\sigma$ requires tuning to achieve optimal performance. In our experiments, we set $\sigma$ to a small value (approximately 0.1), which we found to be optimal for most variants of the HSCA. We observed minimal fluctuation in the ARI and NMI values when the hyperparameter values ($\frac{1}{\sigma^{2}}$ for the Gaussian kernel and $\frac{1}{2\sigma}$ for the Poisson kernel) are very small (less than 0.1); in this regime the values exhibit only bounded variation within a small range. Our approach for this analysis follows a methodology similar to that presented in [40]. For detailed graphs illustrating the tuning process, please refer to Figures 7 and 8.
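The sweep itself can be reproduced along the following lines. This is a minimal sketch, not the exact experimental script: D stands for a precomputed pairwise hyperbolic distance matrix (e.g., from the earlier poincare_distance_matrix helper), y_true for the ground-truth labels, and only the Gaussian-kernel hyperparameter $\frac{1}{\sigma^{2}}$ is swept.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def sweep_sigma(D, y_true, n_clusters, sigmas=(0.05, 0.1, 0.2, 0.5, 1.0)):
    # D: precomputed pairwise hyperbolic distance matrix; y_true: ground-truth labels.
    results = []
    for sigma in sigmas:
        a = 1.0 / sigma ** 2                     # swept hyperparameter 1/sigma^2 (Gaussian kernel)
        W = np.exp(-a * D ** 2)                  # hyperbolic Gaussian affinity
        labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                    random_state=0).fit_predict(W)
        results.append((sigma,
                        adjusted_rand_score(y_true, labels),
                        normalized_mutual_info_score(y_true, labels)))
    return results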

The clusters related to other datasets can be found in the Experiments and Results section of the supplementary document.

Figure 7: Visualization of ARI Values with respect to the hyperparameter $\frac{1}{\sigma^{2}}$ (for Gaussian) and $\frac{1}{2\sigma}$ (for Poisson) [panels: (a) WBCD, (b) Glass, (c) Zoo, (d) 2d-20c-no0, (e) st900, (f) D31].
Figure 8: Visualization of NMI Values with respect to the hyperparameter $\frac{1}{\sigma^{2}}$ (for Gaussian) and $\frac{1}{2\sigma}$ (for Poisson) [panels: (a) WBCD, (b) Glass, (c) Zoo, (d) 2d-20c-no0, (e) st900, (f) D31].

7 Conclusion

In this paper, we highlight the limitations of Euclidean space in conveying meaningful information and representing arbitrary tree- or graph-like structures. We emphasize the inability of low-dimensional Euclidean spaces to embed such structures while preserving the associated metric. Conversely, Hyperbolic Spaces offer a compelling solution, even in low dimensions, for representing such data efficiently and yield superior clustering results compared to Euclidean Spaces. Our primary contributions include proposing a spectral clustering algorithm tailored for hyperbolic spaces, in which a hyperbolic similarity matrix replaces the traditional Euclidean one. Additionally, we provide a theoretical analysis of the weak consistency of the algorithm, showing convergence at least as fast as spectral clustering on Euclidean spaces. We also introduce hyperbolic versions of well-known Euclidean spectral clustering variants, such as Fast Spectral Clustering with approximate eigenvectors (FESC) and Approximated Spectral Clustering with k-means-based Landmark Selection (ELSC-K).

Our approach of clustering data in hyperbolic space via a spectral decomposition of the modified similarity matrix provides a more effective way of clustering complex hierarchical data than its Euclidean counterpart. Notably, the algorithm converges at least as rapidly as Spectral Clustering on Euclidean Spaces without introducing additional computational complexity. Moreover, the variants of Hyperbolic Spectral Clustering exhibit superior performance compared to their Euclidean counterparts. We anticipate that this method will spur significant advancements in non-deep machine learning algorithms.

Availability of Datasets and Codes:

References

8 Appendix

9 Proofs of Lemmas and Theorem from Section 5

9.1 Proof of Lemma 5.1

Proof.

Step 1: We have $\|x\|,\|y\|<1$, hence $\|x\|^{2},\|y\|^{2}<1$, which implies $(1-\|x\|^{2})(1-\|y\|^{2})\leq 1$. Hence we can write

\[
\frac{\|x-y\|^{2}}{(1-\|x\|^{2})(1-\|y\|^{2})}\geq\|x-y\|^{2}.
\]

But we also have

\[
\delta(x,y)=\frac{2\|x-y\|^{2}}{(1-\|x\|^{2})(1-\|y\|^{2})}.
\]

Combining this with the last inequality, we get

\[
\delta(x,y)\geq 2\|x-y\|^{2}.
\]

Step 2: For $x\in\mathbb{R}$, $\frac{d}{dx}\left(\sinh^{-1}(x)\right)=\frac{1}{\sqrt{1+x^{2}}}>0$, so the inverse hyperbolic sine is strictly increasing. By Step 1, $\frac{\delta(x,y)}{2}\geq\|x-y\|^{2}$, which also implies $\sqrt{\frac{\delta(x,y)}{2}}\geq\|x-y\|$.
Since $d(x,y)=2\sinh^{-1}\left(\sqrt{\frac{\delta(x,y)}{2}}\right)$ and the inverse hyperbolic sine is increasing, we can write

\[
d(x,y)\geq 2\sinh^{-1}(\|x-y\|).
\]

Since $\exp(-s)>\exp(-t)$ for $0<s<t$, i.e., $t\mapsto\exp(-t)$ is decreasing, this enables us to write

\begin{align*}
K_{H_{G}}(x,y)&=\exp\left(-a\,d(x,y)^{2}\right)\\
&\leq\exp\left(-4a\left[\sinh^{-1}(\|x-y\|)\right]^{2}\right).
\end{align*}

Step 3: Let $f(x):=\sinh^{-1}(x)-\frac{x}{2}$. Then $f$ is differentiable with $f^{\prime}(x)=\frac{1}{\sqrt{1+x^{2}}}-\frac{1}{2}$, and since $\frac{1}{\sqrt{1+x^{2}}}\geq\frac{1}{\sqrt{2}}$ for $0\leq x\leq 1$, we get $f^{\prime}(x)\geq\frac{2-\sqrt{2}}{2\sqrt{2}}>0$ there. Therefore $f$ is increasing on $[0,1]$ and, since $f(0)=0$, we have $\sinh^{-1}(x)\geq\frac{x}{2}$ for $0\leq x\leq 1$. [In fact, $f^{\prime}(x)\geq 0$ for $0\leq x\leq\sqrt{3}$, while $f$ is decreasing on $[\sqrt{3},2]$ with $f(2)=\sinh^{-1}(2)-1>0$, so $\sinh^{-1}(x)\geq\frac{x}{2}$ holds for every $0\leq x\leq 2$, which covers all possible values of $\|x-y\|$ in the unit ball.] Hence $\left[\sinh^{-1}(\|x-y\|)\right]^{2}\geq\frac{\|x-y\|^{2}}{4}$. Therefore, following Step 2, we get

\begin{align*}
K_{H_{G}}(x,y)&\leq\exp\left(-4a\left[\sinh^{-1}(\|x-y\|)\right]^{2}\right)\\
&\leq\exp\left(-4a\,\frac{\|x-y\|^{2}}{4}\right)\\
&=\exp\left(-a\|x-y\|^{2}\right)=K(x,y),
\end{align*}

which completes the proof. ∎
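As a quick numerical sanity check of the inequality just proved (our own illustration, not part of the original argument), one can sample random pairs in the Poincaré ball and verify that $K_{H_{G}}(x,y)\leq K(x,y)$:

import numpy as np

def poincare_distance(x, y):
    # d(x, y) = 2 * asinh(sqrt(delta / 2)), delta = 2 * ||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)).
    delta = 2.0 * np.sum((x - y) ** 2) / ((1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2)))
    return 2.0 * np.arcsinh(np.sqrt(delta / 2.0))

rng = np.random.default_rng(1)
a = 0.5                                                   # any a > 0
for _ in range(10_000):
    x, y = rng.uniform(-1, 1, size=(2, 3))
    x = 0.99 * x / max(1.0, np.linalg.norm(x))            # keep both points strictly inside the ball
    y = 0.99 * y / max(1.0, np.linalg.norm(y))
    K_hyp = np.exp(-a * poincare_distance(x, y) ** 2)     # K_{H_G}(x, y)
    K_euc = np.exp(-a * np.linalg.norm(x - y) ** 2)       # K(x, y)
    assert K_hyp <= K_euc + 1e-12                         # Lemma 5.1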

9.2 Proof of Lemma 5.2

Proof.

By Lemma 5.1 and Remark 5.2, $K_{H_{G}}(x)=K_{H}(x,0)\leq K(x,0)=\exp(-a\|x\|^{2})$. Therefore, following Step 3 of Lemma 5.1, we write

\begin{align*}
\int_{H}|K_{H_{G}}(x)|\,dx&\leq\int_{H}\left|\exp(-a\|x\|^{2})\right|dx\\
&=\int_{H}\exp(-a\|x\|^{2})\,dx\\
&\leq\int_{\mathbb{R}^{n}}\exp(-a\|x\|^{2})\,dx\\
&<\infty,
\end{align*}

where the second-to-last inequality holds because $H$ is a subset of the unit ball in $\mathbb{R}^{n}$, and the last integral is finite since the integrand is, up to a normalizing constant, the usual Gaussian density. ∎
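For completeness, the last integral is the standard Gaussian integral, whose explicit value makes the finiteness immediate:

\[
\int_{\mathbb{R}^{n}}\exp(-a\|x\|^{2})\,dx=\left(\int_{-\infty}^{\infty}e^{-at^{2}}\,dt\right)^{n}=\left(\frac{\pi}{a}\right)^{n/2}<\infty\qquad(a>0).
\]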

9.3 Proof of Lemma 5.3

Proof.

$f$ is radial if and only if $f(Mx)=f(x)$ for every $M\in SO_{n}(\mathbb{R}^{n})$ [where $SO_{n}(\mathbb{R}^{n})$ denotes the special orthogonal group on $\mathbb{R}^{n}$, i.e., the group of all $n\times n$ real matrices $M$ with $M^{*}M=I_{n}$ and determinant $1$; the map $x\mapsto Mx$ only rotates $x$ and does not change its magnitude, i.e., $\|Mx\|=\|x\|$]. Then for any arbitrary $M\in SO_{n}(\mathbb{R}^{n})$,

\begin{align*}
\widehat{f}(Mt)&=\int_{\Omega}f(x)\,e^{-i\langle Mt,x\rangle}\,dx\\
&=\int_{M^{-1}(\Omega)}f(Ms)\,e^{-i\langle Mt,Ms\rangle}\,ds\qquad[\text{change of variable }x=Ms,\ |\det M|=1]\\
&=\int_{\Omega}f(s)\,e^{-i\langle t,s\rangle}\,ds\qquad[\text{since }f\text{ is radial and }\Omega\text{ is symmetric, so }M^{-1}(\Omega)=\Omega]\\
&=\widehat{f}(t),
\end{align*}

where the last-but-one equality also uses $\langle Mt,Ms\rangle=\langle M^{*}Mt,s\rangle=\langle t,s\rangle$, which holds since $M^{*}M=I_{n}$ for $M\in SO_{n}(\mathbb{R}^{n})$. ∎

9.4 Proof of Lemma 5.4

Proof.

Let $f(x)=K_{H_{G}}(x)=\exp\left(-a\,d(x,0)^{2}\right)$. Then, by Lemma 5.1, we have $f(x)\leq\exp(-a\|x\|^{2})$ for all $x\in H$. Using the fact that $\widehat{K}$ is radial (and hence real-valued), we get

\begin{align*}
|\widehat{K}(w)|&=\left|\int_{H}f(x)\,e^{-iw^{t}x}\,dx\right|\\
&=\left|\int_{H}f(x)\,e^{a\|x\|^{2}}\,e^{-a\|x\|^{2}}\,e^{-iw^{t}x}\,dx\right|\\
&\leq\int_{H}\left|f(x)\,e^{a\|x\|^{2}}\,e^{-a\|x\|^{2}}\,e^{-iw^{t}x}\right|dx\\
&\leq\int_{H}\left|e^{-a\|x\|^{2}}\,e^{-iw^{t}x}\right|dx\qquad[\text{this is just the Fourier transform of the Euclidean Gaussian kernel over }H]\\
&\leq\int_{\mathbb{R}^{n}}\left|e^{-a\|x\|^{2}}\,e^{-iw^{t}x}\right|dx\\
&\leq C^{\prime}e^{-p\|w\|^{2}}\\
&\leq C\exp(-l\|w\|),
\end{align*}

where CC^{\prime} and CC are some appropriately chosen constants. ∎
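The Gaussian-decay bound invoked above is the classical Fourier transform of the Euclidean Gaussian kernel, recalled here for convenience (a standard identity, not a new result):

\[
\int_{\mathbb{R}^{n}}e^{-a\|x\|^{2}}\,e^{-iw^{t}x}\,dx=\left(\frac{\pi}{a}\right)^{n/2}e^{-\frac{\|w\|^{2}}{4a}},
\]

so that one may take $C^{\prime}=\left(\frac{\pi}{a}\right)^{n/2}$ and $p=\frac{1}{4a}$; the final bound $C\exp(-l\|w\|)$ then follows since $e^{-p\|w\|^{2}}\leq e^{-l\|w\|}$ whenever $\|w\|\geq l/p$, with the remaining range $\|w\|<l/p$ absorbed into the constant $C$.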

9.5 Proof of Theorem 5.1

Proof.

Combining Lemma 5.4 and Theorem 3 of [32], we get

\[
\log\left(\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})\right)\leq C_{0}\left(\log\frac{1}{\epsilon}\right)^{d+1},
\]

for some appropriately chosen constant $C_{0}$, where $d$ is the dimension of $H$. Since $d$ is a constant for $H$, we can write the above inequality as

\[
\log\left(\mathcal{N}(\mathcal{F},\epsilon,\|\cdot\|_{\infty})\right)\leq C_{1}\left(\log\frac{1}{\epsilon}\right)^{2}.
\]

Following the same sequence of computations as in Theorem 19 of [33], we get

\[
\int_{0}^{\infty}\sqrt{\log\mathcal{N}(\mathcal{F},\epsilon,L^{2}(P_{n}))}\,d\epsilon<\infty.
\]

Hence, following Theorem 19 of [33], we write

\begin{align*}
\sup_{f\in\mathcal{F}}|P_{n}f-Pf|&\leq\frac{c}{\sqrt{n}}\int_{0}^{\infty}\sqrt{\log\mathcal{N}(\mathcal{F},\epsilon,L^{2}(P_{n}))}\,d\epsilon+\sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}\\
&<\frac{C_{1}}{\sqrt{n}}+\sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)},
\end{align*}

for some appropriately chosen constant $C_{1}$. Since $\delta>0$ is fixed, we get

\[
\sup_{f\in\mathcal{F}}|P_{n}f-Pf|\leq C\left(\frac{1}{\sqrt{n}}\right).
\]

Finally, combining Theorem 16 of [33] with the last inequality, we get

\[
\sup_{f\in\mathcal{F}}|P_{n}f-Pf|=\mathcal{O}\left(\frac{1}{\sqrt{n}}\right).
\]
∎
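To make the finiteness of the entropy integral explicit, here is a short supporting computation of ours (not part of the original argument). It assumes that the $L^{2}(P_{n})$ covering numbers are dominated by the sup-norm ones (since $\|f\|_{L^{2}(P_{n})}\leq\|f\|_{\infty}$) and that a single ball covers $\mathcal{F}$ once $\epsilon$ exceeds its diameter (taken here to be at most $1$, as for kernel evaluations bounded by $1$), so the integrand vanishes for $\epsilon\geq 1$:

\[
\int_{0}^{\infty}\sqrt{\log\mathcal{N}(\mathcal{F},\epsilon,L^{2}(P_{n}))}\,d\epsilon\;\leq\;\int_{0}^{1}\sqrt{C_{1}\left(\log\frac{1}{\epsilon}\right)^{2}}\,d\epsilon\;=\;\sqrt{C_{1}}\int_{0}^{1}\log\frac{1}{\epsilon}\,d\epsilon\;=\;\sqrt{C_{1}}\,\Big[\epsilon-\epsilon\log\epsilon\Big]_{0}^{1}\;=\;\sqrt{C_{1}}\;<\;\infty.
\]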

System Specification:

The experiments were carried out on a personal computer with a 12th Gen Intel(R) Core(TM) i5-1230U 1.00 GHz processor, 16 GB RAM, Windows 11 Home 22H2, and Python 3.11.5.