
Deep Contrastive Graph Learning with Clustering-Oriented Guidance

Mulin Chen1,2, Bocheng Wang1,2, Xuelong Li1,2 (corresponding author)
Abstract

Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand on which GCN can be applied. Throughout the literature, we observe that 1) most models focus on the initial graph while neglecting the original features, so the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which combines an auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms.

Introduction

Graph-based clustering achieves dominant performance due to its capability of capturing the data manifold. It uses an undirected graph to learn the pair-wise relations between samples, and infers the cluster labels from the adjacency structure. Recently, neural networks have been widely used in data clustering (Xie, Girshick, and Farhadi 2016; Shaham et al. 2018; Bo et al. 2020; Ren et al. 2022). Among a variety of network structures, Graph Convolutional Network (GCN) (Kipf and Welling 2016a) has shown a prominent ability to explore the relationships among graph-type data, and has realized good results in graph-based clustering (Zhang et al. 2022).

GCN-based clustering methods learn the data representation from the input graph and data matrix. For general clustering tasks, there is no prior graph that records the relationships of samples. Hence, an initial graph needs to be constructed as the input of GCN. Although this approach has demonstrated promising results, it gives rise to some issues. Firstly, as observed in the Graph Auto-Encoder (GAE) framework (Li, Zhang, and Zhang 2021; Zhao et al. 2022) and a series of graph-structured data clustering models (Wang et al. 2019; Tu et al. 2021; Liu et al. 2022c), the training procedure relies heavily on the input graph, and pays insufficient attention to the original features. Therefore, if the initial graph contains many connections between samples from different classes, the embedding obtained by multilayer aggregation becomes ambiguous, leading to a loss of discriminative capacity. This phenomenon is referred to as representation collapse. Secondly, GCN-based models emphasize the representational ability of the learned graph embedding, but lack clustering-oriented guidance, such as the distribution of centroids. As a result, the output of GCN may contain clustering-irrelevant information, while the cluster structure, which is essential for data clustering, is not fully captured.

Figure 1: Overview of DCGL. Note that the decoder for $\mathbf{H}^{v_{2}}$ is omitted in this pipeline. First, we construct an initial graph to enable GCN. Second, we generate two latent representations by the pseudo-siamese network. Third, we derive two adjacency graphs from different perspectives, and then compute their cluster-level graph embeddings. Afterward, the framework is updated with the joint loss shown in Eq. (15). The final result is obtained by performing spectral clustering on the converged $\mathbf{S}^{L}$.

To address these issues, we propose the Deep Contrastive Graph Learning (DCGL) model for the general clustering task without a prior graph. As shown in Fig. 1, DCGL consists of a pseudo-siamese network, wherein one branch handles both the data matrix and the input graph with GCN, while the other branch encodes the data matrix with an encoder. During this stage, feature-level contrastive learning models the distances between the estimated centroids and samples, such that the discriminative characteristics are explored. Then, the outputs are employed to construct the Local Propinquity Graph (LPG) and Global Diffusion Graph (GDG) with different neighborhood sizes respectively. After that, the graphs are mapped into the cluster space, where cluster-level contrastive learning is used to push away the different centroids. In this way, the cluster structure is better maintained. The main contributions of this paper are summarized as follows:

  • A deep unsupervised graph learning model is established to cluster general data, which uses a pseudo-siamese network to analyze structural dependencies and node attributes concurrently. By considering the original features explicitly and updating the learned graph adaptively, the reliance on the initial graph is alleviated.

  • Two clustering-oriented contrastive learning strategies are devised to provide training guidance. Feature-level contrastive learning aims to preserve the discriminability of each node, and cluster-level contrastive learning promotes a clear distribution of centroids.

  • The graph learning procedure is divided into two branches to capture the data manifold from local and global perspectives respectively. Both the local and global graphs are mapped into the cluster space to search for a unified cluster structure.

Related Work

Graph Structure Learning

Graph learning is an important topic in many practical applications (Liu et al. 2023; Li 2022a, b). $K$-Nearest Neighbor (KNN) is the most traditional graph learning approach. This strategy builds the adjacency relation based on the sample distances, and then calculates the edge weights with a specific formula (Cai et al. 2010; Li et al. 2017a, b). The graph from simple KNN preserves limited manifold information, so more advanced graph learning schemes have been proposed, such as adaptive neighbor learning (Nie, Wang, and Huang 2014; Kang et al. 2019), self-expression (Elhamifar and Vidal 2013; Kang, Peng, and Cheng 2017), and consistency propagation (Huang et al. 2022; Tan et al. 2023).

In recent years, Graph Neural Networks (GNNs) have demonstrated reliable competence in deep graph structure learning. GLCN (Jiang et al. 2019) adopts a unified framework that performs graph learning and graph convolution jointly. IDGL (Chen, Wu, and Zaki 2020) leverages graph regularization terms to enhance the learning quality. Pro-GNN (Jin et al. 2020) employs alternating optimization to realize the mutual regularization of graph learning and network updating. GRCN (Yu et al. 2020) predicts edges and regulates weights adaptively through residual links. VIB-GSL (Sun et al. 2022) applies the variational information bottleneck theory to construct an adjacency graph. The above influential models have shown impressive accuracy in experiments. However, most of the deep models are tailored to supervised scenarios, usually node classification. SUBLIME (Liu et al. 2022d) is described as the first attempt at unsupervised deep graph learning. In this paper, we develop an unsupervised procedure to support graph-based clustering.

GNN-Based Data Clustering

To date, a series of GNN-based models have achieved excellent performance in graph-structured data clustering (Liu et al. 2022b). DAEGC (Wang et al. 2019) uses a new GAE based on the graph attention network to learn pivotal node features. SDCN (Bo et al. 2020) performs GCN and auto-encoder learning simultaneously for better cluster learning. DFCN (Tu et al. 2021) applies a representation fusion module to integrate the topology structure and node attributes. DCRN (Liu et al. 2022c) utilizes graph augmentation techniques to explore different levels of latent characteristics, and then merges them into a consensus cluster representation.

In the above methods, the final cluster embedding relies strongly on the input graph, which means that the initial graph has a decisive influence on clustering performance. Therefore, these models are not appropriate for the general clustering scenario without a reliable graph structure. More recently, some modified GAEs (Li, Zhang, and Zhang 2021; Zhang et al. 2022) have been established to handle general data, where the adjacency relation is updated automatically. However, GAE fails to preserve the discriminative characteristics within the original features. Moreover, most models lack clustering-oriented training guidance, and the learned graph may be unsuitable for clustering.

Contrastive Learning

As an effective tool in visual processing (Chen et al. 2020), contrastive learning has been successfully applied to many GNN-related tasks, such as graph representation learning (Hassani and Khasahmadi 2020; Zhu et al. 2021; Mo et al. 2022) and graph-structured data clustering (Zhao et al. 2021; Liu et al. 2022a; Yang et al. 2023). In practice, contrastive learning is usually combined with graph augmentation, which derives multiple data views at the node and topology levels. Contrastive learning then maximizes the mutual information between different views to generate the final node representation, pulling correlated semantic contents closer while pushing outlying features away. In this paper, we introduce clustering-oriented guidance into the contrastive learning strategies, enhancing the discriminability between samples and pursuing a distinguishable cluster structure.

Deep Contrastive Graph Learning

In this section, the proposed model is introduced. The pipeline of DCGL is illustrated in Fig. 1. It uses a pseudo-siamese network to construct the local and global graphs, and maps them into the cluster space by siamese graph convolution.

Notations: in this paper, vectors and matrices are written in lowercase and uppercase letters respectively. For a data matrix $\mathbf{X}$, both $\mathbf{x}_{i}$ and $\mathbf{X}_{i}$ represent the $i$-th sample. $||\mathbf{x}_{i}||_{1}$, $||\mathbf{x}_{i}||_{2}$, and $||\mathbf{X}||_{\mathrm{F}}$ denote the $\ell_{1}$, $\ell_{2}$, and Frobenius norms respectively.

Preliminary Work

Given a dataset $\mathbf{X}=[\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\cdots,\mathbf{x}_{n}]\in\mathbb{R}^{n\times m}$ with $c$ categories, it is necessary to construct an initial graph as the input of GCN.

The initial graph indicates the relationship between two samples by the corresponding weight. Intuitively, a small Euclidean distance reflects a large weight. Therefore, the objective is expressed as

$$\min_{\mathbf{A}}\ \sum_{i,j}^{n}\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}\,a_{ij}+\gamma\|\mathbf{A}\|_{\mathrm{F}}^{2},\quad \mathrm{s.t.}\ \|\mathbf{a}_{i}\|_{1}=1,\ 0\leq\mathbf{a}_{i}\leq 1, \tag{1}$$

where $\gamma$ is the trade-off parameter, $\mathbf{A}\in\mathbb{R}^{n\times n}$ is the desired graph, and $\mathbf{a}_{i}$ refers to the $i$-th column of $\mathbf{A}$. The second term aims to avoid the trivial solution $\mathbf{A}=\mathbf{I}$.

Problem (1) can be decoupled into each column of $\mathbf{A}$ separately. Denoting by $\mathbf{d}_{i}\in\mathbb{R}^{n\times 1}$ the column vector with $d_{ij}=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}$, the sub-problem for $\mathbf{a}_{i}$ can be written as

$$\min_{\|\mathbf{a}_{i}\|_{1}=1,\ 0\leq\mathbf{a}_{i}\leq 1}\left\|\mathbf{a}_{i}+\frac{1}{2\gamma}\mathbf{d}_{i}\right\|_{2}^{2}. \tag{2}$$

Define $\mathbf{w}_{i}$ as the result of sorting $\mathbf{d}_{i}$ in ascending order. Stipulating that only the weights of $k$ neighbors are updated, problem (2) has the closed-form solution

$$a_{ij}=\frac{(w_{i(k+1)}-w_{ij})_{+}}{k\,w_{i(k+1)}-\sum_{u}^{k}w_{iu}}, \tag{3}$$

where $(\cdot)_{+}$ indicates $\max(\cdot,0)$. The final graph is obtained by $(\mathbf{A}+\mathbf{A}^{\mathrm{T}})/2$, and the mathematical deduction can be found in (Nie, Wang, and Huang 2014).

With the scheme of Eq. (1), each sample connects with its neighbors adaptively. Moreover, the closed-form solution avoids the selection of the parameter $\gamma$, so only the neighbor number $k$ needs to be tuned.
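For illustration, a minimal NumPy sketch of this adaptive-neighbor construction is given below; the function name and the handling of self-connections are our own assumptions rather than part of the original formulation.

```python
# Minimal sketch of the adaptive-neighbor graph in Eqs. (1)-(3).
import numpy as np

def adaptive_neighbor_graph(X, k):
    """Build an n x n graph where each sample connects to its k nearest
    neighbors with weights from the closed-form solution in Eq. (3)."""
    n = X.shape[0]
    # squared Euclidean distances d_ij = ||x_i - x_j||_2^2
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, np.inf)            # exclude self-connections (assumption)

    A = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])             # ascending distances, i.e. w_i
        w = D[i, idx]
        # closed-form weights of the k nearest neighbors, Eq. (3)
        denom = k * w[k] - np.sum(w[:k])
        A[i, idx[:k]] = np.maximum(w[k] - w[:k], 0) / max(denom, 1e-12)
    return (A + A.T) / 2                   # symmetrize as in the paper
```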

Pseudo-Siamese Network

With the previous stage, the input for GCN is obtained, including the data matrix $\mathbf{X}$ and the initial graph $\mathbf{A}$. GCN mainly attends to the data relationships, but neglects the samples' original attributes, and thus faces the risk of representation collapse. To preserve the original features, we place an auto-encoder in parallel with GCN, leading to a pseudo-siamese network, which is formulated as

$$\mathbf{H}^{v_{1}}=\mathrm{GCN}_{1}(\mathbf{X},\widehat{\mathbf{A}}),\quad \mathbf{H}^{v_{2}}=\mathrm{Encoder}(\mathbf{X}),\quad \widehat{\mathbf{X}}=\mathrm{Decoder}(\mathbf{H}^{v_{2}}), \tag{4}$$

where $\widehat{\mathbf{A}}=\widetilde{\mathbf{D}}^{-\frac{1}{2}}\widetilde{\mathbf{A}}\widetilde{\mathbf{D}}^{-\frac{1}{2}}$ is the normalized graph, $\widetilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$, $\widetilde{\mathbf{D}}\in\mathbb{R}^{n\times n}$ is the diagonal degree matrix with $\widetilde{\mathbf{D}}_{ii}=\sum_{j}^{n}\widetilde{\mathbf{A}}_{ij}$, and $\mathbf{H}^{v_{1}}\in\mathbb{R}^{n\times l}$ and $\mathbf{H}^{v_{2}}\in\mathbb{R}^{n\times l}$ capture the structure and attributes respectively. The classical mean-square error is used to train the auto-encoder

$${\cal L}_{AE}=\frac{1}{2n}\sum_{i}^{n}\left\|\mathbf{x}_{i}-\widehat{\mathbf{x}}_{i}\right\|_{2}^{2}. \tag{5}$$
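A minimal PyTorch sketch of the pseudo-siamese forward pass and the reconstruction loss in Eqs. (4)-(5) is given below; the layer sizes, depth, and activations are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch of the pseudo-siamese network: one GCN branch plus an auto-encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X, A_hat):
        # A_hat is the normalized graph D^{-1/2}(A + I)D^{-1/2}
        return F.relu(A_hat @ self.lin(X))

class PseudoSiamese(nn.Module):
    def __init__(self, m, l, hidden=512):          # m: input dim, l: latent dim
        super().__init__()
        self.gcn1 = GCNLayer(m, l)                  # structural branch, Eq. (4)
        self.encoder = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, l))
        self.decoder = nn.Sequential(nn.Linear(l, hidden), nn.ReLU(), nn.Linear(hidden, m))

    def forward(self, X, A_hat):
        H_v1 = self.gcn1(X, A_hat)                  # structure-aware embedding
        H_v2 = self.encoder(X)                      # attribute-only embedding
        X_rec = self.decoder(H_v2)
        # Eq. (5): (1/2n) * sum_i ||x_i - x_hat_i||_2^2
        loss_ae = 0.5 * ((X_rec - X) ** 2).sum(dim=1).mean()
        return H_v1, H_v2, loss_ae
```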

The above formulation focuses on data representation, and the training procedure lacks clustering-oriented guidance. Therefore, we design the following feature-level contrastive learning to improve the representations' adaptability to clustering tasks.

Feature-Level Contrastive Learning.

In contrastive learning, the selection of positive and negative samples directly determines the final effect. For an anchor $\mathbf{H}_{i}^{v_{1}}$, the universal consensus (Zhu et al. 2021; Liu et al. 2022d) is to treat its embedding in the other branch ($\mathbf{H}_{i}^{v_{2}}$) as the positive sample, and all the rest as negative samples, which may misjudge many samples of the same category as negatives. In contrast, we employ a rough clustering result to guide the construction of sample pairs. For each anchor, the centroids of the other clusters are taken as negatives.

Specifically, $k$-means is executed on $\mathbf{H}^{v_{2}}$ to acquire $c$ centroids. Given any node $\mathbf{H}_{i}^{v_{1}}$, the centroids of the other clusters constitute the negative sample set $\mathbf{P}^{(i)}\in\mathbb{R}^{(c-1)\times l}$. On this basis, the InfoNCE loss (Oord, Li, and Vinyals 2018) is employed for feature-level contrastive learning

$${\cal L}_{FL}=\frac{1}{n}\sum_{i}^{n}\left(-\log\frac{e^{\theta(\mathbf{H}_{i}^{v_{1}},\mathbf{H}_{i}^{v_{2}})/\tau}}{e^{\theta(\mathbf{H}_{i}^{v_{1}},\mathbf{H}_{i}^{v_{2}})/\tau}+\sum_{j}^{c-1}e^{\theta(\mathbf{H}_{i}^{v_{1}},\mathbf{P}_{j}^{(i)})/\tau}}\right), \tag{6}$$

where $\theta(\cdot,\cdot)$ computes the cosine similarity, and $\tau$ is the temperature parameter.

Eq. (6) pushes samples away from incorrect centroids, ensuring the discriminability of the learned representation.
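The following PyTorch sketch shows one way to realize Eq. (6); the rough cluster assignments and centroids are assumed to come from $k$-means on $\mathbf{H}^{v_{2}}$ as described above, and all names are illustrative.

```python
# Sketch of the feature-level contrastive loss in Eq. (6).
import torch
import torch.nn.functional as F

def feature_level_loss(H_v1, H_v2, centroids, labels, tau=0.5):
    """H_v1, H_v2: (n, l) embeddings; centroids: (c, l) k-means centroids of
    H_v2; labels: (n,) rough cluster assignment of each sample (long tensor)."""
    z1 = F.normalize(H_v1, dim=1)
    z2 = F.normalize(H_v2, dim=1)
    cen = F.normalize(centroids, dim=1)

    pos = torch.exp((z1 * z2).sum(dim=1) / tau)        # exp(theta(H_i^{v1}, H_i^{v2}) / tau)
    sim_cen = torch.exp(z1 @ cen.T / tau)               # similarity to every centroid
    # negative set P^(i): centroids of the clusters the sample does NOT belong to
    mask = F.one_hot(labels, num_classes=centroids.shape[0]).bool()
    neg = sim_cen.masked_fill(mask, 0.0).sum(dim=1)
    return -torch.log(pos / (pos + neg)).mean()
```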

Local Propinquity Graph Learning.

Replacing the $\mathbf{x}_{i}$ in Eq. (1) with the learned $\mathbf{H}_{i}^{v_{1}}$, a new graph $\mathbf{S}^{L}$ is derived. This graph learning problem can be converted to the trace form

$${\cal L}_{GL}=\mathrm{Tr}\left((\mathbf{H}^{v_{1}})^{\mathrm{T}}\mathbf{L}_{\mathbf{S}}\mathbf{H}^{v_{1}}\right)+\frac{\gamma}{2}\mathrm{Tr}\left(\mathbf{S}^{L}(\mathbf{S}^{L})^{\mathrm{T}}\right), \tag{7}$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a square matrix, and $\mathbf{L}_{\mathbf{S}}\in\mathbb{R}^{n\times n}$ is the Laplacian matrix of $\mathbf{S}^{L}$. The above formulation infers the manifold structure based on the pair-wise similarity, so we term $\mathbf{S}^{L}$ the Local Propinquity Graph (LPG). With a small value of $k$, only very close samples are connected, so that the final graph contains fewer incorrect connections. To detect more potential connections, we adopt a staged growth of the neighbor number, which means increasing $k$ by a fixed increment after a constant interval.
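A minimal PyTorch sketch of the LPG loss in Eq. (7) follows; it assumes $\mathbf{S}^{L}$ has already been built from $\mathbf{H}^{v_{1}}$ with the closed-form solver sketched earlier (e.g. on detached embeddings) and is held fixed within the current update.

```python
# Sketch of the LPG loss in Eq. (7).
import torch

def lpg_loss(H_v1, S_L, gamma):
    """H_v1: (n, l) structural embeddings; S_L: (n, n) local propinquity graph."""
    D = torch.diag(S_L.sum(dim=1))                   # degree matrix of S^L
    L_S = D - S_L                                    # graph Laplacian
    smooth = torch.trace(H_v1.T @ L_S @ H_v1)        # manifold smoothness term
    reg = 0.5 * gamma * torch.trace(S_L @ S_L.T)     # Frobenius regularizer
    return smooth + reg
```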

Siamese Graph Convolution

The optimization of LPG focuses on the local dependencies between nodes. Considering that some samples of the same class may not be neighbors in Euclidean space, we propose to learn the global similarity with an extra graph, and fuse both the local and global graphs to achieve an agreement on the cluster structure.

Global Diffusion Graph Learning.

Contrary to LPG, the new graph encodes the global topological relationship of samples, which requires more links between nodes. Firstly, the structural and attribute representations are combined to generate a comprehensive representation

$$\mathbf{H}=\frac{1}{2}(\mathbf{H}^{v_{1}}+\mathbf{H}^{v_{2}}). \tag{8}$$

Then, the integrated $\mathbf{H}$ is used as the data matrix to construct a graph $\mathbf{G}$ with Eq. (1), which further exploits the original features. The neighbor number for $\mathbf{G}$ is fixed to $\lfloor\frac{n}{c}\rfloor$, the upper bound of the learning procedure of LPG, where $\lfloor\cdot\rfloor$ denotes rounding down.

Finally, Personalized PageRank graph diffusion (Hassani and Khasahmadi 2020) is introduced to propagate the topological similarity across nodes, so as to further capture the global structure. The mathematical expression is

$$\mathbf{S}^{G}=\sum_{i=0}^{\infty}\lambda(1-\lambda)^{i}\left(\mathbf{D}_{\mathbf{G}}^{-\frac{1}{2}}\mathbf{G}\mathbf{D}_{\mathbf{G}}^{-\frac{1}{2}}\right)^{i}, \tag{9}$$

where $\lambda$ is the transport rate, and $\mathbf{D}_{\mathbf{G}}\in\mathbb{R}^{n\times n}$ is the degree matrix of $\mathbf{G}$. The corresponding closed-form solution is

$$\mathbf{S}^{G}=\lambda\left[\mathbf{I}_{n}-(1-\lambda)\mathbf{D}_{\mathbf{G}}^{-\frac{1}{2}}\mathbf{G}\mathbf{D}_{\mathbf{G}}^{-\frac{1}{2}}\right]^{-1}, \tag{10}$$

where $\mathbf{I}_{n}\in\mathbb{R}^{n\times n}$ is the identity matrix.

$\mathbf{S}^{G}$ is called the Global Diffusion Graph (GDG). Compared with the straightforward inner product, the construction of GDG focuses on the manifold information of each cluster rather than the holistic sample space.
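The diffusion in Eq. (10) can be implemented directly from its closed form; the NumPy sketch below assumes $\mathbf{G}$ comes from the adaptive-neighbor solver with $\lfloor\frac{n}{c}\rfloor$ neighbors, and the function name is illustrative.

```python
# Sketch of the Global Diffusion Graph via the PPR closed form in Eq. (10).
import numpy as np

def global_diffusion_graph(G, lam=0.2):
    """S^G = lam * (I - (1 - lam) * D^{-1/2} G D^{-1/2})^{-1}."""
    n = G.shape[0]
    deg = G.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    G_norm = d_inv_sqrt[:, None] * G * d_inv_sqrt[None, :]    # D^{-1/2} G D^{-1/2}
    return lam * np.linalg.inv(np.eye(n) - (1.0 - lam) * G_norm)
```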

Cluster-Level Contrastive Learning.

With both the local graph $\mathbf{S}^{L}$ and the global graph $\mathbf{S}^{G}$, contrastive learning is performed within the cluster space.

We assume that both graphs should share the same cluster structure. To this end, the graphs and embeddings are fed into a siamese GCN to get the cluster indicators

$$\mathbf{F}^{v_{1}}=\mathrm{GCN}_{2}(\mathbf{H}^{v_{1}},\widehat{\mathbf{S}}^{L}),\quad \mathbf{F}^{v_{2}}=\mathrm{GCN}_{3}(\mathbf{H}^{v_{1}},\widehat{\mathbf{S}}^{G}), \tag{11}$$

where $\widehat{\mathbf{S}}^{L}$ and $\widehat{\mathbf{S}}^{G}$ are the normalized graphs of $\mathbf{S}^{L}$ and $\mathbf{S}^{G}$. The softmax function is employed to activate the last layer of each GCN. In this way, each element of $\mathbf{F}^{v_{1}}\in\mathbb{R}^{n\times c}$ and $\mathbf{F}^{v_{2}}\in\mathbb{R}^{n\times c}$ records the probability that a sample belongs to a cluster. To emphasize the optimization of LPG, both $\mathrm{GCN}_{2}$ and $\mathrm{GCN}_{3}$ use the structural representation $\mathbf{H}^{v_{1}}$ as input.

Then, the cluster-level embedding is calculated as

$$\mathbf{Z}^{v_{1}}=(\mathbf{F}^{v_{1}})^{\mathrm{T}}\mathbf{H}^{v_{1}},\quad \mathbf{Z}^{v_{2}}=(\mathbf{F}^{v_{2}})^{\mathrm{T}}\mathbf{H}^{v_{1}}, \tag{12}$$

where $\mathbf{Z}^{v_{1}}\in\mathbb{R}^{c\times l}$ and $\mathbf{Z}^{v_{2}}\in\mathbb{R}^{c\times l}$ consist of the cluster centroids. Eq. (12) projects the cluster indicators into centroid matrices. Afterward, the centroids are used as anchors, and contrastive learning is conducted to search for a shared centroid distribution

$${\cal L}_{CL}=\frac{1}{2c}\sum_{i}^{c}\left[\Omega(\mathbf{Z}_{i}^{v_{1}},\mathbf{Z}_{i}^{v_{2}})+\Omega(\mathbf{Z}_{i}^{v_{2}},\mathbf{Z}_{i}^{v_{1}})\right], \tag{13}$$

where

$$\Omega(\mathbf{Z}_{i}^{v_{1}},\mathbf{Z}_{i}^{v_{2}})=-\log\frac{e^{\theta(\mathbf{Z}_{i}^{v_{1}},\mathbf{Z}_{i}^{v_{2}})/\tau}}{\sum_{j}^{c}e^{\theta(\mathbf{Z}_{i}^{v_{1}},\mathbf{Z}_{j}^{v_{2}})/\tau}+\sum_{j\neq i}^{c}e^{\theta(\mathbf{Z}_{i}^{v_{1}},\mathbf{Z}_{j}^{v_{1}})/\tau}}. \tag{14}$$

Eq. (13) pushes the centroids of different clusters apart, and ensures an explicit cluster structure. The siamese graph convolution module does not involve the initial graph $\mathbf{A}$, so the impact of the initial graph quality is reduced. Moreover, both cluster-level and feature-level contrastive learning employ centroids to leverage the cluster structure, and thus provide training guidance that facilitates clustering.
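A hedged PyTorch sketch of Eqs. (12)-(14) is given below; the soft cluster indicators are assumed to be the softmax outputs of $\mathrm{GCN}_{2}$ and $\mathrm{GCN}_{3}$, and the denominator gathers both cross-view and intra-view centroid negatives as in Eq. (14), so treat it as an illustration rather than the authors' implementation.

```python
# Sketch of the cluster-level contrastive loss in Eqs. (12)-(14).
import torch
import torch.nn.functional as F

def cluster_level_loss(H_v1, F_v1, F_v2, tau=0.5):
    """H_v1: (n, l) structural embeddings; F_v1, F_v2: (n, c) soft indicators."""
    Z1 = F.normalize(F_v1.T @ H_v1, dim=1)      # (c, l) centroids, Eq. (12)
    Z2 = F.normalize(F_v2.T @ H_v1, dim=1)

    def omega(Za, Zb):
        cross = torch.exp(Za @ Zb.T / tau)       # similarities to the other view
        intra = torch.exp(Za @ Za.T / tau)       # similarities within the same view
        intra = intra - torch.diag(torch.diag(intra))   # drop the anchor itself
        pos = torch.diag(cross)                  # matched centroid across views
        return (-torch.log(pos / (cross.sum(dim=1) + intra.sum(dim=1)))).mean()

    # Eq. (13): symmetrized average over the c centroids
    return 0.5 * (omega(Z1, Z2) + omega(Z2, Z1))
```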

Algorithm 1 DCGL

Input: Data matrix $\mathbf{X}$ with $n$ samples, cluster number $c$, initial neighbor number $k$, neighbor number update interval $t$, maximum iterations $iter$, parameters $\alpha$, $\beta$, $\gamma$.
Output: Clustering result for $\mathbf{X}$.

1:  Let $k^{init}=k$, $i=1$.
2:  Compute $\mathbf{A}$ by solving problem (1) with $\mathbf{X}$ and $k^{init}$.
3:  while $k\leq\lfloor\frac{n}{c}\rfloor$ and $i\leq iter$ do
4:     Obtain $\mathbf{H}^{v_{1}}$, $\mathbf{H}^{v_{2}}$, and $\widehat{\mathbf{X}}$ by the pseudo-siamese network.
5:     Perform $k$-means on $\mathbf{H}^{v_{2}}$ to obtain $c$ centroids.
6:     Compute $\mathbf{S}^{L}$ by solving problem (1) with $\mathbf{H}^{v_{1}}$ and $k$.
7:     Compute $\mathbf{H}$ with Eq. (8).
8:     Compute $\mathbf{G}$ by solving problem (1) with $\mathbf{H}$ and $\lfloor\frac{n}{c}\rfloor$.
9:     Compute $\mathbf{S}^{G}$ with Eq. (10).
10:     Obtain $\mathbf{F}^{v_{1}}$ and $\mathbf{F}^{v_{2}}$ by the siamese GCN.
11:     Compute $\mathbf{Z}^{v_{1}}$ and $\mathbf{Z}^{v_{2}}$ with Eq. (12).
12:     Compute ${\cal L}_{AE}$, ${\cal L}_{FL}$, ${\cal L}_{GL}$, and ${\cal L}_{CL}$, respectively.
13:     Update the model by minimizing Eq. (15) with Adam.
14:     if $i\%t==0$ then
15:        Let $k=k+k^{init}$.
16:     end if
17:     Let $i=i+1$.
18:  end while
19:  Perform spectral clustering on $\mathbf{S}^{L}$.
20:  return Cluster labels.
[Figure 2 panels: Raw, Epoch-6, Epoch-12, Epoch-18, Epoch-24, Final]

Figure 2: Visualization of partial embeddings learned by DCGL on YaleB.

Overall Model and Optimization

The overall loss function of DCGL is

$${\cal L}={\cal L}_{AE}+{\cal L}_{FL}+\alpha{\cal L}_{GL}+\beta{\cal L}_{CL}, \tag{15}$$

where $\alpha$ and $\beta$ are the trade-off parameters.

The optimizer for DCGL is Adam (Kingma and Ba 2014), and its parameters follow the authors' recommendations. The temperature in contrastive learning is set to 0.5. The parameter $\lambda$ in Eq. (10) is fixed to 0.2. The convergence condition of DCGL is that the neighbor number of LPG reaches its upper bound. Algorithm 1 displays the complete flow of DCGL, and the final result is obtained by performing NCut (Shi and Malik 2000) on the converged LPG.
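As an illustration of the final step, the sketch below runs NCut-style spectral clustering on the converged LPG via scikit-learn's precomputed-affinity interface; this is a standard approximation of NCut, not the authors' exact routine.

```python
# Sketch of the final clustering step: spectral clustering on the converged S^L.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_lpg(S_L, c, seed=0):
    """S_L: converged (n, n) local propinquity graph; c: number of clusters."""
    affinity = (S_L + S_L.T) / 2                   # ensure a symmetric affinity
    model = SpectralClustering(n_clusters=c, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(affinity)             # cluster labels for all samples
```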

Experiments

Benchmark Datasets

Seven publicly available datasets are collected as benchmarks, including the record types TOX-171 and Isolet, the image types ORL, YaleB, PIE and USPS, and the text type TR41. For YaleB, only the first 30 images of each class are retained. All samples are normalized with the $\ell_{2}$ norm. The details are exhibited in Table 1.

Evaluation Metrics

Two metrics named accuracy (ACC) and normalized mutual information (NMI) are used to evaluate the clustering performance. The mathematical formulas of ACC and NMI can be found in (Kong, Ding, and Huang 2011). Higher metric values indicate better clustering performance.
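For completeness, a standard sketch of the two metrics is shown below: ACC uses the Hungarian algorithm to find the best match between predicted and true labels, and NMI is taken from scikit-learn; this is the common recipe rather than code from (Kong, Ding, and Huang 2011).

```python
# Standard ACC (with optimal label matching) and NMI computation.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                            # confusion counts
    row, col = linear_sum_assignment(-cost)        # maximize matched samples
    return cost[row, col].sum() / y_true.size

def clustering_nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```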

Performance Comparison

Competitors.

Thirteen state-of-the-art clustering methods are applied for performance comparison, including six traditional algorithms, $k$-means (Hartigan and Wong 1979), STSC (Zelnik-Manor and Perona 2004), CAN (Nie, Wang, and Huang 2014), PCAN (Nie, Wang, and Huang 2014), RGC (Kang et al. 2019) and GEMMF (Chen and Li 2022), and seven deep models, DAEGC (Wang et al. 2019), DCRN (Liu et al. 2022c), DEC (Xie, Girshick, and Farhadi 2016), VGAE (Kipf and Welling 2016b), SpectralNet (Shaham et al. 2018), SUBLIME (Liu et al. 2022d) and AdaGAE (Li, Zhang, and Zhang 2021).

Except for DAEGC and DCRN, the other methods have been shown to be appropriate for non-graph-type data clustering. All codes are downloaded from the online resources released by the authors.

Settings.

The hyper-parameters of the competitors are set according to the authors' recommendations. For example, the graph regularization weight of GEMMF is searched from $\{10^{1},10^{3},10^{5},10^{7},10^{9}\}$, and the probability of the dropout layers from $\{0.1,0.5,0.9\}$. For the methods that require an initial graph, a KNN graph with Gaussian kernel weighting (Cai et al. 2010) is constructed as input, and the number of neighbors is selected from $\{5,6,7,8,9,10\}$.

For DCGL, the hyper-parameters $\alpha$, $\beta$, and $\gamma$ are fixed to $1$, $10^{3}$, and $2\times 10^{3}$ respectively. The neighbor number for LPG rises every 6 epochs. The maximum epoch number is 30. To ensure objectivity, the random seed is fixed before code execution, and each algorithm is repeated 10 times.

Dataset Type Samples Dimension Classes
TOX-171 Record 171 5748 4
ORL Image 400 1024 40
TR41 Text 878 7454 10
YaleB Image 1140 1024 38
Isolet Record 1560 617 26
PIE Image 2856 1024 64
USPS Image 9298 256 10
Table 1: Details of each benchmark dataset.
Method TOX-171 ORL TR41 YaleB Isolet PIE USPS
ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI
$k$-means 41.51 12.40 51.50 71.07 66.74 67.87 11.58 21.48 59.23 75.43 23.28 53.56 67.49 61.35
STSC 49.71 26.35 65.50 77.70 72.67 68.16 49.04 62.29 58.91 76.62 86.80 93.51 66.44 80.03
CAN 32.75 10.28 56.00 72.04 51.59 35.61 39.74 52.50 60.77 77.38 91.81 97.02 74.10 76.89
PCAN 45.03 21.66 61.00 74.18 62.30 53.97 45.09 49.73 57.24 69.87 85.68 87.88 65.41 76.16
RGC 50.88 28.60 59.50 77.11 64.24 64.06 31.32 43.68 63.53 77.92 40.62 68.29 72.00 78.93
GEMMF 43.86 13.19 64.75 79.29 58.89 57.74 31.40 51.84 63.53 78.23 41.81 71.35 69.96 66.43
DAEGC 46.20 16.90 59.50 76.09 61.62 57.87 23.86 42.86 59.68 73.11 39.57 71.48 70.43 59.77
DCRN 36.84 12.89 34.25 60.10 62.30 54.14 15.61 34.03 15.06 36.99 17.75 47.59 18.12 5.27
DEC 49.71 23.65 49.25 69.33 23.46 10.88 21.84 35.60 52.76 62.85 30.22 59.62 76.34 66.72
VGAE 49.71 22.99 56.75 74.99 72.78 64.25 27.19 40.60 63.59 77.06 38.41 63.17 66.66 60.78
SpectralNet 46.78 20.55 53.50 72.66 34.62 21.63 19.91 33.66 56.03 66.21 79.03 87.87 49.31 39.70
SUBLIME 51.46 26.87 68.75 82.29 65.49 66.85 47.72 62.41 64.41 76.64 82.35 90.15 59.14 64.50
AdaGAE 49.71 30.60 69.50 84.40 55.81 59.46 67.02 80.68 69.49 80.83 97.97 98.18 76.44 74.63
DCGL 51.46 34.75 75.50 86.91 73.23 69.80 81.93 86.18 71.73 80.85 99.86 99.87 78.32 80.87
Table 2: Clustering performance of different methods on benchmark datasets. Bold and underlined values indicate the optimal and sub-optimal results respectively.

Results.

The best clustering results of each algorithm are recorded in Table 2. We make the following observations: 1) STSC, GEMMF, VGAE, and SpectralNet perform relatively worse than the other graph-based methods. They keep the graph structure fixed during iteration, so the final performance is limited by the quality of the initial graph. This reveals that adaptive graph learning is beneficial to clustering; 2) both DAEGC and DCRN provide unsatisfactory results. These models assume that the initial graph is highly reliable, and learn the cluster embeddings mainly from the adjacency relation. The experimental results attest that such graph-type data clustering models are not suitable for general clustering tasks; 3) the deep models for general data clustering, i.e., SUBLIME, AdaGAE, and DCGL, realize a noticeable improvement. This verifies the potential of deep graph-based clustering. The good performance of SUBLIME and DCGL also demonstrates the feasibility of contrastive learning for processing non-vision data; 4) DCGL achieves better clustering than the GCN-based competitors, which reflects the effects of restraining representation collapse and mining the original features. Besides, the difference between DCGL and the GAE-based models mainly derives from the siamese graph convolution module, which gradually bootstraps a clear cluster structure. With the cooperation of the clustering-oriented mechanisms, DCGL is guided to preserve clustering-relevant information, and presents superior performance compared to the other algorithms.

Visualization Analysis

To display the effect of DCGL intuitively, we use $t$-SNE (Van der Maaten and Hinton 2008) to visualize the embeddings from $\mathrm{GCN}_{1}$ when clustering YaleB. Only the first 10 clusters are processed for ease of observation. The visualization result is shown in Fig. 2, where each point is plotted as its actual label value. It can be seen that the samples of the same category become gradually more compact as the iterations proceed, which indicates the practicability of DCGL for data clustering.

Ablation Studies

In this part, we first validate the effectiveness of two contrastive learning modules in DCGL, and then explore the impact of clustering-oriented mechanisms. Afterward, we analyze the impact of hyper-parameters.

Impact of Contrastive Learning.

[Figure 3 panels: DCGL-w/F, DCGL]

Figure 3: Ablation results of feature-level contrastive learning on PIE.

Define DCGL-w/F and DCGL-w/C as the DCGL variants without feature-level and cluster-level contrastive learning respectively. Since the outputs of $\mathrm{GCN}_{2}$ and $\mathrm{GCN}_{3}$ are only related to cluster-level contrastive learning, DCGL-w/C means the complete removal of the siamese graph convolution module. Table 3 shows the ablation comparisons. Obviously, DCGL still presents the best clustering performance, which proves the effectiveness of each contrastive learning strategy.

The performance of DCGL-w/F deteriorates compared to DCGL, which demonstrates the discriminability of the embeddings learned by DCGL. For a better illustration, we calculate the cosine similarity matrix of the data representation from $\mathrm{GCN}_{1}$. The heatmap on PIE is displayed in Fig. 3, where only the large values are preserved to facilitate observation. In the result of DCGL, the nodes across diverse classes have lower similarity, so we can conclude that feature-level contrastive learning is beneficial to improving discriminability.

[Figure 4 panels: DCGL-w/C, DCGL]

Figure 4: Ablation results of cluster-level contrastive learning on ORL.

DCGL exceeds DCGL-w/C on all benchmark datasets, which means the graph generated by DCGL maintains a clearer cluster structure. To verify the efficiency of cluster-level contrastive learning on graph optimization, we plot the converged LPGs from DCGL-w/C and DCGL on ORL in Fig. 4, where all non-zero elements are substituted with 1 to emphasize the adjacency relation. It can be seen that the graph obtained by DCGL exhibits a more distinct block diagonal structure, indicating more links between samples of the same category.

Dataset Metric DCGL-w/F DCGL-w/C DCGL
TOX-171 ACC 47.95 42.11 51.46
NMI 24.17 24.84 34.75
ORL ACC 70.50 64.00 75.50
NMI 83.72 81.17 86.91
TR41 ACC 61.73 63.21 73.23
NMI 64.85 65.39 69.80
YaleB ACC 75.18 77.54 81.93
NMI 82.89 84.23 86.18
Isolet ACC 69.23 67.63 71.73
NMI 79.63 78.92 80.85
PIE ACC 99.68 98.35 99.86
NMI 99.69 99.65 99.87
USPS ACC 76.75 76.98 78.32
NMI 80.19 79.09 80.87
Table 3: Ablation results of each contrastive learning module in DCGL. Bold values indicate the optimal results.
[Figure 5 panels: TOX-171, ORL, YaleB, Isolet]

Figure 5: Ablation results of clustering guidance on four datasets.

Impact of Clustering Guidance.

In the proposed contrastive learning strategies, feature-level and cluster-level guidance are introduced into the construction of sample pairs and the selection of anchors. To investigate the effectiveness of the above clustering guidance, we design three variants of DCGL with different contrastive learning approaches, termed DCGL-w/Fg, DCGL-w/Cg, and DCGL-w/all. Among them, DCGL-w/Fg discards the feature-level guidance, and regards all non-corresponding nodes of the anchor in the other branch as negative samples in feature-level contrastive learning. DCGL-w/Cg removes the cluster-level guidance by taking the output embeddings of $\mathrm{GCN}_{2}$ and $\mathrm{GCN}_{3}$ as anchors directly. DCGL-w/all applies both alterations.

Fig. 5 presents the comparison results. We can deduce that 1) the evident degradation of DCGL-w/all confirms the significance of clustering guidance; 2) the decline of DCGL-w/Fg implies that the discriminability is decreased without the feature-level guidance; 3) the disparity between DCGL-w/Cg and DCGL demonstrates that contrastive learning within the cluster space is helpful to perceive a clear cluster distribution.

Sensitivity of Hyper-Parameters $\alpha$ and $\beta$.

Finally, we focus on the influence of the trade-off parameters $\alpha$ and $\beta$. We tune both $\alpha$ and $\beta$ over the range $\{10^{-3},10^{-2},10^{-1},1,10^{1},10^{2},10^{3}\}$, and keep the other experimental settings unchanged. Fig. 6 exhibits the clustering ACC of DCGL with different parameter combinations. It is observed that 1) the performance under too small $\alpha$ and $\beta$ is not outstanding, which manifests the failure of the related mechanisms, and again attests to the effectiveness of the designed modules; 2) the influence of $\alpha$ is relatively more salient, because $\alpha$ is directly associated with graph learning and the final cluster partitioning; 3) a moderate $\alpha$ and a larger $\beta$ are likely to bring satisfactory clustering results. Overall, DCGL is not particularly sensitive to parameter selection within an appropriate range.

[Figure 6 panels: ORL, TR41, Isolet, PIE]

Figure 6: ACC of DCGL with different hyper-parameters $\alpha$ and $\beta$.

Conclusion

In this paper, we establish the Deep Contrastive Graph Learning (DCGL) model for non-graph data clustering. DCGL adopts a pseudo-siamese network to learn both the structural and attribute representations. Two clustering-oriented contrastive learning strategies are also designed to capture clustering-relevant information. Concretely, feature-level contrastive learning is performed to push each node away from the centroids of other clusters, so as to retain the discriminative features of each class. Besides, cluster-level contrastive learning is conducted to search for an explicit centroid distribution shared by both the learned local and global graphs, such that a clear cluster structure is achieved. Extensive experiments on benchmark datasets demonstrate the superiority of DCGL against state-of-the-art competitors.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (No. 2022YFC2808000) and by the National Natural Science Foundation of China (No. 61871470).

References

  • Bo et al. (2020) Bo, D.; Wang, X.; Shi, C.; Zhu, M.; Lu, E.; and Cui, P. 2020. Structural deep clustering network. In Proceedings of the Web Conference 2020, 1400–1410.
  • Cai et al. (2010) Cai, D.; He, X.; Han, J.; and Huang, T. S. 2010. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8): 1548–1560.
  • Chen and Li (2022) Chen, M.; and Li, X. 2022. Entropy minimizing matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 34(11): 9209–9222.
  • Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607.
  • Chen, Wu, and Zaki (2020) Chen, Y.; Wu, L.; and Zaki, M. 2020. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. Advances in Neural Information Processing Systems, 33: 19314–19326.
  • Elhamifar and Vidal (2013) Elhamifar, E.; and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11): 2765–2781.
  • Hartigan and Wong (1979) Hartigan, J. A.; and Wong, M. A. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (applied statistics), 28(1): 100–108.
  • Hassani and Khasahmadi (2020) Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, 4116–4126.
  • Huang et al. (2022) Huang, S.; Wu, H.; Ren, Y.; Tsang, I.; Xu, Z.; Feng, W.; and Lv, J. 2022. Multi-view subspace clustering on topological manifold. Advances in Neural Information Processing Systems, 35: 25883–25894.
  • Jiang et al. (2019) Jiang, B.; Zhang, Z.; Lin, D.; Tang, J.; and Luo, B. 2019. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11313–11320.
  • Jin et al. (2020) Jin, W.; Ma, Y.; Liu, X.; Tang, X.; Wang, S.; and Tang, J. 2020. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 66–74.
  • Kang et al. (2019) Kang, Z.; Pan, H.; Hoi, S. C.; and Xu, Z. 2019. Robust graph learning from noisy data. IEEE Transactions on Cybernetics, 50(5): 1833–1843.
  • Kang, Peng, and Cheng (2017) Kang, Z.; Peng, C.; and Cheng, Q. 2017. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 2080–2086.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kipf and Welling (2016a) Kipf, T. N.; and Welling, M. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Kipf and Welling (2016b) Kipf, T. N.; and Welling, M. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
  • Kong, Ding, and Huang (2011) Kong, D.; Ding, C.; and Huang, H. 2011. Robust nonnegative matrix factorization using l21-norm. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 673–682.
  • Li (2022a) Li, X. 2022a. Multi-modal cognitive computing. SCIENTIA SINICA Informationis, 53(1): 1–32.
  • Li (2022b) Li, X. 2022b. Vicinagearth security. Communication of the CCF, 18(11): 44–52.
  • Li et al. (2017a) Li, X.; Chen, M.; Nie, F.; and Wang, Q. 2017a. Locality adaptive discriminant analysis. In International Joint Conference on Artificial Intelligence, 2201–2207.
  • Li et al. (2017b) Li, X.; Chen, M.; Nie, F.; and Wang, Q. 2017b. A multiview-based parameter free framework for group detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 4147–4153.
  • Li, Zhang, and Zhang (2021) Li, X.; Zhang, H.; and Zhang, R. 2021. Adaptive graph auto-encoder for general data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12): 9725–9732.
  • Liu et al. (2022a) Liu, L.; Kang, Z.; Ruan, J.; and He, X. 2022a. Multilayer graph contrastive clustering network. Information Sciences, 613: 256–267.
  • Liu et al. (2022b) Liu, Y.; Jun, X.; Sihang, Z.; Siwei, W.; Xifeng, G.; Xihong, Y.; Ke, L.; Wenxuan, T.; Wang, L. X.; et al. 2022b. A survey of deep graph clustering: Taxonomy, challenge, and application. arXiv preprint arXiv:2211.12875.
  • Liu et al. (2022c) Liu, Y.; Tu, W.; Zhou, S.; Liu, X.; Song, L.; Yang, X.; and Zhu, E. 2022c. Deep graph clustering via dual correlation reduction. In Proceedings of the AAAI Conference on Artificial Intelligence, 7603–7611.
  • Liu et al. (2023) Liu, Y.; Yang, X.; Zhou, S.; Liu, X.; Wang, S.; Liang, K.; Tu, W.; and Li, L. 2023. Simple contrastive graph clustering. IEEE Transactions on Neural Networks and Learning Systems.
  • Liu et al. (2022d) Liu, Y.; Zheng, Y.; Zhang, D.; Chen, H.; Peng, H.; and Pan, S. 2022d. Towards unsupervised deep graph structure learning. In Proceedings of the Web Conference 2022, 1392–1403.
  • Mo et al. (2022) Mo, Y.; Peng, L.; Xu, J.; Shi, X.; and Zhu, X. 2022. Simple unsupervised graph representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 7797–7805.
  • Nie, Wang, and Huang (2014) Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 977–986.
  • Oord, Li, and Vinyals (2018) Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Ren et al. (2022) Ren, Y.; Pu, J.; Yang, Z.; Xu, J.; Li, G.; Pu, X.; Yu, P. S.; and He, L. 2022. Deep clustering: A comprehensive survey. arXiv preprint arXiv:2210.04142.
  • Shaham et al. (2018) Shaham, U.; Stanton, K.; Li, H.; Nadler, B.; Basri, R.; and Kluger, Y. 2018. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
  • Shi and Malik (2000) Shi, J.; and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8): 888–905.
  • Sun et al. (2022) Sun, Q.; Li, J.; Peng, H.; Wu, J.; Fu, X.; Ji, C.; and Philip, S. Y. 2022. Graph structure learning with variational information bottleneck. In Proceedings of the AAAI Conference on Artificial Intelligence, 4165–4174.
  • Tan et al. (2023) Tan, Y.; Liu, Y.; Huang, S.; Feng, W.; and Lv, J. 2023. Sample-level multi-view graph clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23966–23975.
  • Tu et al. (2021) Tu, W.; Zhou, S.; Liu, X.; Guo, X.; Cai, Z.; Zhu, E.; and Cheng, J. 2021. Deep fusion clustering network. In Proceedings of the AAAI Conference on Artificial Intelligence, 9978–9987.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
  • Wang et al. (2019) Wang, C.; Pan, S.; Hu, R.; Long, G.; Jiang, J.; and Zhang, C. 2019. Attributed graph clustering: A deep attentional embedding approach. In International Joint Conference on Artificial Intelligence, 3670–3676.
  • Xie, Girshick, and Farhadi (2016) Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, 478–487.
  • Yang et al. (2023) Yang, X.; Liu, Y.; Zhou, S.; Wang, S.; Tu, W.; Zheng, Q.; Liu, X.; Fang, L.; and Zhu, E. 2023. Cluster-guided contrastive graph clustering network. In Proceedings of the AAAI Conference on Artificial Intelligence, 10834–10842.
  • Yu et al. (2020) Yu, D.; Zhang, R.; Jiang, Z.; Wu, Y.; and Yang, Y. 2020. Graph-revised convolutional network. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 378–393.
  • Zelnik-Manor and Perona (2004) Zelnik-Manor, L.; and Perona, P. 2004. Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 17.
  • Zhang et al. (2022) Zhang, H.; Shi, J.; Zhang, R.; and Li, X. 2022. Non-graph data clustering via O(n)O(n) bipartite graph convolution. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Zhao et al. (2021) Zhao, H.; Yang, X.; Wang, Z.; Yang, E.; and Deng, C. 2021. Graph debiased contrastive learning with joint representation clustering. In International Joint Conference on Artificial Intelligence, 3434–3440.
  • Zhao et al. (2022) Zhao, J.; Guo, J.; Sun, Y.; Gao, J.; Wang, S.; and Yin, B. 2022. Adaptive graph convolutional clustering network with optimal probabilistic graph. Neural Networks, 156: 271–284.
  • Zhu et al. (2021) Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; and Wang, L. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, 2069–2080.