
Contrastive Representation Learning Based on Multiple Node-centered Subgraphs

Dong Li (Tianjin University, Tianjin, China) [email protected]; Wenjun Wang (Tianjin University, Tianjin, China) [email protected]; Minglai Shao (Tianjin University, Tianjin, China) [email protected]; and Chen Zhao (Baylor University, Waco, Texas, USA) [email protected]
(2023)
Abstract.

As the basic elements of graph-structured data, nodes have been recognized as the main object of study in graph representation learning. Intuitively, a single node is associated with multiple node-centered subgraphs of the whole graph (e.g., one person in a social network has multiple social circles based on his different relationships). We study this intuition under the framework of graph contrastive learning and propose a multiple node-centered subgraphs contrastive representation learning method that learns node representations on graphs in a self-supervised way. Specifically, we carefully design a series of node-centered regional subgraphs for each central node. Then, the mutual information between different subgraphs of the same node is maximized by a contrastive loss. Experiments on various real-world datasets and different downstream tasks demonstrate that our model achieves state-of-the-art results.

Contrastive Learning, Node-centered Subgraph, Graph Representation Learning
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, October 21–25, 2023, Birmingham, United Kingdom; booktitle: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), October 21–25, 2023, Birmingham, United Kingdom; price: 15.00; doi: 10.1145/3583780.3614825; isbn: 979-8-4007-0124-5/23/10; ccs: Computing methodologies – Unsupervised learning; ccs: Computing methodologies – Neural networks; ccs: Computing methodologies – Learning latent representations

1. Introduction

Graph representation learning has received increasing attention recently (Hamilton et al., 2017b); it aims to transform high-dimensional graph-structured data into low-dimensional dense vectorized representations. As nodes are the basic elements of graph-structured data, node representation has been the main object of graph representation learning. A comprehensive node representation can be readily used for a variety of downstream tasks, such as node classification (Kipf and Welling, 2016a) and link prediction (Grover and Leskovec, 2016).

Figure 1. A descriptive illustration of the different proxy tasks in DGI, GMI, GIC and our proposed MNCSCL. The two-way arrows represent MI maximization, and the different colors represent different models. The subgraph generator and the multiple subgraphs of $h_i$ are described in detail in Section 3.1 and Section 3.2.

A widespread family of graph representation learning methods is graph neural networks (GNNs) (Kipf and Welling, 2016a; Veličković et al., 2017; Wu et al., 2019; Zhao et al., 2020; Hamilton et al., 2017a). However, most such methods focus on supervised learning, relying on supervised signals in the graph. For real-world networks, acquiring these supervised signals is often cumbersome and expensive. Self-supervised learning (Jing and Tian, 2020), a popular research area in recent years, designs proxy tasks for unlabeled data to mine the representational properties of the data itself as supervised information.

Figure 2. The pipeline of MNCSCL. For a specific node $v_i$ with attributes $\mathbf{x}_i$, we first obtain its negative example $\tilde{\mathbf{x}}_i$ by perturbing the structure and attributes of the input graph $\mathcal{G}$ with a corruption function $\mathcal{C}$. Then we use a subgraph generator $\mathcal{T}$ to obtain a series of node-centered subgraphs $\mathbf{G}_i$ and their corresponding negative set $\tilde{\mathbf{G}}_i$ from $\mathbf{h}_i$ and $\tilde{\mathbf{h}}_i$ (obtained by a shared encoder $\mathcal{F}$). Finally, the mutual information between $\mathbf{G}_i$ and $\tilde{\mathbf{G}}_i$ is maximized in the latent spaces $\mathbf{V}_i$ and $\mathbf{U}_i$ (obtained by a readout function $\mathcal{R}$) by a contrastive loss. For more details, refer to the "overall framework" subsection in Section 3.

As one of the representative methods of self-supervised learning, contrastive learning uses as its proxy task the maximization of the Mutual Information (MI) (Qiu et al., 2020) between the input and related content (Xu et al., 2021). For example, Deep Graph Infomax (DGI) (Velickovic et al., 2019) maximizes the MI between the local view and the global view of the input graph and its corresponding corrupted graph. Graphical Mutual Information (GMI) (Peng et al., 2020) does not use a corruption function; instead, it maximizes the MI between the hidden representations of nodes and their original local structure. Graph InfoClust (GIC) (Mavromatis and Karypis, 2020), on the other hand, builds on DGI and maximizes the MI between a node representation and its corresponding cluster representation. Although these methods have achieved many advances, they all focus on the MI between node embeddings and only one related graph structure, as shown in Figure 1.

In reality, we can look at a specific thing from multiple perspectives. For graph data, we can likewise observe individual nodes from multiple perspectives, yet little literature has focused on this. Intuitively, an individual in a social network may have a social circle of relatives based on blood relations, a social circle of colleagues based on work relations, and other social circles of friends with different interests. Analyzing this individual through these different social circles is equivalent to learning about the corresponding node in the network from multiple perspectives.

Based on this intuition, we propose Multiple Node-centered Subgraphs Contrastive Representation Learning (MNCSCL). MNCSCL takes each node in the network as a center and samples its node-centered regional subgraphs under different semantics, thus forming several different perspectives of the corresponding node, as shown in Figure 1. More specifically, we first generate a negative example through the corruption function, then generate a series of node-centered subgraphs of the original graph with the subgraph generator, and sample the corresponding subgraphs on the negative example. These subgraphs are fed into graph neural network encoders to obtain the representations of the central nodes and, after pooling, of their subgraphs. Finally, the mutual information between different subgraphs of the same node is maximized in the latent space by a contrastive learning objective. Experimental results on a variety of datasets demonstrate the superb performance of our design. The major contributions of this paper are as follows:

  • We propose a novel framework that learns node representations through multiple node-centered subgraphs of each node, introducing the idea of observing a single node from multiple perspectives.

  • We carefully design five node-centered subgraphs and analyze the influence of different subgraphs on the quality of the learned node representations through extensive experiments, which provides a useful reference.

  • We evaluate MNCSCL on five standard datasets and two downstream tasks to validate the effectiveness of the proposed method. Our experiments show that contrastive learning over multiple subgraphs outperforms the above-mentioned single-subgraph contrastive learning.

2. Related Work

2.1. Graph Representation Learning Based on Contrastive Learning

Inspired by the advances of contrastive learning in fields such as CV and NLP, some work has started to apply contrastive learning to graph data for graph representation learning. The objective of graph contrastive learning is to maximize the MI between similar instances in a graph (Wu et al., 2021), and its model design centers on three modules: data augmentation, proxy tasks, and contrastive objectives. Among them, the most important module is the proxy task, which defines the similar instances (i.e., positive examples) and dissimilar instances (i.e., negative examples). DGI (Velickovic et al., 2019) extends the idea of DIM (Hjelm et al., 2018) to graph data and learns node representations by maximizing the MI between local node representations and a global graph representation. GMI (Peng et al., 2020) takes nodes and their neighbors as objects of study and maximizes the MI between the hidden representation of each node and the original features of its neighboring nodes. GIC (Mavromatis and Karypis, 2020) clusters the nodes in the graph with a differentiable version of K-means clustering, and then maximizes the MI between each node representation and its corresponding cluster summary. SUBG-CON (Jiao et al., 2020) obtains the context subgraph of each node by subgraph-sampling-based data augmentation, and then maximizes the consistency between them. Despite the good results achieved, these works perform graph contrastive learning from only a single perspective on nodes.

2.2. Multi-view Contrastive Learning

Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining (Li et al., 2018; Peng et al., 2022; Zhao and Chen, 2019; Zhao et al., 2020, 2021; Zhao, 2021). It has had considerable success in areas such as computer vision. For instance, Contrastive Multiview Coding (CMC) (Tian et al., 2020) uses contrastive learning to maximize the mutual information between multiple views of a dataset to learn image representations. MVGRL (Hassani and Khasahmadi, 2020) obtains multiple views of a graph through data augmentation, and finds that, unlike in visual representation learning, increasing the number of views of the entire graph beyond two via data augmentation does not improve performance. Unlike the multi-view graph contrastive learning summarized in MVGRL, which focuses on node attributes at the graph level, our multiple node-centered subgraphs focus more on structural differences at the node level.

3. Methodology

Problem definition. Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $N$ nodes, where $\mathcal{V}=\{v_1,v_2,...,v_N\}$ and $\mathcal{E}$ represent the node set and the edge set, respectively. $\mathbf{X}=\{\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_N\}\in\mathbb{R}^{N\times F}$ is the node feature matrix, where $\mathbf{x}_i\in\mathbb{R}^{F}$ denotes the features of dimension $F$ for node $v_i$. We use the adjacency matrix $\mathbf{A}\in\mathbb{R}^{N\times N}$ to represent the connectivity of the graph, where $\mathbf{A}(i,j)=1$ if nodes $v_i$ and $v_j$ are linked, and $\mathbf{A}(i,j)=0$ otherwise. In this way, a graph can also be represented as $\mathcal{G}=(\mathbf{X},\mathbf{A})$. If $\mathcal{V}^{\prime}\subset\mathcal{V}$ is a subset of vertices of $\mathcal{G}$ and $\mathcal{E}^{\prime}$ consists of all the edges in $\mathcal{E}$ that have both endpoints in $\mathcal{V}^{\prime}$, then the subgraph $\mathcal{S}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime})$ of graph $\mathcal{G}$ is an induced subgraph. The subgraphs mentioned in this paper are all induced subgraphs.

Table 1. Summary of notations. The first five notations are described in detail in the following subsections.
Notation Meaning
$\mathcal{F}$ Shared encoder
$\mathcal{C}$ Corruption function
$\mathcal{T}$ Subgraph generator
$\mathcal{R}$ Readout function
$\mathcal{D}$ Discriminator
$N$ Number of nodes in the input graph
$F$ Feature dimension of each node
$F^{\prime}$ Feature dimension of each node representation
$K$ Number of node-centered subgraphs sampled by $\mathcal{T}$
$N^{\prime}$ Number of nodes in the corresponding node-centered subgraph
$d$ Range of neighbors in the neighboring subgraph
$l$ Number of nodes most similar to the central node for the intimate subgraph
$C$ Number of clusters for the communal subgraph
$\eta$ Self-weighting factor for the full subgraph
$\mathcal{G},\tilde{\mathcal{G}}$ Input graph and its negative example obtained by $\mathcal{C}$
$\mathcal{G}_i^k,\tilde{\mathcal{G}}_i^k$ The $k$-th node-centered subgraph of node $v_i$ and its negative example
$\mathbf{H}_i^k,\tilde{\mathbf{H}}_i^k$ Node representation matrix of the $k$-th node-centered subgraph of node $v_i$ and its negative example
$\mathbf{A}_i^k,\tilde{\mathbf{A}}_i^k$ Adjacency matrix of the $k$-th node-centered subgraph of node $v_i$ and its negative example
$\mathbf{v}_i^k,\mathbf{u}_i^k$ Representation of the $k$-th node-centered subgraph of node $v_i$ and its negative example, obtained by $\mathcal{R}$

The goal of self-supervised graph representation learning is to learn an encoder $\mathcal{F}:\mathbb{R}^{N\times F}\times\mathbb{R}^{N\times N}\to\mathbb{R}^{N\times F^{\prime}}$, which takes the feature matrix $\mathbf{X}$ and the adjacency matrix $\mathbf{A}$ as input and produces the node representations $\mathbf{H}=\{\mathbf{h}_1,\mathbf{h}_2,...,\mathbf{h}_N\}\in\mathbb{R}^{N\times F^{\prime}}$ without label information, formulated as $\mathbf{H}=\mathcal{F}(\mathbf{X},\mathbf{A})$. The learned node representations $\mathbf{H}$ can be used directly for downstream tasks such as node classification and link prediction. For the sake of clarity, we list all important notations in Table 1.

Overall framework. Inspired by recent graph representation learning work based on contrastive learning, we propose the MNCSCL algorithm, which performs graph representation learning by maximizing the MI between multiple node-centered subgraphs of each node. As illustrated in Figure 2, given a single graph as input, the summarized steps of MNCSCL are as follows:

  • Utilize a corruption function $\mathcal{C}$ to perturb the structure and attributes of the input graph $\mathcal{G}$ to obtain a negative example $\tilde{\mathcal{G}}=(\tilde{\mathbf{X}},\tilde{\mathbf{A}})\sim\mathcal{C}(\mathbf{X},\mathbf{A})$.

  • Pass the input graph $\mathcal{G}$ and the negative example $\tilde{\mathcal{G}}$ into a shared encoder $\mathcal{F}$ to get node representations $\mathbf{H}$ and $\tilde{\mathbf{H}}$.

  • Use a subgraph generator $\mathcal{T}$ to sample a series of node-centered subgraphs $\mathbf{G}_i=\{\mathcal{G}_i^1,\mathcal{G}_i^2,...,\mathcal{G}_i^K\}$ for node $v_i$ from the input graph $\mathcal{G}$, where $K$ is the number of different subgraphs and $\mathcal{G}_i^k=(\mathbf{H}_i^k,\mathbf{A}_i^k)$, $k=1,2,...,K$. The corresponding node-centered subgraph set $\tilde{\mathbf{G}}_i=\{\tilde{\mathcal{G}}_i^1,\tilde{\mathcal{G}}_i^2,...,\tilde{\mathcal{G}}_i^K\}$ from the negative example $\tilde{\mathcal{G}}$ is further obtained according to $\mathbf{G}_i$.

  • Summarize all subgraphs in $\mathbf{G}_i$ and $\tilde{\mathbf{G}}_i$ through a readout function $\mathcal{R}$ to get the subgraph representations $\mathbf{V}_i=\{\mathbf{v}_i^1,\mathbf{v}_i^2,...,\mathbf{v}_i^K\}$ of node $v_i$ and the corresponding negative samples $\mathbf{U}_i=\{\mathbf{u}_i^1,\mathbf{u}_i^2,...,\mathbf{u}_i^K\}$. For example, $\mathbf{v}_i^k=\mathcal{R}(\mathbf{H}_i^k)$.

  • Update the parameters of $\mathcal{F}$, $\mathcal{R}$ and $\mathcal{D}$ (introduced later) by applying gradient descent to maximize Eq. (13) or Eq. (14). A minimal training-loop sketch is given after this list.
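As a rough illustration of these steps, the sketch below outlines one training step under the core-view objective of Eq. (13) in PyTorch. The callables encoder, corruption, subgraph_generator, readout and disc are hypothetical stand-ins for the components defined in Sections 3.1–3.3 (with disc returning an unnormalized score), not the authors' released code; minimizing the BCE loss here is equivalent to maximizing Eq. (13).

```python
import torch
import torch.nn as nn

def train_step(X, A, encoder, corruption, subgraph_generator, readout, disc, opt):
    """One MNCSCL training step under the core-view objective (a sketch)."""
    # Step 1: corrupt the input graph to obtain a negative example.
    X_neg, A_neg = corruption(X, A)
    # Step 2: the shared encoder produces node representations for both graphs.
    H, H_neg = encoder(X, A), encoder(X_neg, A_neg)
    # Step 3: sample index sets for the K node-centered subgraphs of every node.
    idx_sets = subgraph_generator(A)                 # idx_sets[i][k] -> node indices
    bce = nn.BCEWithLogitsLoss()
    loss = 0.0
    for i, subgraphs in enumerate(idx_sets):
        v1, u1 = H[i], H_neg[i]                      # Subgraph 1: the basic subgraph
        for idx in subgraphs[1:]:                    # Subgraphs 2..K
            vk = readout(H[idx])                     # positive subgraph summary
            # Positive pair (v_i^1, v_i^k) and negative pair (u_i^1, v_i^k).
            loss = loss + bce(disc(v1, vk).view(1), torch.ones(1)) \
                        + bce(disc(u1, vk).view(1), torch.zeros(1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```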

In the following sections, we will elaborate on the crucial components mentioned above.

3.1. Subgraph Generator

We can deal with a whole graph from different views (Hassani and Khasahmadi, 2020), and the same is true for any node in the graph. For a single node, there are many subgraphs centered on it (e.g., ego network (Zimmermann and Nagappan, 2008)). If these node-centered regional subgraphs have certain semantic information (e.g., ego network represents a specific individual and other persons who have a social relationship with him), then we can treat them as different perspectives of the central node.

The main role of the subgraph generator $\mathcal{T}$ is to sample these node-centered regional subgraphs from $\mathcal{G}$ and to generate the corresponding negative samples from $\tilde{\mathcal{G}}$. Specifically, for a specific node $v_i$ and a prepared node-centered subgraph type $k$, the subgraph generator $\mathcal{T}$ first obtains $idx$, the set of indices of the nodes chosen for the subgraph $\mathcal{G}_i^k$. Then, the node representation matrix $\mathbf{H}_i^k\in\mathbb{R}^{N^{\prime}\times F^{\prime}}$ and the adjacency matrix $\mathbf{A}_i^k\in\mathbb{R}^{N^{\prime}\times N^{\prime}}$ of $\mathcal{G}_i^k$ are given by

(1) $\mathbf{H}_i^k=\mathbf{H}_{idx,:},\ \mathbf{A}_i^k=\mathbf{A}_{idx,idx},$

where $\cdot_{idx}$ is an indexing operation and $N^{\prime}$ is the length of $idx$.

In this way, we can obtain the $k$-th node-centered subgraph $\mathcal{G}_i^k=(\mathbf{H}_i^k,\mathbf{A}_i^k)\sim\mathcal{T}(\mathbf{H},\mathbf{A})$ for any specific node $v_i$. Likewise, the corresponding negative example can be obtained by $\tilde{\mathcal{G}}_i^k=(\tilde{\mathbf{H}}_i^k,\tilde{\mathbf{A}}_i^k)=(\tilde{\mathbf{H}}_{idx,:},\tilde{\mathbf{A}}_{idx,idx})$.
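To make Eq. (1) concrete, the following sketch shows how a subgraph could be extracted from a node representation matrix and an adjacency matrix given an index set; the tensor shapes follow the notation above, and the function name is our own.

```python
import torch

def extract_subgraph(H, A, idx):
    """Index H (N x F') and A (N x N) with a list of node indices idx,
    returning H_i^k (N' x F') and A_i^k (N' x N') as in Eq. (1)."""
    idx = torch.as_tensor(idx, dtype=torch.long)
    H_sub = H[idx, :]                 # rows of the chosen nodes
    A_sub = A[idx][:, idx]            # induced adjacency among the chosen nodes
    return H_sub, A_sub

# Example: a 3-node subgraph of a toy 6-node path graph.
A = torch.zeros(6, 6)
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
H = torch.randn(6, 8)
H_sub, A_sub = extract_subgraph(H, A, [3, 4, 5])
print(H_sub.shape, A_sub.shape)       # torch.Size([3, 8]) torch.Size([3, 3])
```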

Figure 3. Five types of node-centered subgraphs for node $v_4$ in the original graph. Note that in (4) the nodes are divided into 3 clusters, namely $\{v_1,v_2\}$, $\{v_3,v_4\}$ and $\{v_5,v_6\}$.

3.2. Node-centered Subgraphs Design

To learn a more comprehensive representation, we carefully design five different node-centered subgraphs, as illustrated in Figure 3. The details of them are as follows:

Subgraph 1: Basic subgraph. The basic subgraph contains only the central node itself (i.e., $N^{\prime}=1$), which means that for each node $v_i$:

(2) $idx=\{i\}.$

Further, we can get the basic subgraph representation and its corresponding negative example by $\mathbf{v}_i^1=\mathbf{h}_i$ and $\mathbf{u}_i^1=\tilde{\mathbf{h}}_i$. For any specific node, the basic subgraph is the purest "subgraph" as well as the main subgraph, containing the most concentrated features of the node itself.

Subgraph 2: Neighboring subgraph. The neighbors of the central node are often closely related to it in structure, and studying them together can better capture the structural features of the node (Hamilton et al., 2017a). The neighboring subgraph contains all nodes at a distance less than or equal to $d$ from the central node $v_i$, denoted as

(3) $idx=\{j\,|\,dis(v_i,v_j)\leq d\},$

where $dis(\cdot,\cdot)$ is a function used to get the distance between two nodes. After obtaining the neighboring subgraph $\mathcal{G}_i^2$ by Eq. (1), a readout function $\mathcal{R}:\mathbb{R}^{N\times F^{\prime}}\to\mathbb{R}^{F^{\prime}}$ is used to obtain the neighboring subgraph representation and its corresponding negative example:

(4) $\mathbf{v}_i^2=\mathcal{R}(\mathbf{H}_i^2),\ \mathbf{u}_i^2=\mathcal{R}(\tilde{\mathbf{H}}_i^2).$

Taking a larger range of neighbors helps learn a more comprehensive representation. At the same time, however, the features of the central node are weakened, which makes the model more inclined to learn the representation of a region or even the whole graph. It follows that choosing the value of $d$ is important. As shown in Figure 5, the model reaches its best performance with $d=1$ for the neighboring subgraph.
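As a small illustration (not the authors' implementation), the index set of Eq. (3) can be obtained with a breadth-first expansion up to depth $d$ over the adjacency matrix; extract_subgraph refers to the hypothetical helper sketched earlier.

```python
import torch

def d_hop_indices(A, i, d):
    """Return the indices of all nodes within distance d of node i (Eq. (3)),
    including i itself, via a simple breadth-first expansion."""
    frontier, visited = {i}, {i}
    for _ in range(d):
        nxt = set()
        for u in frontier:
            neighbors = torch.nonzero(A[u], as_tuple=False).flatten().tolist()
            nxt.update(v for v in neighbors if v not in visited)
        visited.update(nxt)
        frontier = nxt
    return sorted(visited)

# Usage with the toy path graph from the previous sketch:
# idx = d_hop_indices(A, i=4, d=1)  ->  [3, 4, 5]
```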

Subgraph 3: Intimate subgraph. The intimate subgraph takes into account the similarity that actually exists between two nodes in the input graph $\mathcal{G}$. It contains the $l$ nodes that are most similar to the central node. This is equivalent to identifying the nodes that are structurally close to the central node from another perspective, and thus learning the structural features of the node more comprehensively.

The similarity between nodes is usually measured by a similarity score matrix $\mathbf{S}\in\mathbb{R}^{N\times N}$, where $\mathbf{S}(i,j)$ measures the similarity between nodes $v_i$ and $v_j$. Here we follow the personalized PageRank (PPR) algorithm (Jeh and Widom, 2003) as introduced in (Zhang et al., 2020). The similarity score matrix $\mathbf{S}$ based on the PPR algorithm can be written as

(5) $\mathbf{S}=\alpha\cdot(\mathbf{I}-(1-\alpha)\cdot\bar{\mathbf{A}})^{-1},$

where $\mathbf{I}$ is the identity matrix and $\bar{\mathbf{A}}=\mathbf{A}\mathbf{D}^{-1}$ denotes the column-normalized adjacency matrix. $\mathbf{D}$ is the diagonal degree matrix corresponding to $\mathbf{A}$, with $\mathbf{D}(i,i)=\sum_{j}\mathbf{A}(i,j)$ on its diagonal. $\alpha\in[0,1]$ is a parameter, set to 0.15 as in (Jiao et al., 2020). For a specific node $v_i$, the subgraph generator $\mathcal{T}$ chooses its top-$l$ similar nodes (i.e., $N^{\prime}=l$) according to $\mathbf{S}(i,:)$ to generate the intimate subgraph, which can be written as

(6) $idx=top\_rank(\mathbf{S}(i,:),l),$

where $top\_rank(\cdot)$ is a function that selects the top-$l$ values from a vector and returns the corresponding indices. As in Subgraph 2, we can obtain the intimate subgraph representation and its corresponding negative example with $\mathbf{v}_i^3=\mathcal{R}(\mathbf{H}_i^3)$ and $\mathbf{u}_i^3=\mathcal{R}(\tilde{\mathbf{H}}_i^3)$.
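A minimal sketch of Eqs. (5)–(6), assuming a dense adjacency matrix small enough to invert directly; for large graphs the fast approximations cited above would be needed, and the function names are ours.

```python
import torch

def ppr_matrix(A, alpha=0.15):
    """Personalized PageRank matrix S = alpha * (I - (1 - alpha) * A_bar)^-1 (Eq. (5)),
    where A_bar is the column-normalized adjacency matrix A D^-1."""
    deg = A.sum(dim=0).clamp(min=1.0)        # column sums; clamp avoids division by zero
    A_bar = A / deg                          # divides column j by deg[j]
    I = torch.eye(A.shape[0])
    return alpha * torch.linalg.inv(I - (1.0 - alpha) * A_bar)

def intimate_indices(S, i, l):
    """Indices of the top-l nodes most similar to node i (Eq. (6))."""
    return torch.topk(S[i], k=l).indices.tolist()
```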

Subgraph 4: Communal subgraph. In graph clustering, nodes in the same cluster tend to have similar attributes. Therefore, we can select all nodes in the cluster to which the central node belongs to form the communal subgraph. These nodes, whose attributes are similar to the central node's, help the model better learn the attribute features of the central node.

It is particularly noteworthy that the other subgraphs are obtained independently of the node attributes (i.e., $\mathbf{H}$), whereas the communal subgraph is attribute-related. Since $\mathbf{H}$ changes during training, fixing $idx$ before model training, as is done for the other subgraphs, may lead to undesirable results.

There are already many methods for graph clustering (Schaeffer, 2007). We tried three different clustering strategies based on $K$-means clustering (MacQueen, 1967) owing to its stable and excellent performance.

  • Strategy 1: Precomputed K-means. As with the other subgraphs, the node-centered subgraphs are sampled before model training starts. Specifically, the traditional $K$-means clustering algorithm is used to cluster the nodes in the input graph $\mathcal{G}$. For any specific node $v_i$, take the indices of all nodes in its cluster as $idx$ and further compute $\mathbf{v}_i^4$ and $\mathbf{u}_i^4$.

  • Strategy 2: A differentiable version of K-means. To update the communal subgraph during training, we need an end-to-end clustering algorithm. Here we follow the differentiable version of K-means introduced in (Wilder et al., 2019). For each node $v_i$, let $\mathbf{\mu}_c$ denote the center of cluster $c$ and $\hat{\mathbf{\gamma}}_{ic}$ (s.t. $\sum_{c}\hat{\mathbf{\gamma}}_{ic}=1,\forall i$) denote the degree to which node $v_i$ is assigned to cluster $c$. Suppose the number of clusters to be obtained is $C$; ClusterNet updates $\mathbf{\mu}_c$ via an iterative process by alternately setting

    (7) $\mathbf{\mu}_c=\frac{\sum_{i}\hat{\mathbf{\gamma}}_{ic}\mathbf{h}_i}{\sum_{i}\hat{\mathbf{\gamma}}_{ic}},\ \ c=1,...,C,$

    and

    (8) $\hat{\mathbf{\gamma}}_{ic}=\frac{\exp(-\beta\cdot sim(\mathbf{h}_i,\mathbf{\mu}_c))}{\sum_{c^{\prime}}\exp(-\beta\cdot sim(\mathbf{h}_i,\mathbf{\mu}_{c^{\prime}}))},\ \ c=1,...,C,$

    where $\beta$ is an inverse-temperature hyperparameter (the standard K-means assignment is recovered as $\beta\to\infty$) and $sim(\cdot,\cdot)$ denotes a similarity function between two instances. Eventually, we can get the communal subgraph representation by

    (9) $\mathbf{v}_i^4=\sigma\left(\sum_{c=1}^{C}\hat{\mathbf{\gamma}}_{ic}\mathbf{\mu}_c\right),$

    where $\sigma(\cdot)$ is the logistic sigmoid nonlinearity. Since this method does not produce an explicit $idx$, we simply obtain the negative examples $\{\mathbf{u}_1^4,\mathbf{u}_2^4,...,\mathbf{u}_N^4\}$ of all nodes by row-wise shuffling of $\{\mathbf{v}_1^4,\mathbf{v}_2^4,...,\mathbf{v}_N^4\}$.

  • Strategy 3: An end-to-end version of K-means with an estimation network. Here we replace the iterative process of Strategy 2 with an estimation network, a multi-layer neural network that directly predicts the degree to which each node belongs to each cluster, $\hat{\mathbf{\gamma}}\in\mathbb{R}^{N\times C}$, denoted as

    (10) $\hat{\mathbf{\gamma}}=softmax(MLP(\mathbf{H};\theta)),$

    where $softmax(\cdot)$ is the softmax nonlinearity and $MLP(\cdot)$ is a multi-layer neural network with trainable parameters $\theta$. We then obtain the communal subgraph representation by Eq. (8) and Eq. (9), as in Strategy 2.

The comparison of the three strategies is shown in Figure 5. After weighing both accuracy and efficiency, we chose Strategy 2 to generate the communal subgraph; a minimal sketch of this strategy is given below.
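The following sketch, under our own naming and with cosine similarity as a stand-in for $sim(\cdot,\cdot)$, illustrates the alternating updates of Eqs. (7)–(9); it is an illustration of the technique from (Wilder et al., 2019), not the authors' exact code. The sign inside the softmax is flipped relative to Eq. (8) on the assumption that $sim$ denotes a similarity (higher means closer); with a distance measure the printed negative sign applies directly.

```python
import torch
import torch.nn.functional as F

def soft_kmeans(H, C, beta=10.0, iters=10):
    """Differentiable K-means (Eqs. (7)-(8)): soft assignments gamma (N x C)
    and cluster centers mu (C x F')."""
    N = H.shape[0]
    mu = H[torch.randperm(N)[:C]].clone()            # initialize centers from random nodes
    for _ in range(iters):
        # Soft assignment (Eq. (8)), using cosine similarity as the similarity measure.
        sim = F.cosine_similarity(H.unsqueeze(1), mu.unsqueeze(0), dim=-1)   # N x C
        gamma = torch.softmax(beta * sim, dim=1)
        # Cluster centers as gamma-weighted means of node embeddings (Eq. (7)).
        mu = (gamma.t() @ H) / gamma.sum(dim=0, keepdim=True).t().clamp(min=1e-8)
    return gamma, mu

def communal_representation(H, C=128, beta=10.0):
    """Eq. (9): v_i^4 = sigmoid(sum_c gamma_ic * mu_c) for every node."""
    gamma, mu = soft_kmeans(H, C, beta)
    return torch.sigmoid(gamma @ mu)                 # N x F'
```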

Figure 4. The five node-centered subgraphs of node $v_i$ as an example. (a) The "core view" (left) and "full graph" (right) paradigms. The number within each region indicates how many MI terms are computed in that region; for example, if all 5 node-centered subgraphs are selected under the full graph case, MI is calculated once between every pair of subgraphs, so the region is marked with the number 10. (b) Contrastive losses under the core view and full graph cases. Refer to Section 3.3 for more details.
Table 2. The statistics of the five datasets. *Note that node classification on the PPI dataset is a multi-label classification problem.
Task Dataset Type #Nodes #Edges #Features #Classes Train / Val / Test
Node classification & Link prediction (Transductive) Cora Citation network 2,708 5,429 1,433 7 0.05 / 0.18 / 0.37
Citeseer Citation network 3,327 4,732 3,703 6 0.04 / 0.15 / 0.30
Pubmed Citation network 19,717 44,338 500 3 0.003 / 0.03 / 0.05
Node classification (Inductive) Reddit Social network 232,965 11,606,919 602 41 0.66 / 0.10 / 0.24
PPI Protein network 56,944 818,716 50 121* 0.79 / 0.11 / 0.10

Subgraph 5: Full subgraph. To learn a comprehensive representation of a node, it is essential to observe it from a global perspective. The full subgraph contains all the nodes (i.e., $N^{\prime}=N$) in the input graph $\mathcal{G}$, i.e., for any specific node $v_i$,

(11) $idx=\{j\,|\,j=1,2,...,N\}.$

Compared with the previous subgraphs, the full subgraph contains far more nodes, which greatly weakens the specificity of the central node. At the same time, it is not conducive to learning a specialized node representation, since all nodes share the same full subgraph. Based on these observations, we propose a self-weighted full subgraph representation for a specific node $v_i$, denoted as

(12) $\mathbf{v}_i^5=(1-\eta)\mathcal{R}(\mathbf{H}_i^5)+\eta\mathbf{h}_i,$

where $\eta\in[0,1]$ is a self-weighting factor.
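As a small illustrative snippet (our own naming), Eq. (12) amounts to blending the pooled full-graph summary with the node's own embedding:

```python
import torch

def full_subgraph_repr(H, h_i, readout, eta=0.01):
    """Eq. (12): v_i^5 = (1 - eta) * R(H) + eta * h_i, where the small self-weighting
    factor eta keeps the central node from being drowned out by the global summary."""
    return (1.0 - eta) * readout(H) + eta * h_i
```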

So far, we have introduced five carefully designed node-centered subgraphs. Note that all subgraphs other than Subgraph 4 can be precomputed before model training starts, so a single computation before training lets us retrieve these subgraphs quickly during training.

3.3. Contrastive Loss

The key idea of self-supervised contrastive learning is to define a proxy task that generates positive and negative samples. The encoder $\mathcal{F}$ that generates the node representations is then trained by contrasting positive and negative samples.

To handle the multiple subgraphs obtained above, we adopt the "core view" (CV) and "full graph" (FG) paradigms introduced in CMC (Tian et al., 2020), as shown in Figure 4(a). The contrastive losses under the core view and full graph cases are illustrated in Figure 4(b). Specifically, in the core view case, we regard Subgraph 1 as the most critical node-centered subgraph; it and its corresponding negative sample form positive and negative pairs with Subgraphs 2–5, respectively. In the full graph case, any two of Subgraphs 1–5 form a positive pair, and Subgraphs 1–5 respectively form negative pairs with their corresponding negative examples.

After defining the proxy task, we follow the intuitions from DGI (Velickovic et al., 2019) and use a noise-contrastive objective with a standard binary cross-entropy (BCE) loss between positive and negative examples. MNCSCL's objective under the core view case is

(13) $\mathcal{L}_{CV}=\sum_{i=1}^{N}\sum_{j=2}^{K}\mathbb{E}_{(\mathbf{X},\mathbf{A})}\left[\log\mathcal{D}(\mathbf{v}_i^1,\mathbf{v}_i^j)\right]+\sum_{i=1}^{N}\sum_{j=2}^{K}\mathbb{E}_{(\tilde{\mathbf{X}},\tilde{\mathbf{A}})}\left[\log(1-\mathcal{D}(\mathbf{u}_i^1,\mathbf{v}_i^j))\right],$

where $K$ is the number of selected node-centered subgraphs and $\mathcal{D}:\mathbb{R}^{F^{\prime}}\times\mathbb{R}^{F^{\prime}}\to\mathbb{R}$ is a discriminator used to estimate the MI by assigning higher scores to positive examples than to negatives. Under the full graph case, the objective becomes

(14) $\mathcal{L}_{FG}=\sum_{i=1}^{N}\sum_{j=1}^{K-1}\sum_{k=j+1}^{K}\mathbb{E}_{(\mathbf{X},\mathbf{A})}\left[\log\mathcal{D}(\mathbf{v}_i^j,\mathbf{v}_i^k)\right]+\sum_{i=1}^{N}\sum_{j=1}^{K}\mathbb{E}_{(\tilde{\mathbf{X}},\tilde{\mathbf{A}})}\left[\log(1-\mathcal{D}(\mathbf{v}_i^j,\mathbf{u}_i^j))\right].$

Eq. (13) and Eq. (14) are used as the contrastive loss in the experiments for the core view and full graph cases, respectively.
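Complementing the core-view sketch given earlier, the snippet below sketches the full-graph objective of Eq. (14) for a single node; V, U and disc are hypothetical placeholders (lists of the K subgraph summaries and a discriminator returning a logit), and minimizing this BCE is equivalent to maximizing Eq. (14).

```python
import itertools
import torch
import torch.nn as nn

def full_graph_loss(V, U, disc):
    """Sketch of the per-node full-graph objective (Eq. (14))."""
    bce = nn.BCEWithLogitsLoss()
    one, zero = torch.ones(1), torch.zeros(1)
    loss = 0.0
    # Positive pairs: every pair of distinct subgraph summaries of the same node.
    for vj, vk in itertools.combinations(V, 2):
        loss = loss + bce(disc(vj, vk).view(1), one)
    # Negative pairs: each subgraph summary against its corrupted counterpart.
    for vj, uj in zip(V, U):
        loss = loss + bce(disc(vj, uj).view(1), zero)
    return loss
```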

4. Experiments

Table 3. The classification accuracy (in %) on the transductive datasets and the micro-averaged F1 ($\times 100$) on the inductive datasets for the node classification task. Some results are taken directly from their original papers (DGI and GIC, as well as GMI and GRACE on the inductive datasets), and the other compared results are taken from (Jiao et al., 2020; Mo et al., 2022). The second column shows the data used during training (X: feature matrix, A: adjacency matrix, Y: labels). The best result for each dataset is shown in bold.
Method Input Transductive Inductive
X A Y Cora Citeseer Pubmed Reddit PPI
Raw features 56.6 ± 0.4 57.8 ± 0.2 69.1 ± 0.2 58.5 ± 0.1 42.5 ± 0.3
Deep Walk 67.2 43.2 65.3 32.4 52.9
GCN 81.4 ± 0.6 70.3 ± 0.7 76.8 ± 0.6 93.3 ± 0.1 51.5 ± 0.6
FastGCN 78.0 ± 2.1 63.5 ± 1.8 74.4 ± 0.8 89.5 ± 1.2 63.7 ± 0.6
DGI 82.3 ± 0.6 71.8 ± 0.7 76.8 ± 0.6 94.0 ± 0.1 63.8 ± 0.2
GMI 83.0 ± 0.2 72.4 ± 0.2 79.9 ± 0.4 95.0 ± 0.02 65.0 ± 0.02
GIC 81.7 ± 0.8 71.9 ± 0.9 77.4 ± 0.5 - -
GRACE 83.1 ± 0.2 72.1 ± 0.1 79.6 ± 0.5 94.2±0.0 66.2±0.1
MVGRL 82.9 ± 0.3 72.6 ± 0.4 80.1 ± 0.7 - -
MNCSCL-FG 84.3 ± 0.5 73.2 ± 0.6 80.0 ± 0.4 95.2 ± 0.1 67.3 ± 0.2
MNCSCL-CV 84.7 ± 0.3 73.8 ± 0.5 81.5 ± 0.4 95.8 ± 0.1 67.1 ± 0.2

4.1. Experimental Settings

Datasets. We use 5 benchmark datasets commonly used in previous work (Hamilton et al., 2017a; Mavromatis and Karypis, 2020) for the node classification and link prediction downstream tasks, including 3 transductive citation networks (i.e., Cora, Citeseer, and Pubmed), an inductive large social network (i.e., Reddit), and an inductive protein-protein interaction dataset containing multiple graphs with multiple labels (i.e., PPI).

  • Cora, Citeseer, and Pubmed are all citation networks, with Cora and Citeseer focusing on papers in computer and information science, and Pubmed containing a large amount of literature from the medical and life sciences. They represent citation relationships between papers as graphs, where each node represents a paper and the edges represent the citation relationships between papers.

  • Reddit is a collection of information from Reddit, the world's largest social news aggregation, discussion, and community site. The Reddit dataset provides a large amount of user-generated content, including posts, comments, polls, and more. In Reddit, posts are represented as nodes, and the connections between them correspond to user comments.

  • PPI is a protein-protein interaction dataset that contains multiple graphs with multiple labels. PPI contains the interaction relationships between proteins that can form a network or graph structure. Each node represents a protein, while edges indicate interactions between proteins.

We use all five datasets in the node classification task and follow the same training/test split as (Velickovic et al., 2019). In the link prediction task, we use the Cora, Citeseer, and Pubmed datasets and follow the setup described in (Kipf and Welling, 2016b). The statistics of all the datasets are shown in Table 2.

Baseline methods. In the node classification task, the compared methods include the direct use of raw features, one traditional unsupervised algorithm (i.e., DeepWalk (Perozzi et al., 2014)), two supervised graph neural networks (i.e., GCN (Kipf and Welling, 2016a) and FastGCN (Chen et al., 2018)), and five state-of-the-art self-supervised methods (i.e., DGI, GMI, GIC, GRACE (Zhu et al., 2020) and MVGRL (Hassani and Khasahmadi, 2020)).

  • DGI is one of the classical methods of graph representation learning based on contrastive learning. It aims to maximize the MI between the local perspective and the global perspective of the input graph, as well as the corresponding corrupted graph.

  • GMI draws on the ideas of DGI, but rather than employing a corruption function, this approach focuses on maximizing the MI between the hidden representations of nodes and their original local structure.

  • GIC is also inspired by DGI; its objective is to maximize the MI between a node's representation and the representation of the cluster to which it is assigned.

  • GRACE proposes a framework for unsupervised graph representation learning that leverages a contrastive objective at the node level. To enhance the contrast effect, it creates two sets of negative pairs, one within the same view and the other across different views.

  • MVGRL performs self-supervised learning by contrasting structural views of graphs, where they contrast first-order neighbors of nodes as well as a graph diffusion, with good results.

In link prediction, we directly follow the effective link prediction methods used in GIC (i.e., Deep Walk, Spectral Clustering (SC) (Tang and Liu, 2011), VGAE (Kipf and Welling, 2016b), ARGVA (Pan et al., 2019), DGI and GIC).

  • Spectral Clustering was initially introduced to solve the node partitioning problem in graph analysis. It has demonstrated satisfactory performance across diverse domains, such as graphs, text, images, and microarray data.

  • VGAE is an unsupervised learning framework designed for graph-structured data, utilizing the variational auto-encoder (VAE) methodology. In VGAE, a GNN-based encoder is employed to generate node embeddings, while a straightforward decoder is used to reconstruct the adjacency matrix.

  • ARGVA is a graph embedding framework specifically designed for graph data, incorporating adversarial learning techniques. It adopts a structure similar to VGAE, but learns the underlying data distribution through an adversarial approach.

Evaluation metrics. For the node classification task, we classify the test set with a logistic regression classifier, and then evaluate the performance using classification accuracy for the transductive datasets (i.e., Cora, Citeseer and Pubmed) and the micro-averaged F1 score for the inductive datasets (i.e., Reddit and PPI). Suppose that TP, FN, FP and TN represent the number of true positives, false negatives, false positives, and true negatives, respectively. Then the classification accuracy is $accuracy=(TP+TN)/(TP+FP+TN+FN)$, and the F1 score is $F1=2\cdot precision\cdot recall/(precision+recall)$, where $precision=TP/(TP+FP)$ and $recall=TP/(TP+FN)$. For the link prediction task, we use the AUC score (the area under the ROC curve) and the AP score (the area under the precision-recall curve) for evaluation. The closer the AUC and AP scores are to 1, the better the algorithm performs.
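For reference, these metrics can be computed with off-the-shelf scikit-learn utilities, as in the sketch below; the arrays are toy placeholders standing in for real test-set predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true = np.array([0, 1, 2, 1, 0])                        # node labels (placeholder)
y_pred = np.array([0, 1, 1, 1, 0])                        # predicted labels (placeholder)
acc = accuracy_score(y_true, y_pred)                      # classification accuracy
micro_f1 = f1_score(y_true, y_pred, average="micro")      # micro-averaged F1

link_labels = np.array([1, 0, 1, 1, 0])                   # existence of held-out edges
link_scores = np.array([0.9, 0.2, 0.7, 0.4, 0.1])         # predicted edge scores
auc = roc_auc_score(link_labels, link_scores)             # area under the ROC curve
ap = average_precision_score(link_labels, link_scores)    # area under the PR curve
print(acc, micro_f1, auc, ap)
```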

Training strategy. We implement MNCSCL using PyTorch (Ketkar and Moolayil, 2021) on 4 NVIDIA GeForce RTX 3090 GPUs and use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 (0.0001 for Reddit) during training. We follow the settings in DGI, using an early stopping strategy with a patience of 20 epochs for the transductive datasets and a fixed number of epochs (150 on Reddit, 20 on PPI) for the inductive datasets. For large graphs, we adopt the sampling strategy used by GraphSAGE (Hamilton et al., 2017a).

Table 4. The AUC and AP scores (in %) for the link prediction task. The results of the compared methods are reproduced from (Mavromatis and Karypis, 2020). The best result for each dataset is shown in bold.
Method Cora Citeseer Pubmed
AUC AP AUC AP AUC AP
DeepWalk 83.1 ± 0.01 85.0 ± 0.00 80.5 ± 0.02 83.6 ± 0.01 84.4 ± 0.00 84.1 ± 0.00
Spectral Clustering 84.6 ± 0.01 88.5 ± 0.00 80.5 ± 0.01 85.0 ± 0.01 84.2 ± 0.02 87.8 ± 0.01
VGAE 91.4 ± 0.01 92.6 ± 0.01 90.8 ± 0.02 92.0 ± 0.02 96.4 ± 0.00 96.5 ± 0.00
ARGVA 92.4 ± 0.004 93.2 ± 0.003 92.4 ± 0.003 93.0 ± 0.003 96.8 ± 0.001 97.1 ± 0.001
DGI 89.8 ± 0.8 89.7 ± 1.0 95.5 ± 1.0 95.7 ± 1.0 91.2 ± 0.6 92.2 ± 0.5
GIC 93.5 ± 0.6 93.3 ± 0.7 97.0 ± 0.5 96.8 ± 0.5 93.7 ± 0.3 93.5 ± 0.3
MNCSCL-CV 94.8 ± 0.4 94.2 ± 0.6 97.7 ± 0.4 97.2 ± 0.5 94.8 ± 0.2 95.4 ± 0.4

4.2. Implementation Details

Encoder design. For transductive datasets, we adopt a one-layer Graph Convolutional Network (GCN) as our encoder, with the following propagation rule:

(15) $\mathcal{F}(\mathbf{X},\mathbf{A})=\sigma(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X}\mathbf{W}),$

where $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}_N$ is the adjacency matrix with self-loops and $\hat{\mathbf{D}}$, with $\hat{\mathbf{D}}(i,i)=\sum_{j}\hat{\mathbf{A}}(i,j)$, is its corresponding degree matrix. $\sigma(\cdot)$ is the PReLU nonlinearity (He et al., 2015) and $\mathbf{W}$ is a learnable parameter matrix with $F^{\prime}=512$ ($F^{\prime}=256$ on Pubmed). For inductive datasets, we adopt a one-layer GCN with skip connections (Zhang and Meng, 2019) as our encoder, with the following propagation rule:

(16) $\mathcal{F}(\mathbf{X},\mathbf{A})=\sigma(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X}\mathbf{W}+\hat{\mathbf{A}}\mathbf{W}_{skip}),$

where $\mathbf{W}_{skip}$ is a learnable parameter matrix with $F^{\prime}=512$ for the skip connections.
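A minimal sketch of the one-layer GCN encoder of Eq. (15), using our own class name and the symmetric normalization and PReLU nonlinearity stated above:

```python
import torch
import torch.nn as nn

class OneLayerGCN(nn.Module):
    """Sketch of Eq. (15): sigma(D^-1/2 (A + I) D^-1/2 X W) with a PReLU nonlinearity."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.PReLU()

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0], device=A.device)          # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).clamp(min=1e-8).pow(-0.5)     # diagonal of D^-1/2
        A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
        return self.act(A_norm @ self.W(X))                         # node representations H

# Usage sketch: H = OneLayerGCN(F_in, 512)(X, A)
```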

Corruption function. For transductive datasets, we transform the adjacency matrix $\mathbf{A}$ into a diffusion matrix $\mathbf{U}$. Specifically, we compute the diffusion using fast approximation and sparsification methods (Gasteiger et al., 2019) with the heat kernel (Kondor and Lafferty, 2002):

(17) $\mathbf{U}=\exp(t\mathbf{A}\mathbf{D}^{-1}-t),$

where $\mathbf{D}$ is the diagonal degree matrix from Section 3.1 and $t$ is the diffusion time (Gasteiger et al., 2019). For the Reddit dataset, we implement the corruption function $\mathcal{C}$ by keeping the adjacency matrix $\mathbf{A}$ unchanged (i.e., $\tilde{\mathbf{A}}=\mathbf{A}$) and perturbing the feature matrix $\mathbf{X}$ by row-wise shuffling. For the PPI dataset, we simply sample a different graph from the training set, since it is a multi-graph dataset.
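As an illustrative sketch (not the exact implementation), the heat-kernel diffusion of Eq. (17) can be computed with a dense matrix exponential, and the row-wise feature shuffling used for Reddit is a one-liner; the function names are ours.

```python
import torch

def heat_diffusion(A, t=5.0):
    """Eq. (17): U = exp(t * A D^-1 - t), computed here as a dense matrix exponential
    (the paper instead uses the fast approximation/sparsification of Gasteiger et al., 2019)."""
    deg = A.sum(dim=0).clamp(min=1.0)
    A_bar = A / deg                                  # column-normalized transition matrix
    N = A.shape[0]
    return torch.matrix_exp(t * A_bar - t * torch.eye(N))

def shuffle_features(X):
    """Row-wise shuffling of the feature matrix, used as the corruption on Reddit."""
    return X[torch.randperm(X.shape[0])]
```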

Table 5. The classification accuracy (in %) of different node-centered subgraph combinations (Subgraph 1: basic subgraph, Subgraph 2: neighboring subgraph, Subgraph 3: intimate subgraph, Subgraph 4: communal subgraph, Subgraph 5: full subgraph) on the Cora, Citeseer and Pubmed datasets with the same hyperparameter settings. Note that, due to the use of the "core view" paradigm, each combination must contain at least the basic subgraph and one other subgraph. The best result for each dataset is shown in bold.
Subgraphs Dataset
1 2 3 4 5 Cora Citeseer Pubmed
82.5 ± 0.7 72.2 ± 0.7 77.5 ± 0.1
82.8 ± 0.3 72.5 ± 0.7 78.6 ± 0.9
82.7 ± 0.6 72.1 ± 0.6 78.2 ± 0.4
82.7 ± 0.5 71.8 ± 0.6 78.1 ± 0.6
AVG of 2 Subgraphs 82.7 ± 0.6 72.2 ± 0.7 78.2 ± 0.6
83.1 ± 0.5 72.7 ± 0.9 78.3 ± 0.8
82.9 ± 0.9 72.3 ± 0.5 78.0 ± 0.9
83.2 ± 0.4 72.1 ± 0.4 77.6 ± 0.8
82.9 ± 0.5 72.0 ± 0.7 78.6 ± 0.4
83.0 ± 0.5 72.9 ± 0.5 78.7 ± 0.8
83.1 ± 0.8 72.9 ± 0.6 78.5 ± 0.6
AVG of 3 Subgraphs 83.0 ± 0.6 72.5 ± 0.7 78.3 ± 0.8
83.5 ± 0.4 72.6 ± 0.7 79.2 ± 0.4
83.5 ± 0.4 73.0 ± 0.6 78.8 ± 0.9
83.6 ± 0.6 72.4 ± 0.6 78.5 ± 0.8
83.3 ± 0.7 73.1 ± 1.0 78.6 ± 0.6
AVG of 4 Subgraphs 83.5 ± 0.5 72.7 ± 0.8 78.7 ± 0.7
84.7 ± 0.3 73.8 ± 0.5 81.5 ± 0.4

Readout function. We use the same readout function $\mathcal{R}$ for all datasets, which computes a simple average of all node features of a given subgraph with $N^{\prime}$ nodes:

(18) $\mathcal{R}(\mathbf{H})=\sigma\left(\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}\mathbf{h}_i\right),$

where $\sigma(\cdot)$ is the logistic sigmoid nonlinearity.
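As a one-line sketch of Eq. (18) (the function name is ours):

```python
import torch

def readout(H_sub):
    """Eq. (18): sigmoid of the mean of the node representations in a subgraph."""
    return torch.sigmoid(H_sub.mean(dim=0))
```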

Discriminator. We use a simple bilinear scoring function as the discriminator $\mathcal{D}$:

(19) $\mathcal{D}(\mathbf{h}_i,\mathbf{h}_j)=\sigma(\mathbf{h}_i^{T}\mathbf{W}_d\mathbf{h}_j),$

where $\mathbf{W}_d$ is a learnable scoring matrix and $\sigma(\cdot)$ is the logistic sigmoid nonlinearity.
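A minimal sketch of such a bilinear discriminator in PyTorch (the class name is ours); returning the raw bilinear score as a logit lets it pair with a BCE-with-logits loss as in the earlier loss sketches, and applying a sigmoid recovers the probability form of Eq. (19).

```python
import torch
import torch.nn as nn

class BilinearDiscriminator(nn.Module):
    """Sketch of Eq. (19): score(h_i, h_j) = h_i^T W_d h_j, returned as a logit."""
    def __init__(self, dim):
        super().__init__()
        self.W_d = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W_d)

    def forward(self, h_i, h_j):
        return h_i @ self.W_d @ h_j
```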

Details of node-centered subgraphs. We conduct experiments with all five node-centered subgraphs under the two contrastive losses (named MNCSCL-FG for the full graph case and MNCSCL-CV for the core view case). More specifically, we choose $d=1$ for the neighboring subgraph and Strategy 2 for the communal subgraph. For the hyperparameters, we set the size $l$ of the intimate subgraph to 20 (10 on Citeseer). The number of clusters $C$ and the inverse-temperature hyperparameter $\beta$ for the communal subgraph are set to 128 and 10, respectively. To build a proper full subgraph, we also set the self-weighting factor $\eta$ to 0.01.

4.3. Experimental Results and Analysis

Node classification. Experiments show that MNCSCL achieves the best performance on all five datasets compared to the other competing self-supervised methods, as shown in Table 3. We attribute this robust performance to the comparison of multiple node-centered subgraphs, which yields a more comprehensive node representation. Although MNCSCL-FG and MNCSCL-CV both show excellent performance, MNCSCL-FG performs better on the PPI dataset and MNCSCL-CV performs better on the other datasets. We attribute this to the very sparse features available on PPI (over 40% of the nodes have all-zero features); using more perspectives for comparison is more effective on such sparse graph datasets. For the other datasets, the contrastive loss under the core view case is enough for MNCSCL to learn comprehensive information about the nodes, and too much comparison instead leads to overfitting and wasted resources. MNCSCL also outperforms the two supervised baselines on all datasets, which shows that our method is highly competitive even against traditional supervised methods.

Link prediction. To test the generalization capability of MNCSCL, we further investigate its performance on the link prediction task, as shown in Table 4. MNCSCL outperforms all competing methods on both the Cora and Citeseer datasets, suggesting that a comparison based on multiple node-centered subgraphs helps the model learn node representations with good generalizability. The excellent performance on different downstream tasks further demonstrates the feasibility of our method.

Figure 5. Heat map of the classification accuracy (in %) on the three transductive datasets when choosing different ranges of neighbors for the neighboring subgraph ($d=1$, $d=2$ and $d=3$) and different clustering strategies for the communal subgraph (S-1, S-2 and S-3, denoting Strategy 1, Strategy 2 and Strategy 3). The bold value in each sub-plot is the maximum classification accuracy for the current dataset.

4.4. Ablation Study

Node-centered subgraph combination. To investigate how the number of node-centered subgraphs affects the performance of MNCSCL, we permuted and combined the five node-centered subgraphs described above under the core view case and observed the classification accuracy of the different subgraph combinations on the Cora, Citeseer and Pubmed datasets with the same hyperparameter settings, as shown in Table 5. As the number of node-centered subgraphs increases, the classification accuracy keeps increasing, and the best results are achieved when all five subgraphs are used. This indicates that contrastive learning over multiple node-centered subgraphs can indeed learn better node representations. We also notice that the accuracy improvement becomes more pronounced as the number of node-centered subgraphs grows.

Range of neighbors and clustering strategy. We investigate the value of $d$ for the neighboring subgraph and the choice of clustering strategy for the communal subgraph; the results are shown in Figure 5. Here we use all five node-centered subgraphs with the contrastive loss under the core view case. The optimal choice on all three datasets is $d=1$ and Strategy 2. Our analyses are as follows.

  • For the neighboring subgraph, the classification accuracy tends to decrease as the value of $d$ increases. We believe that too many neighboring nodes cause the model to overfit and its performance to degrade. This can also be seen on the Pubmed dataset, where it is not obvious whether $d=1$ or $d=2$ gives better classification accuracy: because the attributes of Pubmed are relatively sparse, its neighboring subgraph sometimes needs to contain more neighbors to obtain better results.

  • In the choice of clustering strategy, Strategy 1 is significantly less effective than the other two end-to-end strategies. Strategy 2 and Strategy 3 show comparable performance, but considering that Strategy 3 involves an estimation network, which means more resource consumption, we finally choose Strategy 2 as the clustering strategy for the communal subgraph.

5. Conclusion

We propose Multiple Node-centered Subgraphs Contrastive Representation Learning (MNCSCL), a novel approach to self-supervised graph representation learning. For each node, MNCSCL obtains five carefully designed node-centered subgraphs through a subgraph generator and maximizes the mutual information between them through two types of contrastive loss, allowing us to obtain comprehensive node representations that combine information from multiple node-centered subgraphs. Experiments show that MNCSCL reaches the advanced level of self-supervised learning on both transductive and inductive node classification tasks as well as on the link prediction task.

Acknowledgements.
This work is supported by the NSFC program (No. 62272338) and the Natural Science Foundation of Inner Mongolia Autonomous Region of China (No.2022LHMS06008).

References

  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
  • Gasteiger et al. (2019) Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion improves graph learning. Advances in neural information processing systems 32 (2019).
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
  • Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017a. Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017).
  • Hamilton et al. (2017b) William L Hamilton, Rex Ying, and Jure Leskovec. 2017b. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584 (2017).
  • Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning. PMLR, 4116–4126.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision. 1026–1034.
  • Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018).
  • Jeh and Widom (2003) Glen Jeh and Jennifer Widom. 2003. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web. 271–279.
  • Jiao et al. (2020) Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-graph contrast for scalable self-supervised graph representation learning. In 2020 IEEE international conference on data mining (ICDM). IEEE, 222–231.
  • Jing and Tian (2020) Longlong Jing and Yingli Tian. 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2020), 4037–4058.
  • Ketkar and Moolayil (2021) Nikhil Ketkar and Jojo Moolayil. 2021. Introduction to pytorch. In Deep learning with python. Springer, 27–91.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Kondor and Lafferty (2002) Risi Imre Kondor and John Lafferty. 2002. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th international conference on machine learning, Vol. 2002. 315–322.
  • Li et al. (2018) Yingming Li, Ming Yang, and Zhongfei Zhang. 2018. A survey of multi-view representation learning. IEEE transactions on knowledge and data engineering 31, 10 (2018), 1863–1883.
  • MacQueen (1967) J MacQueen. 1967. Classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability. 281–297.
  • Mavromatis and Karypis (2020) Costas Mavromatis and George Karypis. 2020. Graph infoclust: Leveraging cluster-level node information for unsupervised graph representation learning. arXiv preprint arXiv:2009.06946 (2020).
  • Mo et al. (2022) Yujie Mo, Liang Peng, Jie Xu, Xiaoshuang Shi, and Xiaofeng Zhu. 2022. Simple unsupervised graph representation learning. AAAI.
  • Pan et al. (2019) Shirui Pan, Ruiqi Hu, Sai-fu Fung, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Learning graph embedding with adversarial training methods. IEEE transactions on cybernetics 50, 6 (2019), 2475–2487.
  • Peng et al. (2022) Qiyao Peng, Hongtao Liu, Yinghui Wang, Hongyan Xu, Pengfei Jiao, Minglai Shao, and Wenjun Wang. 2022. Towards a multi-view attentive matching for personalized expert finding. In Proceedings of the ACM Web Conference 2022. 2131–2140.
  • Peng et al. (2020) Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020. 259–270.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 701–710.
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1150–1160.
  • Schaeffer (2007) Satu Elisa Schaeffer. 2007. Graph clustering. Computer science review 1, 1 (2007), 27–64.
  • Tang and Liu (2011) Lei Tang and Huan Liu. 2011. Leveraging social media networks for classification. Data Mining and Knowledge Discovery 23, 3 (2011), 447–478.
  • Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2020. Contrastive multiview coding. In European conference on computer vision. Springer, 776–794.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Velickovic et al. (2019) Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. ICLR (Poster) 2, 3 (2019), 4.
  • Wilder et al. (2019) Bryan Wilder, Eric Ewing, Bistra Dilkina, and Milind Tambe. 2019. End to end learning and optimization on graphs. Advances in Neural Information Processing Systems 32 (2019).
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In International conference on machine learning. PMLR, 6861–6871.
  • Wu et al. (2021) Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z Li. 2021. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering (2021).
  • Xu et al. (2021) Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. 2021. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning. PMLR, 11548–11558.
  • Zhang and Meng (2019) Jiawei Zhang and Lin Meng. 2019. Gresnet: Graph residual network for reviving deep gnns from suspended animation. arXiv preprint arXiv:1909.05729 (2019).
  • Zhang et al. (2020) Jiawei Zhang, Haopeng Zhang, Congying Xia, and Li Sun. 2020. Graph-bert: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140 (2020).
  • Zhao (2021) Chen Zhao. 2021. Fairness-Aware Multi-Task and Meta Learning. Ph. D. Dissertation.
  • Zhao and Chen (2019) Chen Zhao and Feng Chen. 2019. Rank-based multi-task learning for fair regression. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 916–925.
  • Zhao et al. (2021) Chen Zhao, Feng Chen, and Bhavani Thuraisingham. 2021. Fairness-aware online meta-learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2294–2304.
  • Zhao et al. (2020) Jun Zhao, Xudong Liu, Qiben Yan, Bo Li, Minglai Shao, and Hao Peng. 2020. Multi-attributed heterogeneous graph convolutional network for bot detection. Information Sciences 537 (2020), 380–393.
  • Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131 (2020).
  • Zimmermann and Nagappan (2008) Thomas Zimmermann and Nachiappan Nagappan. 2008. Predicting defects using network analysis on dependency graphs. In Proceedings of the 30th international conference on Software engineering. 531–540.