Graph Communal Contrastive Learning
Abstract.
Graph representation learning is crucial for many real-world applications (e.g., social relation analysis). A fundamental problem for graph representation learning is how to effectively learn representations without human labeling, which is usually costly and time-consuming. Graph contrastive learning (GCL) addresses this problem by pulling the positive node pairs (or similar nodes) closer while pushing the negative node pairs (or dissimilar nodes) apart in the representation space. Despite the success of the existing GCL methods, they primarily sample node pairs based on node-level proximity, while community structures have rarely been taken into consideration. As a result, two nodes from the same community might be sampled as a negative pair. We argue that community information should be used to identify node pairs from the same communities, where the nodes inside are semantically similar. To address this issue, we propose a novel Graph Communal Contrastive Learning framework to jointly learn the community partition and the node representations in an end-to-end fashion. Specifically, the proposed framework consists of two components: a Dense Community Aggregation algorithm for community detection and a Reweighted Self-supervised Cross-contrastive training scheme that utilizes the community information. Additionally, real-world graphs are complex and often consist of multiple views. In this paper, we demonstrate that the proposed framework can also be naturally adapted to multiplex graphs. Finally, we comprehensively evaluate the proposed framework on a variety of real-world graphs. The experimental results show that it outperforms the state-of-the-art methods.
1. Introduction
Graph representation learning is crucial in many real-world applications, including predicting friendship in social relationships (Zhu et al., 2017), recommending products to potential users (Li and Zhang, 2020) and detecting social opinions (Wang et al., 2017; Wang et al., 2021a). However, traditional graph representation learning methods (Hu et al., 2019; Kipf and Welling, 2016a; Veličković et al., 2018a) demand labeled graph data of high quality, and such data is expensive to obtain and sometimes even unavailable due to privacy and fairness concerns (Kang and Tong, 2021; Kang et al., 2020). To address this problem, recent Graph Contrastive Learning (GCL) methods (Li et al., 2019; Kipf and Welling, 2016a; Hamilton et al., 2017; You et al., 2020; Zhu et al., 2021; Qiu et al., 2020; Hassani and Khasahmadi, 2020) train graph encoders by distinguishing positive and negative node pairs without using external labels. Among them, node-level GCL methods regard all pairs of different nodes as negative samples: they pull positive node pairs (representations of the same node in two views) closer and push negative node pairs (representations of different nodes) apart (Wu et al., 2021). Variants of GCL (Sun et al., 2020; Hassani and Khasahmadi, 2020; Zhang et al., 2020b) employ mutual information maximization to contrast node pairs and have achieved state-of-the-art performance on many downstream tasks, such as node classification (Bhagat et al., 2011), link prediction (Liben-Nowell and Kleinberg, 2007) and node clustering (Schaeffer, 2007).
Despite the success of these methods, most existing GCL methods primarily focus on node-level proximity but fail to explore the inherent community structures of graphs when sampling node pairs. For example, MVGRL (Hassani and Khasahmadi, 2020), GCA (Zhu et al., 2021) and GraphCL (You et al., 2020) regard the same nodes in different graph views as positive pairs; (Hassani and Khasahmadi, 2020) regards neighboring nodes as positive pairs; DGI (Veličković et al., 2018b) and HDI (Jing et al., 2021a) produce negative pairs by randomly shuffling the node attributes; (Zhang et al., 2020a; Qiu et al., 2020; Hafidi et al., 2020) regard different sub-graphs as negative samples. However, the structural information in graphs, such as community structures, has not been fully explored. Community structures can be found in many real graphs (Xu et al., 2018). For example, in social graphs, people are grouped by their interests (Scott, 1988): people with the same interest tend to be densely connected by their interactions, while people with different interests are loosely connected. Therefore, people in the same interest community are structurally similar, and treating them as negative pairs introduces structural errors into the node representations. To address this problem, we first propose a novel dense community aggregation algorithm to learn a community partition based on the structural information of the graph. Next, we introduce a novel reweighted self-supervised cross-contrastive training scheme to pull the nodes within the same community closer to each other in the representation space.
Additionally, in real graphs, nodes can be connected by multiple relations (Park et al., 2020; Jing et al., 2021d, b; Yan et al., 2021), and each relation can be treated as a graph view (forming a multiplex graph (De Domenico et al., 2013)). One way to adapt graph representation learning methods to multiplex graphs is to learn the representations of each graph view separately and then combine them with a fusion model (Jing et al., 2021a). However, this approach ignores the structural dependence between different graph views. For example, IMDB (Wang et al., 2019) can be modeled as a multiplex movie graph connected by both co-actor and co-director relations. These two kinds of connections are not independent, since personal relations between actors and directors affect their participation in movies. To address this problem, we keep the contrastive training consistent with the single-relation setting: we apply pair-wise contrastive training on each pair of graph views, and we combine the model outputs of all views in a post-ensemble fashion for the downstream tasks.
The main contributions of this paper are summarized as follows:
- We propose a novel Dense Community Aggregation algorithm to detect structurally related communities by end-to-end training.
- We propose a Reweighted Self-supervised Cross-contrastive training scheme, which employs community information to enhance performance on down-stream tasks.
- We adapt our model to multiplex graphs by pair-wise contrastive training, which accounts for the dependence between different graph views.
- We evaluate our model on various real-world datasets with multiple down-stream tasks and metrics. Our model achieves state-of-the-art performance.
The rest of the paper is organized as follows: preliminaries (Sec. 2), methodology (Sec. 3), experiments (Sec. 4), related works (Sec. 5) and conclusion (Sec. 6).

2. Preliminaries
In this section, we introduce the similarity measurement used in this paper (Sec. 2.1), the basic concepts of contrastive learning (Sec. 2.2) and community detection (Sec. 2.3). Then we give a brief definition of attributed multiplex graphs in Sec. 2.4. The notations used in this paper are summarized in Table 1.
Symbol | Description |
---|---|
– | attributed graph |
– | attributed multiplex graph |
– | edge set |
– | augmented edge set |
– | attribute matrix |
– | augmented attribute matrix |
– | attribute matrix of view |
– | adjacency matrix |
– | extended adjacency matrix |
– | community |
– | community centroid matrix |
– | community assignment matrix |
– | community density matrix |
– | edge count function |
– | edge density function |
– | entropy |
– | mutual information |
– | exponent cosine similarity |
– | Gaussian RBF similarity |
– | GCN encoder with parameters |
– | linear classifier with weights |
– | random augmentation |
– | intra-community density score, scalar |
– | inter-community density score, scalar |
2.1. Similarity Measurement
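We use two similarity measures throughout the paper: the exponent cosine similarity (Eq. 1) and the Gaussian RBF similarity (Eq. 2). A minimal sketch of the two measures, assuming their standard forms with a temperature `tau` and a bandwidth `sigma` (both assumed hyperparameters):

```python
import torch
import torch.nn.functional as F

def exp_cosine_sim(u, v, tau=0.5):
    """Exponent cosine similarity: exp(cos(u, v) / tau) over the last dimension.
    A sketch assuming the standard form; `tau` is an assumed temperature."""
    return torch.exp(F.cosine_similarity(u, v, dim=-1) / tau)

def gaussian_rbf_sim(u, v, sigma=1.0):
    """Gaussian RBF similarity: exp(-||u - v||^2 / (2 * sigma^2)).
    A sketch assuming the standard RBF kernel; `sigma` is an assumed bandwidth."""
    return torch.exp(-((u - v) ** 2).sum(dim=-1) / (2 * sigma ** 2))
```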
2.2. Contrastive Learning
Contrastive learning aims at distinguishing similar (positive) and dissimilar (negative) pairs of data points, encouraging the agreement of the similar pairs and the disagreement of the dissimilar ones (Chuang et al., 2020). The general process, dating back to (Becker and Hinton, 1992), can be described as follows: for a given data point, two differently augmented versions are derived from it and passed through a shared encoder, and the mutual information (Cover and Thomas, 1991) (or a variant such as InfoNCE (Oord et al., 2018)) between the two resulting representations is maximized. The motivation of contrastive learning is to learn representations that are invariant to the perturbations introduced by the augmentation schemes (Xiao et al., 2020), and thus robust to the inherent noise in the dataset.
A widely used objective for contrastive learning is InfoNCE (Oord et al., 2018), which applies an instance-wise contrastive objective by sampling the same instances in different views as positive pairs and all other pairs between different views as negative pairs. The limitation of InfoNCE is that it samples negative pairs uniformly, ignoring the semantic similarities between some instances.
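For concreteness, a minimal sketch of an InfoNCE-style objective between two batches of view representations, assuming cosine similarity and a temperature hyperparameter (the exact negative sampling used later in the paper may differ):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """InfoNCE over two views: the i-th rows of z1 and z2 form the positive pair,
    all other cross-view pairs serve as negatives (a minimal sketch)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)           # -log softmax of the positive pairs
```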
2.3. Community Detection
Communities in graphs are groups of nodes that are more densely connected among themselves than with the rest of the graph. Community structures widely exist in many real-world graphs, such as social, biological and transportation networks (Newman, 2018). Detecting communities in a graph has become a primary approach to understanding how structure relates to the high-order information inherent to the graph (Huang et al., 2021).
2.3.1. Formulation
For a graph with node set $V$ and edge set $E$, we define one of its communities as a sub-graph: a community $C_k$ consists of a node subset $V_k \subseteq V$ together with the edges in $E$ whose endpoints both lie in $V_k$.
2.3.2. Modularity
A widely used evaluation for community partitions is the modularity (Newman and Girvan, 2004), defined as:
$$Q = \frac{1}{2|E|}\sum_{i,j}\Big[\mathbf{A}_{ij} - \frac{d_i d_j}{2|E|}\Big]\,\delta(i,j) \qquad (3)$$
where $\mathbf{A}$ is the adjacency matrix, $d_i$ is the degree of the $i$-th node, and $\delta(i,j)$ is $1$ if nodes $i$ and $j$ are in the same community and $0$ otherwise. The modularity measures the impact of each edge on the local edge density (the term $d_i d_j / 2|E|$ represents the expected local edge density), and it is easily disturbed by changes of the edges.
2.3.3. Other notations
We define the edge count function $e(\cdot)$ over the adjacency matrix:
$$e(C_k) = \frac{1}{2}\sum_{i,j} \mathbf{A}^{(C_k)}_{ij} \qquad (4)$$
where $\mathbf{A}^{(C_k)}$ is the adjacency matrix of community $C_k$. The edge density function compares the real edge count with the maximal possible number of edges in the given community $C_k$:
$$\mathrm{den}(C_k) = \frac{e(C_k)}{\binom{|C_k|}{2}} \qquad (5)$$
where $|C_k|$ denotes the number of nodes in $C_k$.
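To make the two functions concrete, a minimal sketch that computes the edge count and edge density of one community, assuming a dense, symmetric 0/1 adjacency matrix without self-loops (the function and argument names are illustrative):

```python
import torch

def edge_density(adj, members):
    """Edge density of one community (Eq. 5): actual edge count over the maximal
    possible number of edges among the member nodes.
    `adj` is a dense, symmetric 0/1 adjacency matrix; `members` is a boolean mask
    selecting the community's nodes."""
    sub = adj[members][:, members]
    n = sub.size(0)
    edge_count = sub.sum() / 2                 # each undirected edge is counted twice
    max_edges = n * (n - 1) / 2
    return edge_count / max_edges if max_edges > 0 else torch.tensor(0.0)
```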
2.4. Attributed Multiplex Graph
Multiplex graphs are also known as multi-dimensional graphs (Ma et al., 2018) or multi-view graphs (Shi et al., 2018; Fu et al., 2020; Jing et al., 2021c). They are comprised of multiple single-view graphs that share the same nodes and attributes but have different graph structures (usually with different types of links) (Chang et al., 2015). Formally, an attributed multiplex graph is a collection of attributed graph views over the same node set, where each view is an attributed graph with its own edge set. If the number of views is 1, the attributed multiplex graph reduces to an attributed graph. We show an example of an attributed multiplex graph in Fig. 2.

3. Methodology
In this section, we first introduce the Dense Community Aggregation algorithm in Sec. 3.1, and then propose the Reweighted Self-supervised Cross-contrastive training scheme in Sec. 3.2; both are important components of the proposed model. Further discussion on the adaptation to multiplex graphs is presented in Sec. 3.3.
3.1. Dense Community Aggregation
Node-level GCL methods are vulnerable to the problem caused by pairing structurally close nodes as negative samples. To address this issue, we propose a novel dense community aggregation algorithm for detecting structurally related communities of a single graph view. Our method is inspired by the modularity (Newman and Girvan, 2004) of graphs, which measures the local edge density of communities. However, modularity is easily perturbed by variations of the edges (Liu et al., 2020), which limits its robustness in detecting communities (whose edges change dynamically in each epoch). Therefore, our goal is to enhance the robustness of modularity and to further extend it by maximizing the edge density within each community while minimizing the edge density between different communities. Dense community aggregation is carried out by end-to-end training and is illustrated in Fig. 3.

We learn the community partition by training a randomly initialized community centroid matrix in an end-to-end fashion, where each row represents the center of one community.
First, we softly assign each node in the graph to the community centroids with probabilities computed by the similarity function (Eq. 1 or Eq. 2). Specifically, we define a community assignment matrix $\mathbf{R}$, in which each row is a normalized similarity vector measuring the similarity between one node and all community centroids. Formally, each entry $R_{ik}$ is computed by
$$R_{ik} = \frac{s\big(f_\theta(\mathbf{X}, E)_i,\, \mathbf{c}_k\big)}{\sum_{k'} s\big(f_\theta(\mathbf{X}, E)_i,\, \mathbf{c}_{k'}\big)} \qquad (6)$$
where $s(\cdot,\cdot)$ is the similarity function defined in Sec. 2.1, $f_\theta$ is the graph encoder with parameters $\theta$, $\mathbf{c}_k$ is the $k$-th community centroid, and the denominator normalizes the probability of each community by dividing it by the sum over all communities, so that $\sum_k R_{ik} = 1$ holds for each node $i$.
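A minimal sketch of this soft-assignment step, assuming node representations and centroids as dense tensors and either similarity measure from Sec. 2.1 passed in as `sim_fn` (names are illustrative):

```python
import torch

def soft_assign(h, centroids, sim_fn):
    """Soft community assignment matrix R (Eq. 6): each row holds the similarities
    between one node representation and every centroid, normalized to sum to one."""
    # h: [num_nodes, d] node representations; centroids: [num_communities, d].
    sims = sim_fn(h.unsqueeze(1), centroids.unsqueeze(0))   # [num_nodes, num_communities]
    return sims / sims.sum(dim=1, keepdim=True)
```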
Second, we employ two objectives for training the community partition (see Table 1 and Sec. 2.3.3 for symbol definitions): the intra-community density
(7)
and the inter-community density
(8)
These two objectives measure the impact of each edge on the community edge density (rather than on the local edge density used in modularity (Newman and Girvan, 2004)). Specifically, the terms in Eq. 7 and Eq. 8 represent the gap between the real local density and the corresponding expected density for the intra-community and inter-community cases, respectively. The centroid matrix is updated to reach a reasonable community partition by minimizing the joint objective:
(9)
where the coefficient balances the two density terms. Moreover, for computational efficiency, we slightly modify the form of the intra-community density in our actual implementation (see Appendix C). Finally, we combine the objectives of the two graph views and carry out dense community aggregation for them simultaneously:
(10)
3.2. Reweighted Self-supervised Cross-contrastive Training
We propose the Reweighted Self-supervised Cross-contrastive training scheme in this section. We first apply graph augmentations to generate two graph views, and then apply the node contrast and the community contrast simultaneously to consider both node-level and community-level information. We introduce node-community pairs as additional negative samples to address the problem of pairing nodes in the same community as negative samples.
3.2.1. Graph augmentation
We define the graph augmentation as randomly masking attributes and edges with specific probabilities (Zhu et al., 2021). First, we define the attribute mask as a random vector $\mathbf{m}$, where each dimension is independently drawn from a Bernoulli distribution. The augmented attribute matrix is computed by
$$\widetilde{\mathbf{X}} = [\,\mathbf{x}_1 \circ \mathbf{m};\ \mathbf{x}_2 \circ \mathbf{m};\ \cdots;\ \mathbf{x}_N \circ \mathbf{m}\,] \qquad (11)$$
where $\circ$ is the Hadamard product (Horn, 1990) and $\mathbf{X} = [\,\mathbf{x}_1; \mathbf{x}_2; \cdots; \mathbf{x}_N\,]$ is the original attribute matrix. Second, we generate the augmented edge set $\widetilde{E}$ by randomly dropping edges from the original edge set $E$ (Zhu et al., 2020) with probability
$$P\{(u, v) \in \widetilde{E}\} = 1 - p_e, \quad \forall\, (u, v) \in E \qquad (12)$$
where $p_e$ is the edge-dropping probability. We apply two independent random augmentations of the above form to generate two graph views, and then obtain their representations through a shared GCN encoder.
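A minimal sketch of the two augmentation operators, assuming a PyG-style `edge_index` tensor of shape [2, num_edges]; the masking and dropping probabilities are placeholders for the per-dataset values in Table 9:

```python
import torch

def augment(x, edge_index, attr_mask_p=0.2, edge_drop_p=0.3):
    """Random graph augmentation (Sec. 3.2.1): mask attribute dimensions with
    Bernoulli noise and drop edges with a fixed probability (a sketch)."""
    # Attribute masking: zero out each feature dimension with probability attr_mask_p.
    mask = torch.bernoulli((1 - attr_mask_p) * torch.ones(x.size(1), device=x.device))
    x_aug = x * mask                        # Hadamard product with the random mask
    # Edge dropping: keep each edge with probability 1 - edge_drop_p.
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= edge_drop_p
    return x_aug, edge_index[:, keep]
```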
3.2.2. Node contrast
After generating the two graph views, we employ the node contrast and the community contrast simultaneously to learn the node representations. We introduce a contrastive loss based on InfoNCE (Oord et al., 2018) to contrast node-node pairs:
(13)
We apply the node contrast symmetrically for the two graph views:
(14)
which distinguishes negative pairs in the two views and maximizes the consistency between positive pairs (Lin et al., 2021).
3.2.3. Community contrast
The community contrast is based on the result of the dense community aggregation in Sec. 3.1. First, we obtain the community centers by training the randomly initialized community centroid matrix with Eq. 10. Second, we adopt a re-weighted cross-contrastive objective that contrasts the node representations of one view with the community centroids of the other view. We are inspired by the cross-prediction scheme in (Caron et al., 2020) and introduce the cross-contrast into the InfoNCE objective (Oord et al., 2018) to maximize the community consistency between the two views, enforcing that the node representations in one community are pulled close to those of the counterpart community in the other view. Formally, the community contrast is computed by
(15)
where the $i$-th node is assigned to the $k$-th community. Here, the RBF weight function (slightly different from the Gaussian RBF similarity in Eq. 2) gives more attention to the more similar community pairs. In this objective, the similarities of node representations within the same community are maximized, since they are contrasted positively against the same centroid, while node representations in different communities are separated by the negative contrast. Similarly, we compute the contrastive objective symmetrically for the two generated graph views:
(16)
where the two terms are computed from the node representation matrices of view 1 and view 2, respectively.
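As one possible instantiation of this cross-contrast (not the paper's exact Eq. 15), the sketch below contrasts the node representations of view 1 against the centroids of view 2, with an RBF weight computed per node-centroid pair; the exact reweighting and normalization may differ:

```python
import torch
import torch.nn.functional as F

def community_contrast(h1, c2, assign1, tau=0.5, sigma=1.0):
    """One direction of the reweighted cross-contrast (a sketch under assumptions):
    each node in view 1 is pulled toward the centroid of its assigned community in
    view 2 and pushed away from the other centroids, with an RBF weight per pair."""
    # h1: [N, d] node reps of view 1; c2: [K, d] centroids of view 2;
    # assign1: [N] long tensor of community indices (e.g. argmax of R).
    sims = torch.exp(F.cosine_similarity(h1.unsqueeze(1), c2.unsqueeze(0), dim=-1) / tau)  # [N, K]
    pos = sims[torch.arange(h1.size(0)), assign1]
    weight = torch.exp(-((h1 - c2[assign1]) ** 2).sum(dim=-1) / (2 * sigma ** 2))
    return -(weight * torch.log(pos / sims.sum(dim=1))).mean()
```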
3.2.4. Joint objective
We propose a decaying coefficient to combine the node contrast, the dense community aggregation objective and the community contrast into a joint objective:
(17)
where the coefficient decays smoothly as training proceeds (it is a function of the training epoch). We observe that the community partition stabilizes within a few hundred epochs of training (see Sec. 4.3), while training the whole model usually takes thousands of epochs. We therefore apply the decay to focus mainly on training the community partition at first, and gradually shift the focus to learning the node representations.
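A minimal sketch of such a decaying combination; the exponential schedule and the decay rate are assumptions rather than the paper's exact Eq. 17:

```python
import math

def joint_loss(loss_node, loss_deca, loss_comm, epoch, decay_rate=1e-3):
    """Combine the three objectives with a smoothly decaying coefficient (a sketch):
    early epochs emphasize the community-detection objective, and the weight then
    shifts toward the two representation-learning losses as training proceeds."""
    coeff = math.exp(-decay_rate * epoch)   # decays smoothly with the epoch index
    return coeff * loss_deca + (1.0 - coeff) * (loss_node + loss_comm)
```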
In summary, the training process is shown in Algorithm 1.
3.3. Adaptation to Multiplex graphs
The proposed framework can be naturally adapted to multiplex graphs with a few modifications to the training and inference process. We apply contrastive training between different graph views, which accounts for the dependence between them.
3.3.1. Training
In multiplex graphs, we no longer need to generate graph views through graph augmentation, since the different views of a multiplex graph are naturally multi-viewed data. We propose to detect communities and learn node representations on each pair of views. The modified training process is shown in Algorithm 2.
3.3.2. Inference
At inference time, we propose to combine the classification results of each view by classifier fusion (a post-ensemble fashion): given the results of the independent view-specific classifiers, we score each prediction by the confidence of its classifier (i.e., the maximal value of the output softmax distribution (Hendrycks and Gimpel, 2016)) and choose the most confident prediction as the final result.
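A minimal sketch of this confidence-based fusion, assuming one logit matrix per view over the same nodes:

```python
import torch

def fuse_predictions(logits_per_view):
    """Post-ensemble classifier fusion (Sec. 3.3.2): for every node, keep the
    prediction of whichever view-specific classifier is most confident, where
    confidence is the maximal value of the output softmax distribution."""
    probs = torch.stack([torch.softmax(l, dim=-1) for l in logits_per_view])  # [V, N, C]
    conf, preds = probs.max(dim=-1)          # per-view confidence and predicted class
    best_view = conf.argmax(dim=0)           # most confident view for each node
    return preds[best_view, torch.arange(preds.size(1))]
```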
4. Experiments
In this section, we evaluate the proposed model with respect to diverse down-stream tasks on multiple real graphs. We introduce the specifications of experiments in Sec. 4.1, and provide detailed quantitative and visual results in Sec. 4.2 and Sec. 4.3 respectively.
4.1. Experimental Setup
4.1.1. Datasets
We use 6 real graphs, including Amazon-Computers, Amazon-Photo, Coauthor-CS, WikiCS, IMDB and DBLP, to evaluate the performance on node classification and node clustering. The detailed statistics of all datasets are summarized in Table 2.
- Amazon-Computers and Amazon-Photo (Shchur et al., 2018) are two graphs of co-purchase relations, where nodes are products and there is an edge between two products when they are bought together. Each node has a sparse bag-of-words attribute vector encoding product reviews.
- Coauthor-CS (Shchur et al., 2018) is an academic network based on the Microsoft Academic Graph. The nodes are authors and the edges are co-author relations. Each node has a sparse bag-of-words attribute vector encoding the paper keywords of the author.
- WikiCS (Mernyei and Cangea, 2020) is a reference network of Wikipedia articles on Computer Science, where edges are hyperlinks between articles and the node attributes are GloVe word embeddings (Pennington et al., 2014) of the article texts.
- IMDB (Wang et al., 2019) is a movie network with two types of links: co-actor and co-director. The attribute vector of each node is a bag-of-words feature of movie plots. The nodes are labeled with Action, Comedy or Drama.
- DBLP (Wang et al., 2019) is a paper graph with three types of links: co-author, co-paper and co-term. The attribute vector of each node is a bag-of-words feature vector of paper abstracts.
Dataset | Type | Edges | Nodes | Attributes | Classes |
---|---|---|---|---|---|
Amazon-Computers¹ | co-purchase | 245,778 | 13,381 | 767 | 10 |
Amazon-Photo¹ | co-purchase | 119,043 | 7,487 | 745 | 8 |
Coauthor-CS¹ | co-author | 81,894 | 18,333 | 6,805 | 15 |
WikiCS² | reference | 216,123 | 11,701 | 300 | 10 |
IMDB³ | co-actor | 66,428 | 3,550 | 2,000 | 3 |
IMDB³ | co-director | 13,788 | | | |
DBLP⁴ | co-author | 144,738 | 7,907 | 2,000 | 4 |
DBLP⁴ | co-paper | 90,145 | | | |
DBLP⁴ | co-term | 57,137,515 | | | |

¹ https://github.com/shchur/gnn-benchmark/raw/master/data/npz/
² https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset
³ https://www.imdb.com/
⁴ https://dblp.uni-trier.de/
4.1.2. Evaluation protocol
We follow the evaluations in (Zhu et al., 2021), where each graph encoder is trained in a self-supervised way. Then, the encoded representations are used to train a classifier for node classification, and fit a K-means model for comparing baselines on node clustering. We train each graph encoder for five runs with randomly selected data splits (for WikiCS, we use the given 20 data splits) and report the average performance with the standard deviation.
For the node classification task, we measure the performance with the Micro-F1 and Macro-F1 scores. For the node clustering task, we measure the performance with the Normalized Mutual Information (NMI) score, $\mathrm{NMI} = 2\,I(\hat{Y}; Y)\,/\,[H(\hat{Y}) + H(Y)]$, where $\hat{Y}$ and $Y$ refer to the predicted cluster indexes and the class labels respectively, and the Adjusted Rand Index, $\mathrm{ARI} = (\mathrm{RI} - \mathbb{E}[\mathrm{RI}])\,/\,(\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}])$, where $\mathrm{RI}$ is the Rand Index (Rand, 1971), which measures the similarity between the predicted cluster indexes and the class labels.
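For reference, a sketch of the clustering evaluation with the scikit-learn implementations of K-means, NMI and ARI (argument names are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_scores(embeddings, labels, num_classes):
    """Fit K-means on the learned node representations and compare the predicted
    cluster indexes with the class labels (a sketch of the evaluation protocol)."""
    pred = KMeans(n_clusters=num_classes, n_init=10).fit_predict(embeddings)
    return (normalized_mutual_info_score(labels, pred),
            adjusted_rand_score(labels, pred))
```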
4.1.3. Baselines
We compare the two variants of our method (one using the exponent cosine similarity and one using the Gaussian RBF similarity) with two sets of baselines:
- On single-view graphs, we consider as baselines 1) traditional methods including node2vec (Grover and Leskovec, 2016) and DeepWalk (Perozzi et al., 2014), 2) supervised methods including GCN (Kipf and Welling, 2016a), and 3) unsupervised methods including MVGRL (Hassani and Khasahmadi, 2020), DGI (Veličković et al., 2018b), HDI (Jing et al., 2021a), graph autoencoders (GAE and VGAE) (Kipf and Welling, 2016b) and GCA (Zhu et al., 2021).
- On multiplex graphs, we consider as baselines 1) methods with single-view representations including node2vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014), GCN (Kipf and Welling, 2016a) and DGI (Veličković et al., 2018b), and 2) methods with multi-view representations including CMNA (Chu et al., 2019), MNE (Yang et al., 2019), HAN (Wang et al., 2019), DMGI (Park et al., 2020) and HDMI (Jing et al., 2021a).
Additionally, we compare our dense community aggregation (with its exponent cosine similarity and Gaussian RBF similarity variants) against different clustering baselines, including K-means, spectral biclustering (SBC) (Kluger et al., 2003) and modularity (Newman and Girvan, 2004), to show its effectiveness.
4.1.4. Implementation
We use a 2-layer GCN (Kipf and Welling, 2016a) as the graph encoder for each deep learning baseline. We use the Adam optimizer (Kingma and Ba, 2015) to optimize each model. Before applying the objective, we pass the representations of the two graph views through a shared projection module (a 2-layer MLP) with the same input, hidden-layer and output dimensions. The number of communities is set according to the number of classes in each graph. More implementation details are listed in Appendix A.
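For concreteness, a minimal sketch of such a 2-layer GCN encoder in PyTorch Geometric (the dimensions and activation are placeholders; the per-dataset values are listed in Table 9):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Encoder(torch.nn.Module):
    """A 2-layer GCN graph encoder (a sketch of the encoder used for all deep baselines)."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        h = F.rrelu(self.conv1(x, edge_index))   # activation is dataset-dependent (Table 9)
        return self.conv2(h, edge_index)
```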
4.2. Quantitative Results
4.2.1. Overall performance
We present the quantitative results of node classification on single-view graphs (Table 3), node clustering on single-view graphs (Table 4), and node classification on multiplex graphs (Table 5). Our method outperforms the state-of-the-art methods on multiple tasks and even outperforms the supervised baselines.
Dataset | Amazon-Computers | Amazon-Photo | Coauthor-CS | WikiCS | ||||
---|---|---|---|---|---|---|---|---|
Metric | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 |
RawFeatures | 73.82±0.01 | 70.10±0.04 | 78.45±0.04 | 76.10±0.01 | 90.40±0.02 | 89.01±0.06 | 72.00±0.03 | 70.28±0.09 |
Node2vec | 84.38±0.08 | 82.65±0.08 | 89.72±0.08 | 87.39±0.07 | 85.11±0.06 | 82.93±0.11 | 71.84±0.09 | 70.44±0.03 |
DeepWalk | 85.63±0.09 | 84.02±0.10 | 89.36±0.10 | 86.92±0.02 | 84.71±0.23 | 82.63±0.19 | 74.25±0.06 | 72.68±0.15 |
GCN | 86.43±0.56 | 83.99±0.61 | 92.51±0.23 | 90.47±0.32 | 93.04±0.28 | 91.02±0.38 | 77.11±0.08 | 75.61±0.19 |
MVGRL | 87.42±0.07 | 85.92±0.11 | 91.74±0.09 | 89.93±0.09 | 92.11±0.10 | 90.50±0.12 | 77.50±0.08 | 75.62±0.00 |
DGI | 83.88±0.50 | 79.30±0.42 | 91.60±0.24 | 89.31±0.16 | 92.08±0.68 | 90.78±0.68 | 75.35±0.17 | 73.74±0.20 |
HDI | 85.43±0.13 | 80.74±0.25 | 90.09±0.10 | 88.70±0.16 | 89.98±0.14 | 86.73±0.17 | 75.72±0.55 | 68.05±0.80 |
GAE | 85.18±0.21 | 83.33±0.17 | 91.68±0.14 | 89.66±0.09 | 90.00±0.75 | 88.31±0.68 | 70.17±0.05 | 68.27±0.05 |
VGAE | 86.44±0.25 | 83.72±0.12 | 92.24±0.08 | 90.04±0.17 | 92.08±0.08 | 90.11±0.06 | 75.56±0.20 | 74.12±0.10 |
GCA | 87.67±0.49 | 85.88±0.30 | 92.39±0.21 | 91.05±0.13 | 92.87±0.03 | 90.76±0.01 | 78.24±0.01 | 74.47±0.02 |
Ours (exponent cosine) | 88.85±0.14 | 87.42±0.28 | 93.18±0.12 | 92.05±0.17 | 93.32±0.02 | 91.65±0.03 | 78.74±0.04 | 75.92±0.06 |
Ours (Gaussian RBF) | 88.74±0.09 | 87.53±0.26 | 92.79±0.17 | 91.57±0.29 | 93.31±0.01 | 91.63±0.03 | 78.74±0.03 | 75.88±0.02 |
Dataset | Amazon-Computers | Amazon-Photo | Coauthor-CS | WikiCS | ||||
---|---|---|---|---|---|---|---|---|
Metric | NMI | ARI | NMI | ARI | NMI | ARI | NMI | ARI |
MVGRL | 0.244±0.000 | 0.141±0.001 | 0.344±0.040 | 0.239±0.039 | 0.740±0.010 | 0.627±0.009 | 0.263±0.010 | 0.102±0.011 |
DGI | 0.318±0.020 | 0.165±0.020 | 0.376±0.030 | 0.264±0.030 | 0.747±0.010 | 0.629±0.011 | 0.310±0.020 | 0.131±0.018 |
HDI | 0.347±0.011 | 0.216±0.006 | 0.429±0.014 | 0.307±0.011 | 0.726±0.008 | 0.607±0.016 | 0.238±0.002 | 0.105±0.000 |
GAE | 0.441±0.000 | 0.258±0.000 | 0.616±0.010 | 0.494±0.008 | 0.731±0.010 | 0.614±0.010 | 0.243±0.020 | 0.095±0.018 |
VGAE | 0.423±0.000 | 0.238±0.001 | 0.530±0.040 | 0.373±0.041 | 0.733±0.000 | 0.618±0.001 | 0.261±0.010 | 0.082±0.008 |
GCA | 0.426±0.001 | 0.246±0.001 | 0.614±0.000 | 0.494±0.000 | 0.735±0.008 | 0.618±0.010 | 0.299±0.002 | 0.121±0.003 |
Ours (exponent cosine) | 0.452±0.000 | 0.284±0.000 | 0.619±0.000 | 0.508±0.000 | 0.750±0.021 | 0.624±0.083 | 0.318±0.011 | 0.145±0.024 |
Ours (Gaussian RBF) | 0.474±0.018 | 0.277±0.024 | 0.632±0.000 | 0.524±0.001 | 0.753±0.005 | 0.621±0.007 | 0.322±0.009 | 0.140±0.010 |
Dataset | IMDB | DBLP | ||
---|---|---|---|---|
Metric | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 |
node2vec | 0.550 | 0.533 | 0.547 | 0.543 |
DeepWalk | 0.550 | 0.532 | 0.537 | 0.533 |
GCN | 0.611 | 0.603 | 0.717 | 0.734 |
DGI | 0.606 | 0.598 | 0.720 | 0.723 |
CMNA | 0.566 | 0.549 | 0.561 | 0.566 |
MNE | 0.574 | 0.552 | 0.562 | 0.566 |
HAN | 0.607 | 0.599 | 0.708 | 0.716 |
DMGI | 0.648 | 0.648 | 0.766 | 0.771 |
HDMI | 0.658 | 0.650 | 0.811 | 0.820 |
Ours (exponent cosine) | 0.672 | 0.670 | 0.832 | 0.840 |
Ours (Gaussian RBF) | 0.671 | 0.668 | 0.832 | 0.839 |
Moreover, the proposed dense community aggregation (used here to cluster nodes based on its community partition) outperforms traditional clustering methods and the classical modularity (Newman and Girvan, 2004) on the encoded representations (shown in Table 6), which illustrates that it is more suitable for deep-learning-based clustering and addresses the instability of modularity.
Dataset | Amazon-Computers | Amazon-Photo | Coauthor-CS | WikiCS | ||||
---|---|---|---|---|---|---|---|---|
Metric | NMI | ARI | NMI | ARI | NMI | ARI | NMI | ARI
K-means | 0.192±0.019 | 0.086±0.013 | 0.225±0.012 | 0.121±0.006 | 0.497±0.004 | 0.315±0.008 | 0.047±0.010 | 0.022±0.002 |
SBC | 0.160±0.000 | 0.088±0.000 | 0.100±0.000 | 0.020±0.000 | 0.415±0.000 | 0.261±0.000 | 0.039±0.000 | 0.024±0.000 |
modularity | 0.204±0.036 | 0.118±0.038 | 0.233±0.027 | 0.124±0.032 | 0.447±0.017 | 0.280±0.022 | 0.139±0.063 | 0.083±0.052 |
Ours (exponent cosine) | 0.213±0.009 | 0.114±0.018 | 0.237±0.024 | 0.126±0.032 | 0.476±0.015 | 0.305±0.020 | 0.140±0.060 | 0.080±0.055 |
Ours (Gaussian RBF) | 0.363±0.009 | 0.262±0.010 | 0.384±0.022 | 0.273±0.025 | 0.523±0.016 | 0.358±0.018 | 0.210±0.041 | 0.136±0.060 |
4.2.2. Ablation study
We compare different combinations of the node contrast (Eq. 14), the dense community aggregation objective (Eq. 10) and the community contrast (Eq. 16) on WikiCS in terms of node classification and node clustering. The results are listed in Table 7. We find that, first, applying the node contrast is crucial to the performance (the combinations that include it outperform the others), since it makes the representations more distinguishable; second, the community contrast is beneficial to the overall performance, since it injects community information into the contrastive training; and third, the benefit of the community contrast is boosted significantly when it is combined with dense community aggregation. Moreover, we also show a visual ablation study in Table 8.
Objective combination | Micro-F1 (%) | Macro-F1 (%) | NMI | ARI |
---|---|---|---|---|
– | 75.09±0.29 | 69.83±0.24 | 0.254±0.005 | 0.120±0.007 |
– | 75.71±0.58 | 68.52±0.55 | 0.228±0.015 | 0.127±0.030 |
– | 65.78±0.29 | 57.25±0.36 | 0.191±0.001 | 0.102±0.001 |
– | 63.48±0.67 | 53.78±0.90 | 0.188±0.007 | 0.085±0.005 |
– | 77.29±0.20 | 74.19±0.23 | 0.273±0.012 | 0.125±0.020 |
– | 77.47±0.28 | 74.43±0.30 | 0.274±0.003 | 0.139±0.001 |
– | 72.88±0.21 | 69.97±0.24 | 0.222±0.020 | 0.120±0.030 |
– | 68.18±0.14 | 64.88±0.16 | 0.227±0.012 | 0.123±0.034 |
Full (exponent cosine) | 78.74±0.04 | 75.92±0.06 | 0.318±0.011 | 0.145±0.024 |
Full (Gaussian RBF) | 78.74±0.03 | 75.88±0.02 | 0.322±0.009 | 0.140±0.010 |
4.3. Visual Evaluations
We illustrate the significance of dense community aggregation by visualizing the edge density and the class entropy of the assigned communities. We evaluate each checkpoint five times and report the means and deviations. We compare the results with traditional clustering methods (K-means and spectral biclustering (Kluger et al., 2003)) and the classical modularity (Newman and Girvan, 2004). We also visualize the node representations for the ablation study.
4.3.1. Edge density
The edge density is based on Eq. 5 and is computed as the average density over all communities:
$$\overline{\mathrm{den}} = \frac{1}{|\mathcal{C}|}\sum_{k} \mathrm{den}(C_k) \qquad (18)$$
where $|\mathcal{C}|$ is the number of communities. It is used to measure how well our method learns a community partition that maximizes the intra-community density (see Sec. 3.1). Fig. 4 shows that, after a few hundred epochs, our method stably outperforms the other clustering methods.

4.3.2. Class entropy
The class entropy measures the homogeneity of the class labels within a community (the extent to which a community contains one major class, i.e., has low entropy). We argue that a good community partition should distinguish structurally separated nodes, which, in other words, means distinguishing nodes of different classes. The class entropy is computed as the average entropy of the class labels over all communities:
$$H_{\mathrm{class}} = \frac{1}{|\mathcal{C}|}\sum_{k}\Big(-\sum_{c} p_{k,c}\log p_{k,c}\Big) \qquad (19)$$
where $p_{k,c}$ is the occurrence frequency of class $c$ in the $k$-th community. Fig. 5 shows that, after a few hundred epochs, our method stably outperforms the other clustering methods.
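A minimal sketch of this measure, assuming integer class labels and community assignments per node (names are illustrative):

```python
import numpy as np

def class_entropy(labels, communities):
    """Average entropy of the class labels within each assigned community (Eq. 19).
    `labels` and `communities` are non-negative integer arrays over the nodes."""
    ents = []
    for k in np.unique(communities):
        member_labels = labels[communities == k]
        freq = np.bincount(member_labels) / len(member_labels)
        freq = freq[freq > 0]                      # ignore classes absent from the community
        ents.append(-(freq * np.log(freq)).sum())
    return float(np.mean(ents))
```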

4.3.3. Visualization of node representations.
To see how the node representations are distributed, we use t-SNE (Van der Maaten and Hinton, 2008) to reduce their dimension for visualization. The node representations of each class are distributed more separately when the community-related objectives (dense community aggregation and the community contrast) are applied, which illustrates the effectiveness of our proposed method. The results are shown in Table 8.
Table 8. t-SNE visualizations of the node representations under different objective combinations, including the full model, with one row for the exponent cosine similarity variant and one row for the Gaussian RBF similarity variant.
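A minimal sketch of the visualization step, assuming node representations and class labels as arrays (names are illustrative):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(embeddings, labels):
    """Project node representations to 2-D with t-SNE and color them by class
    (a sketch of the visualization used for the visual ablation study)."""
    coords = TSNE(n_components=2).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="tab10")
    plt.axis("off")
    plt.show()
```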
5. Related Works
In this section, we briefly review the related works, including graph contrastive learning (Sec. 5.1) and community detection (Sec. 5.2).
5.1. Graph Contrastive Learning
Graph contrastive learning (Li et al., 2019) pulls similar nodes into close positions in the representation space while pushing dissimilar nodes apart. (Liu et al., 2021; Wu et al., 2021) comprehensively review the development of GCL. GRACE (Zhu et al., 2020) proposed a simplified framework to maximize the agreement between two graph views. GCA (Zhu et al., 2021) proposed an adaptive augmentation algorithm to generate graph views. GCC (Qiu et al., 2020) designed a pre-training task to learn structural graph representations. HeCo (Wang et al., 2021b) employed network-schema and meta-path views to perform contrastive training. DGI (Veličković et al., 2018b), GMI (Peng et al., 2020) and HDI (Jing et al., 2021a) randomly shuffle node attributes to generate negative samples. MVGRL (Hassani and Khasahmadi, 2020) employed graph diffusion to transform graphs.
5.2. Community Detection
Community detection is a primary approach to understand how structure relates to high-order information inherent to the graph. (Huang et al., 2021) reviewed the recent progress in community detection. CommDGI (Zhang et al., 2020c) employed modularity for community detection. (Li et al., 2020) introduced the identifiability of communities. RWM (Luo et al., 2020) used random walk to detect local communities. Destine (Xu et al., 2021; Xu, 2021) detects dense subgraphs on multi-layered networks.
6. Conclusion
In this paper, we propose a novel Graph Communal Contrastive Learning framework that learns a partition into structurally related communities with a Dense Community Aggregation algorithm and learns the graph representations with a reweighted self-supervised cross-contrastive training scheme that takes the community structures into account. The proposed framework consistently achieves state-of-the-art performance on multiple tasks and can be naturally adapted to multiplex graphs. We show that community information is beneficial to the overall performance of graph representation learning.
Acknowledgements.
B. Jing and H. Tong are partially supported by National Science Foundation under grant No. 1947135, the NSF Program on Fairness in AI in collaboration with Amazon under award No. 1939725, NIFA 2020-67021-32799, and Army Research Office (W911NF2110088). The content of the information in this document does not necessarily reflect the position or the policy of the Government or Amazon, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References
- Becker and Hinton (1992) Suzanna Becker and Geoffrey E Hinton. 1992. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355, 6356 (1992), 161–163.
- Bhagat et al. (2011) Smriti Bhagat, Graham Cormode, and S Muthukrishnan. 2011. Node classification in social networks. In Social network data analytics. Springer, 115–148.
- Bugmann (1998) Guido Bugmann. 1998. Normalized Gaussian radial basis function networks. Neurocomputing 20, 1-3 (1998), 97–110.
- Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS).
- Cen et al. (2019) Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Representation learning for attributed multiplex heterogeneous network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1358–1368.
- Chang et al. (2015) Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. 2015. Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 119–128.
- Chu et al. (2019) Xiaokai Chu, Xinxin Fan, Di Yao, Zhihua Zhu, Jianhui Huang, and Jingping Bi. 2019. Cross-network embedding for multi-network alignment. In The world wide web conference. 273–284.
- Chuang et al. (2020) Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased Contrastive Learning. In NeurIPS.
- Cover and Thomas (1991) Thomas M Cover and Joy A Thomas. 1991. Information theory and statistics. Elements of Information Theory 1, 1 (1991), 279–335.
- De Domenico et al. (2013) Manlio De Domenico, Albert Solé-Ribalta, Emanuele Cozzo, Mikko Kivelä, Yamir Moreno, Mason A Porter, Sergio Gómez, and Alex Arenas. 2013. Mathematical formulation of multilayer networks. Physical Review X 3, 4 (2013), 041022.
- Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019).
- Fu et al. (2020) Dongqi Fu, Zhe Xu, Bo Li, Hanghang Tong, and Jingrui He. 2020. A View-Adversarial Framework for Multi-View Network Embedding. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2025–2028.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
- Hafidi et al. (2020) Hakim Hafidi, Mounir Ghogho, Philippe Ciblat, and Ananthram Swami. 2020. Graphcl: Contrastive self-supervised learning of graph representations. arXiv preprint arXiv:2007.08025 (2020).
- Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
- Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning. PMLR, 4116–4126.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016).
- Horn (1990) Roger A Horn. 1990. The hadamard product. In Proc. Symp. Appl. Math, Vol. 40. 87–169.
- Hu et al. (2019) Fenyu Hu, Yanqiao Zhu, Shu Wu, Liang Wang, and Tieniu Tan. 2019. Hierarchical Graph Convolutional Networks for Semi-supervised Node Classification. In IJCAI.
- Huang et al. (2021) Xinyu Huang, Dongming Chen, Tao Ren, and Dongqi Wang. 2021. A survey of community detection methods in multilayer networks. Data Mining and Knowledge Discovery 35, 1 (2021), 1–45.
- Jing et al. (2021a) Baoyu Jing, Chanyoung Park, and Hanghang Tong. 2021a. Hdmi: High-order deep multiplex infomax. In Proceedings of the Web Conference 2021. 2414–2424.
- Jing et al. (2021b) Baoyu Jing, Hanghang Tong, and Yada Zhu. 2021b. Network of Tensor Time Series. In Proceedings of the Web Conference 2021. 2425–2437.
- Jing et al. (2021c) Baoyu Jing, Yuejia Xiang, Xi Chen, Yu Chen, and Hanghang Tong. 2021c. Graph-MVP: Multi-View Prototypical Contrastive Learning for Multiplex Graphs. arXiv preprint arXiv:2109.03560 (2021).
- Jing et al. (2021d) Baoyu Jing, Zeyu You, Tao Yang, Wei Fan, and Hanghang Tong. 2021d. Multiplex Graph Neural Network for Extractive Text Summarization. arXiv preprint arXiv:2108.12870 (2021).
- Kang et al. (2020) Jian Kang, Jingrui He, Ross Maciejewski, and Hanghang Tong. 2020. Inform: Individual fairness on graph mining. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 379–389.
- Kang and Tong (2021) Jian Kang and Hanghang Tong. 2021. Fair Graph Mining. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4849–4852.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster).
- Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
- Kluger et al. (2003) Yuval Kluger, Ronen Basri, Joseph T Chang, and Mark Gerstein. 2003. Spectral biclustering of microarray data: coclustering genes and conditions. Genome research 13, 4 (2003), 703–716.
- Li and Zhang (2020) Chaoyi Li and Yangsen Zhang. 2020. A personalized recommendation algorithm based on large-scale real micro-blog data. Neural Computing and Applications 32, 15 (2020), 11245–11252.
- Li et al. (2020) Hui-Jia Li, Lin Wang, Yan Zhang, and Matjaž Perc. 2020. Optimization of identifiability for efficient community detection. New Journal of Physics 22, 6 (2020), 063035.
- Li et al. (2019) Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning. PMLR, 3835–3845.
- Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 7 (2007), 1019–1031.
- Lin et al. (2021) Shuai Lin, Pan Zhou, Zi-Yuan Hu, Shuojia Wang, Ruihui Zhao, Yefeng Zheng, Liang Lin, Eric Xing, and Xiaodan Liang. 2021. Prototypical Graph Contrastive Learning. arXiv preprint arXiv:2106.09645 (2021).
- Liu et al. (2020) Fanzhen Liu, Shan Xue, Jia Wu, Chuan Zhou, Wenbin Hu, Cecile Paris, Surya Nepal, Jian Yang, and S Yu Philip. 2020. Deep learning for community detection: progress, challenges and opportunities. In 29th International Joint Conference on Artificial Intelligence, IJCAI 2020. International Joint Conferences on Artificial Intelligence, 4981–4987.
- Liu et al. (2021) Yixin Liu, Shirui Pan, Ming Jin, Chuan Zhou, Feng Xia, and Philip S Yu. 2021. Graph self-supervised learning: A survey. arXiv preprint arXiv:2103.00111 (2021).
- Luo et al. (2020) Dongsheng Luo, Yuchen Bian, Yaowei Yan, Xiao Liu, Jun Huan, and Xiang Zhang. 2020. Local community detection in multiple networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 266–274.
- Ma et al. (2018) Yao Ma, Zhaochun Ren, Ziheng Jiang, Jiliang Tang, and Dawei Yin. 2018. Multi-dimensional network embedding with hierarchical structure. In Proceedings of the eleventh ACM international conference on web search and data mining. 387–395.
- Mernyei and Cangea (2020) Péter Mernyei and Cătălina Cangea. 2020. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901 (2020).
- Newman (2018) Mark Newman. 2018. Networks. Oxford university press.
- Newman and Girvan (2004) Mark EJ Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical review E 69, 2 (2004), 026113.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Park et al. (2020) Chanyoung Park, Donghyun Kim, Jiawei Han, and Hwanjo Yu. 2020. Unsupervised attributed multiplex network embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5371–5378.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.
- Peng et al. (2020) Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020. 259–270.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 701–710.
- Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1150–1160.
- Rahutomo et al. (2012) Faisal Rahutomo, Teruaki Kitasuka, and Masayoshi Aritsugi. 2012. Semantic cosine similarity. In The 7th International Student Conference on Advanced Science and Technology ICAST, Vol. 4. 1.
- Rand (1971) William M Rand. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66, 336 (1971), 846–850.
- Schaeffer (2007) Satu Elisa Schaeffer. 2007. Graph clustering. Computer science review 1, 1 (2007), 27–64.
- Scott (1988) John Scott. 1988. Social network analysis. Sociology 22, 1 (1988), 109–127.
- Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
- Shi et al. (2018) Yu Shi, Fangqiu Han, Xinwei He, Xinran He, Carl Yang, Jie Luo, and Jiawei Han. 2018. mvn2vec: Preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597 (2018).
- Sun et al. (2020) Fan-Yun Sun, Jordon Hoffman, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In International Conference on Learning Representations.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Veličković et al. (2018a) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018a. Graph Attention Networks. In International Conference on Learning Representations.
- Veličković et al. (2018b) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018b. Deep Graph Infomax. In International Conference on Learning Representations.
- Wang et al. (2017) Dong Wang, Jiexun Li, Kaiquan Xu, and Yizhen Wu. 2017. Sentiment community detection: exploring sentiments and relationships in social networks. Electronic Commerce Research 17, 1 (2017), 103–132.
- Wang et al. (2021a) Ruijie Wang, Zijie Huang, Shengzhong Liu, Huajie Shao, Dongxin Liu, Jinyang Li, Tianshi Wang, Dachun Sun, Shuochao Yao, and Tarek Abdelzaher. 2021a. DyDiff-VAE: A Dynamic Variational Framework for Information Diffusion Prediction. 163–172.
- Wang et al. (2019) Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. In The World Wide Web Conference. 2022–2032.
- Wang et al. (2021b) Xiao Wang, Nian Liu, Hui Han, and Chuan Shi. 2021b. Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning. arXiv preprint arXiv:2105.09111 (2021).
- Wu et al. (2021) Lirong Wu, Haitao Lin, Zhangyang Gao, Cheng Tan, Stan Li, et al. 2021. Self-supervised on Graphs: Contrastive, Generative, or Predictive. arXiv preprint arXiv:2105.07342 (2021).
- Xiao et al. (2020) Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. 2020. What Should Not Be Contrastive in Contrastive Learning. In International Conference on Learning Representations.
- Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations.
- Xu (2021) Zhe Xu. 2021. Dense subgraph detection on multi-layered networks. Ph.D. Dissertation.
- Xu et al. (2021) Zhe Xu, Si Zhang, Yinglong Xia, Liang Xiong, Jiejun Xu, and Hanghang Tong. 2021. DESTINE: Dense Subgraph Detection on Multi-Layered Networks. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3558–3562.
- Yan et al. (2021) Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing, and Hanghang Tong. 2021. Dynamic Knowledge Graph Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4564–4572.
- Yang et al. (2019) Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. 2019. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4066–4075.
- You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812–5823.
- Zhang et al. (2020b) Hanlin Zhang, Shuai Lin, Weiyang Liu, Pan Zhou, Jian Tang, Xiaodan Liang, and Eric P Xing. 2020b. Iterative graph self-distillation. arXiv preprint arXiv:2010.12609 (2020).
- Zhang et al. ([n.d.]) Hongming Zhang, Liwei Qiu, Lingling Yi, and Yangqiu Song. [n.d.]. Scalable Multiplex Network Embedding.
- Zhang et al. (2020a) Shichang Zhang, Ziniu Hu, Arjun Subramonian, and Yizhou Sun. 2020a. Motif-driven contrastive learning of graph representations. arXiv preprint arXiv:2012.12533 (2020).
- Zhang et al. (2020c) Tianqi Zhang, Yun Xiong, Jiawei Zhang, Yao Zhang, Yizhu Jiao, and Yangyong Zhu. 2020c. CommDGI: Community Detection Oriented Deep Graph Infomax. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1843–1852.
- Zhu et al. (2017) Jiang Zhu, Bai Wang, Bin Wu, and Weiyu Zhang. 2017. Emotional community detection in social network. IEICE Transactions on Information and Systems 100, 10 (2017), 2515–2525.
- Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131 (2020).
- Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.
Appendix A Implementation Details
We conduct all of our experiments with PyTorch Geometric 1.7.2 (Fey and Lenssen, 2019) and PyTorch 1.8.1 (Paszke et al., 2019) on an NVIDIA Tesla V100 GPU (with 32 GB of video memory). All single-view graph datasets used in the experiments are available in the PyTorch Geometric library (https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html), and all multiplex graph datasets are available at https://www.dropbox.com/s/48oe7shjq0ih151/data.tar.gz?dl=0. Our hyperparameter specifications are listed in Table 9. The hyperparameters are determined empirically by grid search based on the settings of (Zhu et al., 2021).
Hyperparameter | Amazon-Computers | Amazon-Photo | Coauthor-CS | WikiCS | IMDB | DBLP |
---|---|---|---|---|---|---|
Training epochs | 2900 | | 600 | 1000 | 50 | 50 |
Hidden-layer dimension | 2558 | 256 | 7105 | 1368 | 8096 | 4256 |
Representation dimension | 512 | 128 | 300 | 384 | 2048 | 128 |
Learning rate | 0.01 | 0.1 | 0.0005 | 0.01 | 0.0001 | 0.001 |
Activation function | RReLU | ReLU | RReLU | PReLU | ReLU | RReLU |
– | 1000 | 500 | 500 | 500 | 50 | 50 |
– | 0.2 | 0.3 | 0.4 | 0.4 | 1.0 | 1.0 |
– | 0.9 | 0.1 | 0.9 | 1.0 | 0.5 | 0.5 |
– | 2e-5 | 8e-5 | 8e-5 | 8e-5 | 2e-4 | 2e-4 |
– | 0.25 | 0.1 | 0.3 | 0.1 | 0.1 | 0.1 |
– | 0.45 | 0.4 | 0.2 | 0.2 | 0.0 | 0.0 |
Appendix B Number of communities
We compare varying numbers of communities in the vicinity of the number of classes on each single-view graph dataset (see Fig. 6). Although the performance remains stable for nearly all settings, we recommend setting the number of communities according to the actual number of classes.

Appendix C Parallelization of Dense Community Aggregation
In dense community aggregation, the intra-community density objective is computationally costly and hard to vectorize, since the expected edge density depends on the community choice. To address this problem, we derive a lower bound of the intra-community density:
(20)
where the extended adjacency matrix is used and the resulting expression lower-bounds the intra-community density. We use this lower bound in place of the original intra-community density in Eq. 9.
Next, we employ the community density matrix to vectorize the two objectives; its entries naturally accord with the form of Eq. 7 and Eq. 8. Therefore, the two objectives can be re-formalized as
(21)
and
(22)
Finally, the vectorized objective is
(23)
which can be directly computed by the CUDA operators (Paszke et al., 2019).