
Self-supervised Learning on Graphs: Contrastive, Generative, or Predictive

Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z. Li
Abstract

Deep learning on graphs has recently achieved remarkable success on a variety of tasks, while such success relies heavily on the massive and carefully labeled data. However, precise annotations are generally very expensive and time-consuming. To address this problem, self-supervised learning (SSL) is emerging as a new paradigm for extracting informative knowledge through well-designed pretext tasks without relying on manual labels. In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive. More importantly, unlike other surveys that only provide a high-level description of published research, we present an additional mathematical summary of existing works in a unified framework. Furthermore, to facilitate methodological development and empirical comparisons, we also summarize the commonly used datasets, evaluation metrics, downstream tasks, open-source implementations, and experimental study of various algorithms. Finally, we discuss the technical challenges and potential future directions for improving graph self-supervised learning. Latest advances in graph SSL are summarized in a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning.

Index Terms:
Deep Learning, Self-supervised Learning, Graph Neural Networks, Unsupervised Learning, Survey.

1 Introduction

In recent years, deep learning on graphs has emerged as a popular research topic, due to its ability to capture both graph structures and node/edge features. However, most of the works have focused on supervised or semi-supervised learning settings, where the model is trained by specific downstream tasks with abundant labeled data, which are often limited, expensive, and inaccessible. Due to the heavy reliance on the number and quality of labels, these methods are hardly applicable to real-world scenarios, especially those requiring expert knowledge for annotation, such as medicine, meteorology, transportation, etc. More importantly, these methods are prone to suffer from problems of over-fitting, poor generalization, and weak robustness [1].

Recent advances in SSL [2, 3, 4, 5, 6, 7] have provided novel insights into reducing the dependency on annotated labels and enable the training on massive unlabeled data. The primary goal of SSL is to learn transferable knowledge from abundant unlabeled data with well-designed pretext tasks and then generalize the learned knowledge to downstream tasks with specific supervision signals. Recently, SSL has shown its promising capability in the field of computer vision (CV) [2, 3, 4] and natural language processing (NLP) [5, 6, 7]. For example, Moco [2] and SimCLR [3] augment image data by various means and then train Convolutional Neural Networks (CNNs) to capture dependencies between different augmentations. Besides, BERT [5] pre-trains the model with Masked LM and Next Sentence Prediction as pretext tasks, achieving state-of-the-art performance on up to 11 tasks compared to those conventional methods. Inspired by the success of SSL in CV and NLP domains, it is a promising direction to apply SSL to the graph domain to fully exploit graph structure information and rich unlabeled data.

Compared with image and language sequence data, applying SSL to graph data is very important and has promising research prospects. Firstly, along with node features, graph data also contains structure, where extensive pretext tasks can be designed to capture the intrinsic relations of nodes. Secondly, real-world graphs are usually formed by specific rules, e.g., links between atoms in molecular graphs are constrained by valence bond theory. Thus, extensive expert knowledge can be incorporated as priors into the design of pretext tasks. Finally, graph data generally supports transductive learning [8], such as node classification, where the features of Train/Val/Test samples are all available during the training process, which makes it possible to design more feature-related pretext tasks.

(a) Contrastive Method
(b) Generative Method
(c) Predictive Method
Figure 1: A comparison among the contrastive, generative, and predictive methods. (a): the contrastive method contrasts the views generated from different augmentations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot). The information about the differences and sameness between (inter-data) data-data pairs is used as the self-supervision signal. (b): the generative method focuses on the (intra-data) information embedded in the graph, generally based on pretext tasks such as reconstruction, which exploit the attributes and structures of graph data as self-supervision signals. (c): the predictive method generally self-generates labels by simple statistical analysis or expert knowledge, and designs prediction-based pretext tasks based on the self-generated labels to handle the data-label relationship.

Though some works have been proposed recently to apply SSL to graph data and have achieved remarkable success [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], it is still very challenging to deal with the inherent differences between grid-like and graph-structured data. Firstly, the topology of an image is a fixed grid and text is a simple sequence, while graphs are not restricted to these rigid structures. Secondly, unlike the assumption of independent and identically distributed samples for image and text data, nodes in a graph are correlated with each other rather than completely independent. This requires us to design pretext tasks by considering both node attributes and graph structures. Finally, there exists a gap between self-supervised pretext tasks and downstream tasks due to the divergence of their optimization objectives. Inevitably, such divergence will significantly hurt the generalization of learned models. Therefore, it is crucial to reconsider the objectives of pretext tasks to better match those of downstream tasks and make them consistent with each other.

In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of the existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive, as shown in Fig. 1. The core contributions of this survey are as follows:

  • Present comprehensive and up-to-date reviews on existing graph SSL methods and divide them into three categories: contrastive, generative, and predictive, to improve their clarity and accessibility.

  • Summarize the core mathematical ideas of recent research in graph SSL within a unified framework.

  • Summarize the commonly used datasets, evaluation metrics, downstream tasks, open-source codes, and experimental study of surveyed methods, setting the stage for developments of future works.

  • Point out the technical limitations of current research and discuss promising directions on graph SSL.

Compared to the existing surveys on SSL [1], we focus purely on SSL for graph data and present a more mathematical review of the recent progress from 2019 to 2021. Though there have been two surveys on graph SSL, we argue that they are immature works with various flaws and shortcomings. For example, [20] clumsily describes each method in 1-2 sentences, lacking deep insight into the mathematical ideas and implementation details behind them. Moreover, the number of reviewed methods in [21] is fewer than half of ours, as it devotes too much space to less important content but ignores the core of graph SSL, i.e., the design of the pretext task. Compared with [20, 21], our advantages are as follows: (1) more mathematical details, striving to summarize each method with one equation; (2) more implementation details, including statistics of 41 datasets (vs 20 datasets in [20]), evaluation metrics, and open-source code; (3) a more thorough experimental study for fair comparison; (4) a more fine-grained, clarified, and rational taxonomy; (5) more surveyed works, 71 methods (vs 47 methods in [21] and 18 methods in [20]); (6) a more up-to-date review, with almost all surveyed works published after 2019.

2 Problem Statement

2.1 Notions and Definitions

Unless particularly specified, the notations used in this survey are illustrated in Table I.

Definition 1 (Graph): We use g=(𝒱,)g=(\mathcal{V},\mathcal{E}) to denote a graph where 𝒱\mathcal{V} is the set of NN nodes and \mathcal{E} is the set of MM edges. Let vi𝒱v_{i}\in\mathcal{V} denote a node and ei,je_{i,j} denote an edge between node viv_{i} and vjv_{j}. The ll-hop neighborhood of a node viv_{i} is denoted as 𝒩i(l)={vj𝒱|d(vi,vj)l}\mathcal{N}_{i}^{(l)}=\{v_{j}\in\mathcal{V}|d(v_{i},v_{j})\leq l\} where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and vjv_{j}. In particular, the 11-hop neighborhood of a node viv_{i} is denoted as 𝒩i=𝒩i(1)={vj𝒱|ei,j}\mathcal{N}_{i}=\mathcal{N}_{i}^{(1)}=\{v_{j}\in\mathcal{V}|e_{i,j}\in\mathcal{E}\}. The graph structure can also be represented by an adjacency matrix 𝐀[0,1]N×N\mathbf{A}\in[0,1]^{N\times N} with 𝐀i,j=1\mathbf{A}_{i,j}=1 if ei,je_{i,j}\in\mathcal{E} and 𝐀i,j=0\mathbf{A}_{i,j}=0 if ei,je_{i,j}\notin\mathcal{E}.

Definition 2 (Attributed Graph): An attributed graph, as opposed to an unattributed one, refers to a graph whose nodes or edges are associated with their own features (a.k.a. attributes). For example, each node viv_{i} in graph gg may be associated with a feature vector 𝐱id0\mathbf{x}_{i}\in\mathbb{R}^{d_{0}}; such a graph is referred to as an attributed graph g=(𝒱,,𝐗)g=(\mathcal{V},\mathcal{E},\mathbf{X}) or g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), where 𝐗=[𝐱1,𝐱2,,𝐱N]\mathbf{X}=\left[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\right] is the node feature matrix. Meanwhile, an attributed graph g=(𝒱,,𝐗e)g=(\mathcal{V},\mathcal{E},\mathbf{X}^{e}) may have edge attributes 𝐗e\mathbf{X}^{e}, where 𝐗eM×b0\mathbf{X}^{e}\in\mathbb{R}^{M\times b_{0}} is an edge feature matrix with 𝐱i,jeb0\mathbf{x}^{e}_{i,j}\in\mathbb{R}^{b_{0}} being the feature vector of edge ei,je_{i,j}.

Definition 3 (Dynamic Graph): A dynamic graph is a special attributed graph where the node set, graph structure and node attributes may change dynamically over time. The dynamic graph can be formalized as g=(𝒱(t),(t),𝐗(t))g=(\mathcal{V}^{(t)},\mathcal{E}^{(t)},\mathbf{X}^{(t)}) or g=(𝐀(t),𝐗(t))g=(\mathbf{A}^{(t)},\mathbf{X}^{(t)}), where (t)\mathcal{E}^{(t)} represents the edge set at the time step tt and 𝐀i,j(t)=1\mathbf{A}^{(t)}_{i,j}=1 denotes an interaction between node viv_{i} and vjv_{j} at the time step tt (1tT1\leq t\leq T).

Definition 4 (Heterogeneous Graph): Consider a graph g=(𝒱,)g=(\mathcal{V},\mathcal{E}) with a node type mapping function fv:𝒱𝒴vf_{v}:\mathcal{V}\rightarrow\mathcal{Y}^{v} and an edge type mapping function fe:𝒴ef_{e}:\mathcal{E}\rightarrow\mathcal{Y}^{e}, where 𝒴v\mathcal{Y}^{v} is the set of node types and 𝒴e\mathcal{Y}^{e} is the set of edge types. For a graph with more than one type of node or edge, e.g., |𝒴v|>1\left|\mathcal{Y}^{v}\right|>1 or |𝒴e|>1\left|\mathcal{Y}^{e}\right|>1, we define it as a heterogeneous graph, otherwise, it is a homogeneous graph. There are some special types of heterogeneous graphs: a bipartite graph with |𝒴v|=2\left|\mathcal{Y}^{v}\right|=2 and |𝒴e|=1\left|\mathcal{Y}^{e}\right|=1, and a multiplex graph with |𝒴v|=1\left|\mathcal{Y}^{v}\right|=1 and |𝒴e|>1\left|\mathcal{Y}^{e}\right|>1.

Definition 5 (Spatial-Temporal Graph): A spatial-temporal graph is a special dynamic graph in which only the node attributes change over time, while the node set and graph structure remain unchanged. The spatial-temporal graph is defined as g=(𝒱,,𝐗(t))g=(\mathcal{V},\mathcal{E},\mathbf{X}^{(t)}) or g=(𝐀,𝐗(t))g=(\mathbf{A},\mathbf{X}^{(t)}), where 𝐗(t)N×d0\mathbf{X}^{(t)}\in\mathbb{R}^{N\times d_{0}} is the node feature matrix at the time step tt (1tT1\leq t\leq T).

2.2 Downstream Tasks

The downstream tasks for graph data are divided into three categories: node-level, link-level, and graph-level tasks. A node-level graph encoder fθ()f_{\theta}(\cdot) is often used to generate node embeddings for each node, and a graph-level graph encoder fγ()f_{\gamma}(\cdot) is used to generate graph-level embeddings. Finally, the learned embeddings are fed into an optional prediction head gω()g_{\omega}(\cdot) to perform specific downstream tasks.

Node-level tasks. Node-level tasks focus on the properties of nodes, and node classification is a typical node-level task where only a subset of nodes 𝒱L\mathcal{V}_{L} with corresponding labels 𝒴L\mathcal{Y}_{L} is known; we denote the labeled data as 𝒟L=(𝒱L,𝒴L)\mathcal{D}_{L}=(\mathcal{V}_{L},\mathcal{Y}_{L}) and the unlabeled data as 𝒟U=(𝒱U,𝒴U)\mathcal{D}_{U}=(\mathcal{V}_{U},\mathcal{Y}_{U}). Let fθ:𝒱𝒴f_{\theta}:\mathcal{V}\rightarrow\mathcal{Y} be a graph encoder trained on labeled data 𝒟L\mathcal{D}_{L} so that it can be used to infer the labels 𝒴U\mathcal{Y}_{U} of unlabeled data. Thus, the objective function for node classification can be defined as minimizing the loss node\mathcal{L}_{node}, as follows

minθ,ωnode\displaystyle\min_{\theta,\omega}\mathcal{L}_{node} (𝐀,𝐗,θ,ω)=(vi,yi)𝒟L(gω(𝐡i),yi)\displaystyle\left(\mathbf{A},\mathbf{X},\theta,\omega\right)=\sum_{(v_{i},y_{i})\in\mathcal{D}_{L}}\ell\big{(}g_{\omega}(\mathbf{h}_{i}),y_{i}\big{)} (1)

where 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐡i\mathbf{h}_{i} is the embedding of node viv_{i} in the embedding matrix 𝐇\mathbf{H}. (,)\ell(\cdot,\cdot) denotes the cross entropy.
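
To make Equ. 1 concrete, below is a minimal PyTorch sketch (not taken from any surveyed work) of the node classification objective: the embedding matrix stands in for the output of a node-level encoder, a linear layer plays the role of the prediction head, and all shapes, names, and data are illustrative.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of Equ. 1: cross-entropy over the labeled node subset D_L.
# H stands in for the output of a node-level encoder f_theta(A, X);
# g_omega is a linear prediction head mapping embeddings to class logits.
N, d_L, num_classes = 100, 64, 7
H = torch.randn(N, d_L, requires_grad=True)      # placeholder for f_theta(A, X)
g_omega = torch.nn.Linear(d_L, num_classes)

labeled_idx = torch.arange(20)                   # indices of the labeled nodes V_L
y_L = torch.randint(0, num_classes, (20,))       # their labels Y_L

logits = g_omega(H[labeled_idx])                 # g_omega(h_i) for v_i in V_L
loss_node = F.cross_entropy(logits, y_L)         # l(g_omega(h_i), y_i) averaged over D_L
loss_node.backward()                             # gradients flow to theta (via H) and omega
```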

Link-level tasks. Link-level tasks focus on the representation of node pairs or the properties of edges. Taking link prediction as an example, given two nodes, the goal is to discriminate whether there is an edge between them. Thus, the objective function for link prediction can be defined as,

minθ,ωlink(𝐀,𝐗,θ,ω)=vi,vj𝒱,ij(gω(𝐡i,𝐡j),𝐀i,j)\begin{split}\min_{\theta,\omega}\mathcal{L}_{link}\left(\mathbf{A},\mathbf{X},\theta,\omega\right)=\sum_{v_{i},v_{j}\in\mathcal{V},i\neq j}\ell\big{(}g_{\omega}(\mathbf{h}_{i},\mathbf{h}_{j}),\mathbf{A}_{i,j}\big{)}\end{split} (2)

where 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐡i\mathbf{h}_{i} is the embedding of node viv_{i} in the embedding matrix 𝐇\mathbf{H}. gω()g_{\omega}(\cdot) linearly maps the input to a one-dimensional value, and (,)\ell(\cdot,\cdot) is the cross-entropy loss.

TABLE I: Notations used in this paper.
Notations Descriptions
m\mathbb{R}^{m} mm-dimensional Euclidean space
a,𝐚,𝐀a,\mathbf{a},\mathbf{A} Scalar, vector, matrix
𝒢\mathcal{G} The set of graphs, 𝒢={g1,g2,,g|𝒢|}\mathcal{G}=\{g_{1},g_{2},\cdots,g_{|\mathcal{G}|}\}
gg A graph g=(𝒱,)g=(\mathcal{V},\mathcal{E})
𝒱\mathcal{V} The set of nodes in graph gg
viv_{i} A node vi𝒱v_{i}\in\mathcal{V}
\mathcal{E} The set of edges in graph gg
ei,je_{i,j} An edge ei,je_{i,j}\in\mathcal{E} between node viv_{i} and node vjv_{j}
NN Number of nodes, N=|𝒱|N=|\mathcal{V}|
MM Number of edges, M=||M=|\mathcal{E}|
𝐀N×N\mathbf{A}\in\mathbb{R}^{N\times N} A graph adjacency matrix
𝐃\mathbf{D} The degree matrix of 𝐀\mathbf{A}, 𝐃ii=j=1n𝐀ij\mathbf{D}_{ii}=\sum_{j=1}^{n}\mathbf{A}_{ij}
𝐈N\mathbf{I}_{N} Identity matrix of dimension NN
𝒩i(l)\mathcal{N}_{i}^{(l)} ll-hop Neighborhood set of node viv_{i}
𝒩i\mathcal{N}_{i} 11-hop Neighborhood set of node viv_{i}
LL The layer number
ll The layer index, 1lL1\leq l\leq L
TT The time step/iteration number
tt The time step/iteration index, 1tT1\leq t\leq T
d0d_{0} Dimension of node feature vectors
dld_{l} Dimension of node embeddings in the ll-th layer
b0b_{0} Dimension of edge feature vectors
𝐱id0\mathbf{x}_{i}\in\mathbb{R}^{d_{0}} Feature vector of node viv_{i}
𝐗N×d0\mathbf{X}\in\mathbb{R}^{N\times d_{0}} Node feature matrix, 𝐗=[𝐱1,𝐱2,,𝐱N]\mathbf{X}=\left[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\right]
𝐗(t)N×d0\mathbf{X}^{(t)}\in\mathbb{R}^{N\times d_{0}} Node feature matrix at the time step tt
𝐱i,jeb0\mathbf{x}^{e}_{i,j}\in\mathbb{R}^{b_{0}} Feature vector of edge ei,je_{i,j}
𝐗eM×b0\mathbf{X}^{e}\in\mathbb{R}^{M\times b_{0}} Edge feature matrix
𝐡i(l)dl\mathbf{h}_{i}^{(l)}\in\mathbb{R}^{d_{l}} Node embedding of node viv_{i} in the ll-th layer
𝐇(l)N×dl\mathbf{H}^{(l)}\in\mathbb{R}^{N\times d_{l}} Embedding matrix in the ll-th layer
𝐡idL\mathbf{h}_{i}\in\mathbb{R}^{d_{L}} Node embedding in the LL-th layer, 𝐡i=𝐡i(L)\mathbf{h}_{i}=\mathbf{h}_{i}^{(L)}
𝐇N×dL\mathbf{H}\in\mathbb{R}^{N\times d_{L}} Embedding matrix in the LL-th layer, 𝐇=𝐇(L)\mathbf{H}=\mathbf{H}^{(L)}
𝐡gdL\mathbf{h}_{g}\in\mathbb{R}^{d_{L}} Graph-level representation of graph gg
|||\cdot| The length of a set
\odot Element-wise multiplication operation
\| Vector concatenation
σ()\sigma(\cdot) The logistic sigmoid activation function
tanh()\tanh(\cdot) The hyperbolic tangent activation function
LeakyReLU()(\cdot) The LeakyReLU activation function
READOUT()\operatorname{READOUT}(\cdot) The readout function
fθ,fθ1,fθ2,f_{\theta},f_{\theta_{1}},f_{\theta_{2}},\cdots Node-level encoder to output 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})
fγ,fγ1,fγ2,f_{\gamma},f_{\gamma_{1}},f_{\gamma_{2}},\cdots Graph-level encoder to output 𝐡g=fγ(𝐀,𝐗)\mathbf{h}_{g}=f_{\gamma}(\mathbf{A},\mathbf{X})
gω,gω1,gω2,g_{\omega},g_{\omega_{1}},g_{\omega_{2}},\cdots The prediction head
𝒯,𝒯1,𝒯2,\mathcal{T},\mathcal{T}_{1},\mathcal{T}_{2},\cdots The data augmentation transformation
𝐖,Θ,θ,γ,ω\mathbf{W},\Theta,\theta,\gamma,\omega Learnable model parameters

Graph-level tasks. Graph-level tasks learn from multiple graphs in a dataset and predict the property of a single graph. For example, graph regression is a typical graph-level task where only a subset of graphs 𝒢L\mathcal{G}_{L} with corresponding properties 𝒫L\mathcal{P}_{L} are known, and we denote it as 𝒟L=(𝒢L,𝒫L)\mathcal{D}_{L}=(\mathcal{G}_{L},\mathcal{P}_{L}). Let fγ:𝒢𝒫f_{\gamma}:\mathcal{G}\rightarrow\mathcal{P} be a graph encoder trained on labeled data 𝒟L\mathcal{D}_{L} and then used to infer the properties 𝒫U\mathcal{P}_{U} of unlabeled graphs 𝒢U\mathcal{G}_{U}. Thus, the objective function for graph regression can be defined as minimizing loss graph\mathcal{L}_{graph},

minγ,ωgraph(𝐀i,𝐗i,γ,ω)=(gi,pi)𝒟L(gω(𝐡gi),pi)\min_{\gamma,\omega}\mathcal{L}_{graph}\left(\mathbf{A}_{i},\mathbf{X}_{i},\gamma,\omega\right)=\sum_{\left(g_{i},p_{i}\right)\in\mathcal{D}_{L}}\ell\big{(}g_{\omega}(\mathbf{h}_{g_{i}}),p_{i}\big{)} (3)

where 𝐡gi=fγ(𝐀i,𝐗i)\mathbf{h}_{g_{i}}=f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}) is the graph-level representation of graph gig_{i}. gω()g_{\omega}(\cdot) linearly maps the input to a one-dimensional value, and (,)\ell(\cdot,\cdot) is the mean absolute error.
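
As a concrete illustration of Equ. 3, the following hedged PyTorch sketch computes graph-level representations with a simple mean READOUT and applies the mean absolute error over a toy batch of labeled graphs; the per-graph node embeddings stand in for the output of a graph-level encoder, and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of Equ. 3: graph regression with a mean-pooling READOUT and an MAE loss.
# Each graph g_i is represented here only by a placeholder node embedding matrix;
# the encoder f_gamma and the prediction head g_omega are stand-ins.
g_omega = torch.nn.Linear(64, 1)                       # maps h_{g_i} to a scalar property

# toy batch: three graphs with different numbers of nodes, plus their properties p_i
H_per_graph = [torch.randn(n, 64) for n in (10, 25, 17)]
p = torch.tensor([0.3, -1.2, 0.8])

h_g = torch.stack([H.mean(dim=0) for H in H_per_graph])    # graph-level READOUT
loss_graph = F.l1_loss(g_omega(h_g).squeeze(-1), p)        # mean absolute error
```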

2.3 Graph Neural Networks

Graph neural networks (GNNs) [22, 23, 8] are a family of neural networks that have been widely used as the backbone encoder in most of the reviewed works. A general GNN framework involves two key computations for each node viv_{i} at every layer: (1) AGGREGATE\operatorname{AGGREGATE} operation: aggregating messages from the neighborhood 𝒩i\mathcal{N}_{i}; (2) UPDATE\operatorname{UPDATE} operation: updating the node representation from its representation in the previous layer and the aggregated messages. Considering an LL-layer GNN, the formulation of the ll-th layer is as follows

𝐚i(l)\displaystyle\mathbf{a}_{i}^{(l)} =AGGREGATE(l)({𝐡j(l1):vj𝒩i})\displaystyle=\operatorname{AGGREGATE}^{(l)}\left(\left\{\mathbf{h}_{j}^{(l-1)}:v_{j}\in\mathcal{N}_{i}\right\}\right) (4)
𝐡i(l)\displaystyle\mathbf{h}_{i}^{(l)} =UPDATE(l)(𝐡i(l1),𝐚i(l))\displaystyle=\operatorname{UPDATE}^{(l)}\left(\mathbf{h}_{i}^{(l-1)},\mathbf{a}_{i}^{(l)}\right)

where 1lL1\leq l\leq L and 𝐡i(l)\mathbf{h}_{i}^{(l)} is the embedding of node viv_{i} in the ll-th layer with 𝐡i(0)=𝐱i\mathbf{h}_{i}^{(0)}=\mathbf{x}_{i}. For node-level or edge-level tasks, the node representation 𝐡i(L)\mathbf{h}_{i}^{(L)} can sometimes be used for downstream tasks directly. However, for graph-level tasks, an extra READOUT\operatorname{READOUT} function is required to aggregate node features to obtain a graph-level representation 𝐡g\mathbf{h}_{g}, as follows

𝐡g=READOUT({𝐡i(L)vi𝒱})\displaystyle\mathbf{h}_{g}=\operatorname{READOUT}\left(\left\{\mathbf{h}_{i}^{(L)}\mid v_{i}\in\mathcal{V}\right\}\right) (5)

The design of these component functions is crucial, but it is beyond the scope of this paper. For a thorough review, we refer readers to the recent survey [24].
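
To illustrate Equ. 4 and Equ. 5, the hedged PyTorch sketch below uses one simple choice of component functions (mean aggregation, a linear update with concatenation, and a mean READOUT); it is not any specific published GNN, and the toy graph and dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One layer of Equ. 4: mean-AGGREGATE over neighbors, then a linear UPDATE."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.update = nn.Linear(2 * d_in, d_out)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        a = (A @ H) / deg                                  # AGGREGATE: mean of neighbor embeddings
        return torch.relu(self.update(torch.cat([H, a], dim=-1)))  # UPDATE

def readout(H):
    """Equ. 5: a permutation-invariant READOUT (here, a simple mean over nodes)."""
    return H.mean(dim=0)

# toy usage: a random symmetric adjacency matrix and random node features
N, d0 = 6, 8
A = (torch.rand(N, N) < 0.4).float()
A = ((A + A.t()) > 0).float()
A.fill_diagonal_(0)
X = torch.randn(N, d0)

layers = [SimpleGNNLayer(d0, 16), SimpleGNNLayer(16, 16)]
H = X
for layer in layers:
    H = layer(A, H)
h_g = readout(H)                                           # graph-level representation
```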

Refer to caption
(a) Pre-train&Fine-tune (P&F)
Refer to caption
(b) Joint Learning (JL)
Refer to caption
(c) Unsupervised Representation Learning (URL)
Figure 2: An overview of training strategies for graph SSL. The training strategies can be divided into three categories. (a): for the Pre-train&Fine-tune strategy, it first pre-trains the encoder fθ()f_{\theta}(\cdot) with unlabeled nodes by the self-supervised pretext tasks. The pre-trained encoder’s parameters θinit\theta_{init} are then used as the initialization of the encoder for supervised fine-tuning on downstream tasks. (b): for the Joint Learning strategy, an auxiliary pretext task is included to help learn the supervised downstream task. The encoder is trained through both the pretext task and the downstream task simultaneously. (c): for the Unsupervised Representation Learning strategy, it first pre-trains the encoder fθ()f_{\theta}(\cdot) with unlabeled nodes by the self-supervised pretext tasks. The pre-trained encoder’s parameters θinit\theta_{init} are then frozen and used in the supervised downstream task with additional labels.

2.4 Training Strategy

The training strategies can be divided into three categories: Pre-training and Fine-tuning (P&F), Joint Learning (JL), and Unsupervised Representation Learning (URL), with their detailed workflow shown in Fig. 2.

Pre-training and Fine-tuning (P&F). In this strategy, the model is trained in a two-stage paradigm [10]. The encoder fθ()f_{\theta}(\cdot) is pre-trained with the pretext tasks, then the pre-trained parameters θinit\theta_{init} are used as the initialization of the encoder fθinit()f_{\theta_{init}}(\cdot). At the fine-tuning stage, the pre-trained encoder fθinit()f_{\theta_{init}}(\cdot) is fine-tuned with a prediction head gω()g_{\omega}(\cdot) under the supervision of specific downstream tasks. The learning objective is formulated as

θ,ω=argmin(θ,ω)task(fθinit,gω)\theta^{*},\omega^{*}=\arg\min_{(\theta,\omega)}\mathcal{L}_{task}(f_{\theta_{init}},g_{\omega}) (6)

with initialization θinit=argminθssl(fθ)\theta_{init}=\arg\min_{\theta}\mathcal{L}_{ssl}(f_{\theta}), where task\mathcal{L}_{task} and ssl\mathcal{L}_{ssl} are the loss functions of the downstream tasks and the self-supervised pretext tasks, respectively.

Joint Learning. In this scheme, the encoder fθ()f_{\theta}(\cdot) is jointly trained with a prediction head gω()g_{\omega}(\cdot) under the supervision of pretext tasks and downstream tasks. The joint learning strategy can also be considered a kind of multi-task learning, where the pretext tasks serve as a regularization. The learning objective is formulated as

\theta^{*},\omega^{*}=\arg\min_{(\theta,\omega)}\left[\mathcal{L}_{task}(f_{\theta},g_{\omega})+\alpha\mathcal{L}_{ssl}(f_{\theta})\right] (7)

where α\alpha is a trade-off hyperparameter.

Unsupervised Representation Learning. This strategy can also be considered as a two-stage paradigm, with the first stage similar to Pre-training. However, at the second stage, the pre-trained parameters θinit\theta_{init} are frozen and the model is trained on the frozen representations with downstream tasks only. The learning objective is formulated as

ω=argminωtask(fθinit,gω)\omega^{*}=\arg\min_{\omega}\mathcal{L}_{task}(f_{\theta_{init}},g_{\omega}) (8)

with initialization θinit=argminθssl(fθ)\theta_{init}=\arg\min_{\theta}\mathcal{L}_{ssl}(f_{\theta}). Compared to other schemes, unsupervised representation learning is more challenging since the learning of the encoder fθ()f_{\theta}(\cdot) depends only on the pretext task and is frozen in the second stage. In contrast, in the P&F strategy, the encoder fθ()f_{\theta}(\cdot) can be further optimized under the supervision of the downstream task during the fine-tuning stage.
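
The three training strategies of Equ. 6, Equ. 7, and Equ. 8 can be summarized in a few lines of PyTorch. The sketch below uses toy linear modules, random data, and placeholder pretext/downstream losses purely for illustration; for brevity the same toy modules are reused across the three strategies, whereas in practice each strategy would start from a freshly initialized encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a linear "encoder" f_theta, a linear head g_omega, random data.
X = torch.randn(200, 16)
y = torch.randint(0, 3, (200,))
encoder, head = nn.Linear(16, 32), nn.Linear(32, 3)

def ssl_loss(enc):      # placeholder pretext loss (tied-weight feature reconstruction)
    Z = enc(X)
    return F.mse_loss(Z @ enc.weight, X)

def task_loss(enc, hd):  # supervised downstream loss (node classification style)
    return F.cross_entropy(hd(enc(X)), y)

def train(params, loss_fn, steps=50, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

# (a) Pre-train & Fine-tune (Equ. 6): pretext first, then fine-tune encoder + head
train(encoder.parameters(), lambda: ssl_loss(encoder))
train(list(encoder.parameters()) + list(head.parameters()),
      lambda: task_loss(encoder, head))

# (b) Joint Learning (Equ. 7): one loop over L_task + alpha * L_ssl
alpha = 0.5
train(list(encoder.parameters()) + list(head.parameters()),
      lambda: task_loss(encoder, head) + alpha * ssl_loss(encoder))

# (c) Unsupervised Representation Learning (Equ. 8): pretext, then freeze the encoder
train(encoder.parameters(), lambda: ssl_loss(encoder))
for p in encoder.parameters():
    p.requires_grad_(False)
train(head.parameters(), lambda: task_loss(encoder, head))
```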

3 Contrastive Learning

3.1 A Unified Perspective

Inspired by the recent advances of contrastive learning in the CV and NLP domains, some works have been proposed to apply contrastive learning to graph data. However, most works simply present motivations or implementations from different perspectives, but adopt very similar (or even the same) architectures and designs in practice, which leads to the emergence of duplicative efforts and hinders the healthy development of the community. In this survey, we therefore review existing work from a unified perspective, unify it into a general framework, and present various designs for the three main modules of contrastive learning, i.e., data augmentation, pretext tasks, and contrastive objectives. In turn, the contributions of existing work can be essentially summarized as innovations in these three modules.

In practice, we usually generate multiple views for each instance through various data augmentations. Two views generated from the same instance are usually considered as a positive pair, while two views generated from different instances are considered as a negative pair. The primary goal of contrastive learning is to maximize the agreement of two jointly sampled positive pairs against the agreement of two independently sampled negative pairs. The agreement between views is usually measured through Mutual Information (MI) estimation. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), KK different transformations 𝒯1,𝒯2,,𝒯K\mathcal{T}_{1},\mathcal{T}_{2},\cdots,\mathcal{T}_{K} can be applied to obtain multiple views {(𝐀k,𝐗k)}k=1K\{(\mathbf{A}_{k},\mathbf{X}_{k})\}_{k=1}^{K}, defined as

𝐀k,𝐗k=𝒯k(𝐀,𝐗);k=1,2,,K\mathbf{A}_{k},\mathbf{X}_{k}=\mathcal{T}_{k}(\mathbf{A},\mathbf{X});k=1,2,\cdots,K (9)

Secondly, a set of graph encoders {fθk}k=1K\{f_{\theta_{k}}\}_{k=1}^{K} (may be identical or share weights) can be used to generate different representations 𝐡1,𝐡2,,𝐡K\mathbf{h}_{1},\mathbf{h}_{2},\cdots,\mathbf{h}_{K} for each view, given by

\mathbf{h}_{k}=f_{\theta_{k}}(\mathbf{A}_{k},\mathbf{X}_{k});\quad k=1,2,\cdots,K (10)

The contrastive learning aims to maximize the mutual information of two views from the same instance as

maxθ1,θ2,,θKijiαi,j(𝐡i,𝐡j)\max_{\theta_{1},\theta_{2},\cdots,\theta_{K}}\sum_{i}\sum_{j\neq i}\alpha_{i,j}\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j}) (11)

where i,j{1,2,,K}i,j\in\{1,2,\cdots,K\}, {𝐡i}i=1K\{\mathbf{h}_{i}\}_{i=1}^{K} are representations generated from g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), which are taken as positive samples. (𝐡i,𝐡j)\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j}) denotes the mutual information between two representations 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j}. Note that depending on different pretext tasks, {𝐡k}k=1K\{\mathbf{h}_{k}\}_{k=1}^{K} may not be at the same scale, either being a node-level, subgraph-level, or graph-level representation. The negative samples to contrast with {𝐡i}i=1K\{\mathbf{h}_{i}\}_{i=1}^{K} can be taken as representations {𝐡~i}i=1K\{\widetilde{\mathbf{h}}_{i}\}_{i=1}^{K} that are generated from another graph g~=(𝐀~,𝐗~)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}). Besides, we have αi,j{0,1}\alpha_{i,j}\in\{0,1\}, whose concrete values vary across different schemes.

The design of the contrastive learning for graph data can be summarized as three main modules: (1) data augmentation strategy, (2) pretext task, and (3) contrastive objective. The design of graph encoder is not the focus of graph self-supervised learning and beyond the scope of this survey; for more details, please refer to the related survey [24].

3.2 Data Augmentation

The recent works in the CV domain show that the success of contrastive learning relies heavily on well-designed data augmentation strategies, and in particular, certain kinds of augmentations play a very important role in improving performance. However, due to the inherent non-Euclidean properties of graph data, it is difficult to directly apply data augmentations designed for images to graphs. Here, we divide the data augmentation strategy for graph data into four categories: feature-based, structure-based, sampling-based, and adaptive augmentation. An overview of four types of augmentations is presented in Fig. 3.

3.2.1 Feature-based Augmentation

Given an input graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a feature-based augmentation only performs transformation on the node feature matrix 𝐗\mathbf{X} or edge feature matrix 𝐗e\mathbf{X}^{e}. Without loss of generality, we take 𝐗\mathbf{X} as an example, given by

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝐀,𝒯𝐗(𝐗)\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathbf{A},\mathcal{T}_{\mathbf{X}}(\mathbf{X}) (12)

Attribute Masking. The attribute masking [25, 10, 11, 26, 27] randomly masks a small portion of attributes. We specify 𝒯𝐗(𝐗)\mathcal{T}_{\mathbf{X}}(\mathbf{X}) for the attribute masking as

𝒯𝐗(𝐗)=𝐗(1𝐋)+𝐌𝐋\mathcal{T}_{\mathbf{X}}(\mathbf{X})=\mathbf{X}\odot(1-\mathbf{L})+\mathbf{M}\odot\mathbf{L} (13)

where 𝐋\mathbf{L} is a masking location matrix where 𝐋i,j=1\mathbf{L}_{i,j}=1 if the jj-th element of node viv_{i} is masked, otherwise 𝐋i,j=0\mathbf{L}_{i,j}=0. 𝐌\mathbf{M} denotes a masking value matrix. The matrix 𝐋\mathbf{L} is usually sampled by Bernoulli distribution or assigned manually. Besides, different schemes of 𝐌\mathbf{M} result in different augmentations. For example, 𝐌=𝟎\mathbf{M}=\mathbf{0} denotes a constant masking, 𝐌N(𝟎,𝚺)\mathbf{M}\sim N(\mathbf{0},\mathbf{\Sigma}) replaces the original values by Gaussian noise and 𝐌N(𝐗,𝚺)\mathbf{M}\sim N(\mathbf{X},\mathbf{\Sigma}) adds Gaussian noise to the input.

Attribute Shuffling. The attribute shuffling [9, 28, 29, 30, 31] performs the row-wise shuffling on the attribute matrix 𝐗\mathbf{X}. That is, the augmented graph consists of the same nodes as the original graph, but they are located in different places in the graph, and receive different contextual information. We specify 𝒯𝐗(𝐗)\mathcal{T}_{\mathbf{X}}(\mathbf{X}) for the attribute shuffling as

𝒯𝐗(𝐗)=𝐗[idx,:]\mathcal{T}_{\mathbf{X}}(\mathbf{X})=\mathbf{X}[idx,:] (14)

where idxidx is a list containing numbers from 1 to N(N=|𝒱|)N(N=|\mathcal{V}|), but with a random arrangement.
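
Both feature-based augmentations reduce to a few lines of NumPy. The sketch below is a hedged illustration of Equ. 13 and Equ. 14, with the masking locations 𝐋 sampled from a Bernoulli distribution and the masking values 𝐌 set to either constants or Gaussian noise; the masking ratio and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def attribute_masking(X, p=0.1, noise_sigma=None):
    """Equ. 13: mask a random portion of entries (L ~ Bernoulli(p)).
    M = 0 gives constant masking; with noise_sigma set, masked entries get Gaussian noise."""
    L = (rng.random(X.shape) < p).astype(X.dtype)      # masking location matrix
    M = np.zeros_like(X) if noise_sigma is None else rng.normal(0.0, noise_sigma, X.shape)
    return X * (1 - L) + M * L

def attribute_shuffling(X):
    """Equ. 14: row-wise shuffling of the node feature matrix."""
    idx = rng.permutation(X.shape[0])
    return X[idx, :]

X = rng.normal(size=(5, 4))
X_masked, X_shuffled = attribute_masking(X), attribute_shuffling(X)
```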

(a) Feature-based
(b) Structure-based
(c) Sampling-based
(d) Adaptive
Figure 3: A comparison of the feature-based, structure-based, sampling-based, and adaptive augmentation. The feature-based augmentation generally randomly (or manually) masks a small portion of node or edge attributes with constants or random values. The structure-based augmentation randomly (or manually) adds or removes small portions of edges from the graph, which includes methods like edge perturbation, node insertion, and edge diffusion. The sampling-based augmentation samples nodes and their connected edges from the graph under specific rules, which include Uniform Sampling, Ego-net Sampling, Random Walk Sampling, Importance Sampling, Knowledge Sampling, etc. The adaptive augmentation adopts attention- or gradient-based schemes to perform adaptive sampling based on the learned attention scores or gradient magnitudes. The numbers in Fig. 3(d) are the importance scores of the nodes and edges, and we sample the 4 most important nodes and 3 edges as an example.

3.2.2 Structure-based Augmentation

Given a graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a structure-based augmentation only performs transformation on the adjacency matrix 𝐀\mathbf{A}, as follows

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝒯𝐀(𝐀),𝐗\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathcal{T}_{\mathbf{A}}(\mathbf{A}),\mathbf{X} (15)

Edge Perturbation. The edge perturbation [25, 14, 32, 33, 34] perturbs structural connectivity through randomly adding or removing a certain ratio of edges. We specify 𝒯𝐀(𝐀)\mathcal{T}_{\mathbf{A}}(\mathbf{A}) for the edge perturbation as

𝒯𝐀(𝐀)=𝐀(1𝐋)+(1𝐀)𝐋\mathcal{T}_{\mathbf{A}}(\mathbf{A})=\mathbf{A}\odot(1-\mathbf{L})+(1-\mathbf{A})\odot\mathbf{L} (16)

where 𝐋\mathbf{L} is a perturbation location matrix where 𝐋i,j=𝐋j,i=1\mathbf{L}_{i,j}=\mathbf{L}_{j,i}=1 if the edge between node viv_{i} and vjv_{j} will be perturbed, otherwise 𝐋i,j=𝐋j,i=0\mathbf{L}_{i,j}=\mathbf{L}_{j,i}=0. Different values in 𝐋\mathbf{L} result in different perturbation strategies, and the more values that are set to 1 in 𝐋\mathbf{L}, the more severe the perturbation is.
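
A minimal NumPy sketch of Equ. 16 is given below, where the perturbation location matrix 𝐋 is sampled symmetrically with a given probability; the perturbation ratio and the toy adjacency matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_perturbation(A, p=0.1):
    """Equ. 16: flip a random symmetric subset of entries of the adjacency matrix.
    L_{i,j} = L_{j,i} = 1 marks the node pairs whose connectivity is perturbed."""
    upper = np.triu(rng.random(A.shape) < p, k=1)       # sample the upper triangle
    L = (upper | upper.T).astype(A.dtype)               # symmetric perturbation locations
    return A * (1 - L) + (1 - A) * L                    # flip the selected entries

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_perturbed = edge_perturbation(A, p=0.3)
```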

Node Insertion. The node insertion [34] adds KK nodes 𝒱a={vN+k}k=1K\mathcal{V}_{a}=\{v_{N+k}\}_{k=1}^{K} to the node set 𝒱\mathcal{V} and adds some edges between 𝒱a\mathcal{V}_{a} and 𝒱\mathcal{V}. For a structure transformation 𝐀~=𝒯𝐀(𝐀)\widetilde{\mathbf{A}}=\mathcal{T}_{\mathbf{A}}(\mathbf{A}), we have 𝐀~:N,:N=𝐀\widetilde{\mathbf{A}}_{:N,:N}=\mathbf{A}. Given the connection ratio rr, we have

p(𝐀~i,j=𝐀~j,i=1)=r,p(𝐀~i,j=𝐀~j,i=0)=1rp(\widetilde{\mathbf{A}}_{i,j}=\widetilde{\mathbf{A}}_{j,i}=1)=r,p(\widetilde{\mathbf{A}}_{i,j}=\widetilde{\mathbf{A}}_{j,i}=0)=1-r (17)

for N+1i,jN+KN+1\leq i,j\leq N+K.

Edge Diffusion. The edge diffusion [35, 18] generates a different topological view of the original graph structure, with the general edge diffusion process defined as

𝒯𝐀(𝐀)=k=0Θk𝐒k\mathcal{T}_{\mathbf{A}}(\mathbf{A})=\sum_{k=0}^{\infty}\Theta_{k}\mathbf{S}^{k} (18)

where 𝐒N×N\mathbf{S}\in\mathbb{R}^{N\times N} is the generalized transition matrix and Θ\Theta is the weighting coefficient which satisfies k=0Θk=1,Θk[0,1]\sum_{k=0}^{\infty}\Theta_{k}=1,\Theta_{k}\in[0,1]. Two instantiations of Equ. 18 are: (1) Personalized PageRank (PPR) with 𝐒=𝐃1/2𝐀𝐃1/2\mathbf{S}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2} and Θk=α(1α)k\Theta_{k}=\alpha(1-\alpha)^{k}, and (2) Heat Kernel (HK) with 𝐒=𝐀𝐃1\mathbf{S}=\mathbf{A}\mathbf{D}^{-1} and Θk=ettk/k!\Theta_{k}=e^{-t}t^{k}/k!, where α\alpha denotes teleport probability in a random walk and tt is the diffusion time. The closed-form solutions of PPR and HK diffusion are formulated as

𝒯𝐀PPR(𝐀)=\displaystyle\mathcal{T}_{\mathbf{A}}^{PPR}(\mathbf{A})= α(𝐈n(1α)𝐃1/2𝐀𝐃1/2)1\displaystyle\alpha\left(\mathbf{I}_{n}-(1-\alpha)\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)^{-1} (19)
𝒯𝐀HK(𝐀)=\displaystyle\mathcal{T}_{\mathbf{A}}^{HK}(\mathbf{A})= exp(t𝐀𝐃1t)\displaystyle\exp\left(t\mathbf{A}\mathbf{D}^{-1}-t\right)
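
The two closed-form diffusions in Equ. 19 can be computed directly on dense matrices, as in the hedged NumPy/SciPy sketch below; real implementations typically sparsify the resulting dense matrix, and the toy graph and hyperparameters are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def ppr_diffusion(A, alpha=0.15):
    """Closed-form PPR diffusion (Equ. 19): alpha * (I - (1-alpha) D^{-1/2} A D^{-1/2})^{-1}."""
    d = np.clip(A.sum(axis=1), 1e-12, None)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    return alpha * np.linalg.inv(np.eye(A.shape[0]) - (1 - alpha) * S)

def heat_diffusion(A, t=5.0):
    """Closed-form heat-kernel diffusion (Equ. 19): exp(t A D^{-1} - t I)."""
    d = np.clip(A.sum(axis=1), 1e-12, None)
    D_inv = np.diag(1.0 / d)
    return expm(t * (A @ D_inv) - t * np.eye(A.shape[0]))

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
A_ppr, A_heat = ppr_diffusion(A), heat_diffusion(A)
```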

3.2.3 Sampling-based Augmentation

Given an input graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a sampling-based augmentation performs transformation on both the adjacency matrix 𝐀\mathbf{A} and the feature matrix 𝐗\mathbf{X}, as follows

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝐀[𝒮,𝒮],𝐗[𝒮,:]\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathbf{A}[\mathcal{S},\mathcal{S}],\mathbf{X}[\mathcal{S},:] (20)

where 𝒮𝒱\mathcal{S}\subseteq\mathcal{V}, and existing methods usually apply one of five sampling strategies to obtain the node subset 𝒮\mathcal{S}: uniform sampling, ego-nets sampling, random walk sampling, importance sampling, and knowledge-based sampling.

Uniform Sampling. The uniform sampling [34] (a.k.a. Node Dropping) uniformly samples a given number of nodes 𝒮\mathcal{S} from 𝒱\mathcal{V} and removes the remaining nodes directly.

Ego-nets Sampling [11, 36, 37]. Given a typical graph encoder with LL layers, the computation of the node representation only depends on its LL-hop neighborhood. In particular, for each node viv_{i}, the transformation 𝒯()\mathcal{T}(\cdot) samples the LL-ego net surrounding node viv_{i}, with 𝒮\mathcal{S} defined as

𝒮={vjd(vi,vj)L}\mathcal{S}=\{v_{j}\mid d(v_{i},v_{j})\leq L\} (21)

where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and vjv_{j}. The Ego-nets Sampling is essentially a special version of Breadth-First Search (BFS) sampling.

Random Walk Sampling [38, 25, 18]. It starts a random walk on graph gg from the ego node viv_{i}. The walk iteratively travels to its neighborhood with the probability proportional to the edge weight. In addition, at each step, the walk returns back to the starting node viv_{i} with a positive probability α\alpha. Finally, the visited nodes are collected into a node subset 𝒮\mathcal{S}.

Importance Sampling [19]. Given a node viv_{i}, we can sample a subgraph based on the importance of its neighboring nodes, with an importance score matrix 𝐌\mathbf{M} defined as

\mathbf{M}=\alpha\cdot\left(\mathbf{I}_{n}-(1-\alpha)\cdot\mathbf{A}\mathbf{D}^{-1}\right)^{-1} (22)

where α[0,1]\alpha\in[0,1] is a hyperparameter. For a given node viv_{i}, the subgraph sampler chooses top-kk important neighbors anchored by viv_{i} to constitute a subgraph with the index of chosen nodes denoted as 𝒮=toprank(𝐌(i,:),k)\mathcal{S}=\operatorname{top}_{-}\operatorname{rank}(\mathbf{M}(i,:),k).
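
A hedged NumPy sketch of importance sampling is given below: the score matrix 𝐌 of Equ. 22 is computed in its personalized-PageRank form, and the top-kk nodes anchored by viv_{i} induce the sampled subgraph; the toy graph and hyperparameters are illustrative, not the reference implementation of [19].

```python
import numpy as np

def importance_subgraph(A, i, k, alpha=0.15):
    """Importance sampling sketch (Equ. 22): score the nodes around anchor v_i with a
    personalized-PageRank-style matrix M and keep the top-k most important nodes."""
    N = A.shape[0]
    d = np.clip(A.sum(axis=0), 1e-12, None)
    M = alpha * np.linalg.inv(np.eye(N) - (1 - alpha) * (A / d))   # A D^{-1} inside the inverse
    S = np.argsort(-M[i])[:k]                                      # indices of the top-k nodes
    return A[np.ix_(S, S)], S                                      # induced subgraph and node ids

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_sub, nodes = importance_subgraph(A, i=0, k=3)
```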

Knowledge Sampling [39]. The knowledge-based sampling incorporates domain knowledge into subgraph sampling. For example, the sampling process can be formalized as a library-based matching problem by counting the frequently occurring and bioinformatics substructures in the molecular graph and building libraries (or tables) for them.

3.2.4 Adaptive Augmentation

The adaptive augmentation usually employs attention scores or gradients to guide the selection of nodes or edges.

Attention-based. The attention-based methods typically define importance scores for nodes or edges and then augment data based on their importance. For example, GCA [40] proposes to keep important structures and attributes unchanged, while perturbing possibly unimportant edges and features. Specifically, the probability of edge removal and feature masking should be closely related to their importance. Given a node centrality measure φc():𝒱+\varphi_{c}(\cdot):\mathcal{V}\rightarrow\mathbb{R}^{+}, it defines edge centrality as the average of two adjacent nodes’ centrality scores, i.e., si,j=logφc(vi)+φc(vj)2s_{i,j}=\log\frac{\varphi_{c}(v_{i})+\varphi_{c}(v_{j})}{2}. Then, the importance of the edge ei,je_{i,j} is defined as

pi,j=min(smaxsi,jsmaxμspe,pτ)p_{i,j}=\min\left(\frac{s_{\max}-s_{i,j}}{s_{\max}-\mu_{s}}\cdot p_{e},p_{\tau}\right) (23)

where pep_{e} is a hyperparameter that controls the overall probability of removing edges, smaxs_{\max} and μs\mu_{s} are the maximum and average of {si,j}j=1N\{s_{i,j}\}_{j=1}^{N}, and pτ<1p_{\tau}<1 is a cut-off probability, used to truncate the probabilities since extremely high removal probabilities will overly corrupt graph structures. The node centrality can be defined as degree centrality, Eigenvector centrality [41], or PageRank centrality [42], which results in three variants. The attribute masking based on node importance is the same as above and will not be repeated.
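
The degree-centrality variant of Equ. 23 can be sketched in NumPy as follows; the resulting per-edge probabilities would then be used to sample the perturbation locations, and the constants here are illustrative rather than the settings used in [40].

```python
import numpy as np

def gca_edge_drop_prob(A, p_e=0.3, p_tau=0.7):
    """Sketch of Equ. 23 with degree centrality: edges between low-centrality nodes
    get higher removal probabilities, truncated at the cut-off p_tau."""
    deg = np.clip(A.sum(axis=1), 1, None)
    rows, cols = np.nonzero(np.triu(A, k=1))            # one entry per undirected edge
    s = np.log((deg[rows] + deg[cols]) / 2.0)           # edge centrality s_{i,j}
    p = np.minimum((s.max() - s) / (s.max() - s.mean() + 1e-12) * p_e, p_tau)
    return rows, cols, p                                 # removal probability per edge

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
edges_i, edges_j, drop_prob = gca_edge_drop_prob(A)
```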

Gradient-based. Unlike the simple uniform edge removal and insertion as in GRACE [26], GROC [43] adaptively performs gradient-based augmentation guided by edge gradient information. Specifically, it first applies two stochastic transformations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot) to graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}) to obtain two views, masking node attributes independently with probability r1r_{1} and r2r_{2} and then computing the contrastive loss ssl\mathcal{L}_{ssl} between these two views. For a given node viv_{i}, an edge removal candidate set is defined as

𝒮={(vi,vk)|vk𝒩i(l)}\displaystyle\mathcal{S}^{-}=\left\{(v_{i},v_{k})\Big{|}v_{k}\in\mathcal{N}_{i}^{(l)}\right\} (24)

and an edge insertion candidate set is defined as

𝒮+={(vi,vk)|vk(vm𝒩m(l)\𝒩i(l))}\displaystyle\mathcal{S}^{+}=\left\{(v_{i},v_{k})\Big{|}v_{k}\in\left(\cup_{v_{m}\in\mathcal{B}}\mathcal{N}_{m}^{(l)}\backslash\mathcal{N}_{i}^{(l)}\right)\right\} (25)

where 𝒱\mathcal{B}\subset\mathcal{V} is a node batch. 𝒮+\mathcal{S}^{+} is restricted to the set of edges (vi,vk)(v_{i},v_{k}) where viv_{i} is an anchor node, and vkv_{k} is within the ll-hop neighborhood of some other anchors vmviv_{m}\neq v_{i} but not within the ll-hop neighborhood of node viv_{i}. Finally, we backpropagate the loss ssl\mathcal{L}_{ssl} to obtain gradient intensity values for each edge in 𝒮\mathcal{S}^{-} and 𝒮+\mathcal{S}^{+}. Further gradient-based adaptive augmentation is then applied to the views by removing a subset of edges with minimal edge gradient magnitude values in 𝒮\mathcal{S}^{-} and inserting a subset of edges with maximal edge gradient magnitude values in 𝒮+\mathcal{S}^{+}.

Figure 4: A general framework for contrastive learning methods with three main modules: data augmentation strategies, pretext tasks, and contrastive objectives. Different views can be generated by a single or a combination of augmentations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot). For graph encoder fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot), the commonly used graph neural networks include GAE [44], VGAE [44], etc. However, the design of graph encoder is not the focus of graph SSL and thus beyond the scope of this survey. The two contrasting views may be local, contextual, or global, corresponding to node-level (marked in red), subgraph-level (marked in green), or graph-level (marked in yellow) information in the graph. The contrastive learning can thus contrast two views at the same or different scales, which leads to two categories of algorithm: (1) same-scale contrasting, including local-local, context-context, and global-global contrasting; and (2) cross-scale contrasting, including local-context, local-global, and context-global contrasting.

3.3 Pretext Task

Contrastive learning aims to maximize the agreement of two jointly sampled positive pairs. Depending on the definition of a graph view, the scale of the view may be local, contextual, or global, corresponding to the node-level, subgraph-level, or graph-level information in the graph. Therefore, contrastive learning may contrast two graph views at the same or different scales, which leads to two categories: (1) Contrasting with the same-scale and (2) Contrasting with the cross-scale. The two views in the same-scale contrasting, either positive or negative pairs, are at the same scale, such as node-node and graph-graph pairs, while the two views in the cross-scale contrasting have different scales, such as node-subgraph or node-graph contrasting. We categorize existing methods from these two perspectives and present them in a unified framework as shown in Fig. 4. In this section, due to space limitations, we present only some representative contrastive methods and place those relatively less important works in Appendix A.

3.3.1 Contrasting with the same-scale

The same-scale contrastive learning is further refined into three categories based on the different scales of the views: local-local, context-context, and global-global contrasting.

3.3.1.1 Global-Global Contrasting

GraphCL [25]. Four types of graph augmentations {𝒯k}k=14\{\mathcal{T}_{k}\}_{k=1}^{4} are applied to incorporate various priors: (1) Node Dropping 𝒯1()\mathcal{T}_{1}(\cdot); (2) Edge Perturbation 𝒯2()\mathcal{T}_{2}(\cdot); (3) Attribute Masking 𝒯3()\mathcal{T}_{3}(\cdot); (4) Subgraph Sampling 𝒯4()\mathcal{T}_{4}(\cdot). Given a graph gi=(𝐀i,𝐗i)𝒢g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}, it first applies a series of graph augmentations 𝒯()\mathcal{T}(\cdot) randomly selected from {𝒯k}k=14\{\mathcal{T}_{k}\}_{k=1}^{4} to generate an augmented graph g~i=(𝐀~i,𝐗~i)=𝒯(𝐀i,𝐗i)\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i}), and then learns to predict whether two graphs originate from the same graph or not. Specifically, a shared graph-level encoder fγ()f_{\gamma}(\cdot) is applied to obtain graph-level representations 𝐡gi=fγ(𝐀i,𝐗i)\mathbf{h}_{g_{i}}=f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}) and 𝐡~g~i=fγ(𝐀~i,𝐗~i)\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}=f_{\gamma}(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i}), respectively. Finally, the learning objective is defined as follows

\max_{\gamma}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right) (26)
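
In practice, the MI term in Equ. 26 is usually instantiated with a sample-based estimator such as InfoNCE/NT-Xent. The hedged PyTorch sketch below contrasts a batch of graph-level representations with their augmented counterparts and treats the other graphs in the batch as negatives; batch size, dimensions, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def graph_nt_xent(h, h_aug, tau=0.5):
    """A common InfoNCE-style estimator of Equ. 26: for each graph in a batch,
    its augmented view is the positive and all other graphs are negatives."""
    z1, z2 = F.normalize(h, dim=1), F.normalize(h_aug, dim=1)
    sim = z1 @ z2.t() / tau                              # cosine similarities / temperature
    targets = torch.arange(z1.size(0))                   # positives lie on the diagonal
    return F.cross_entropy(sim, targets)

# toy usage: graph-level representations h_{g_i} and their augmented counterparts
h_g     = torch.randn(32, 128)
h_g_aug = torch.randn(32, 128)
loss = graph_nt_xent(h_g, h_g_aug)
```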

Contrastive Self-supervised Learning (CSSL) [34] follows a very similar framework to GraphCL, differing only in the way the data is augmented. Along with node dropping, it also considers node insertion as an important augmentation strategy. Specifically, it randomly selects a strongly-connected subgraph SS, removes all edges in SS, adds a new node viv_{i}, and adds an edge between viv_{i} and each node in SS.

Label Contrastive Coding (LCC) [45] is proposed to encourage intra-class compactness and inter-class separability. To power contrastive learning, LCC introduces a dynamic label memory bank and a momentum-updated encoder. Specifically, the query graph (gq,yq)(g_{q},y_{q}) and key graph (gk,yk)(g_{k},y_{k}) are encoded by two graph-level encoders fγq()f_{\gamma_{q}}(\cdot) and fγk()f_{\gamma_{k}}(\cdot) to obtain graph-level representations 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}}, respectively. If 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}} have the same label, they are considered as the positive pair, otherwise, they are the negative pair. The label contrastive loss encourages the model to distinguish the positive pair from the negative pair. For the encoded query (gq,yq)(g_{q},y_{q}), its label contrastive loss is calculated by

maxγqlogi=1m𝕀yi=yqexp(𝐡gq𝐡gk(i)/τ)i=1mexp(𝐡gq𝐡gk(i)/τ)\max_{\gamma_{q}}\log\frac{\sum_{i=1}^{m}\mathbb{I}_{y_{i}=y_{q}}\cdot\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)}{\sum_{i=1}^{m}\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)} (27)

where mm is the size of the memory bank, τ\tau is the temperature hyperparameter, and 𝕀yi=yq\mathbb{I}_{y_{i}=y_{q}} is an indicator function to determine whether the label of the ii-th key graph gk(i)g_{k}^{(i)} in the memory bank is the same as yqy_{q}. The parameter γk\gamma_{k} of fγk()f_{\gamma_{k}}(\cdot) follows a momentum-based update mechanism as in MoCo [2], given by

γkαγk+(1α)γq\gamma_{k}\longleftarrow\alpha\gamma_{k}+(1-\alpha)\gamma_{q} (28)

where α[0,1)\alpha\in[0,1) is the momentum weight to control the speed of γk\gamma_{k} evolving.
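
The momentum update of Equ. 28 is a few lines of PyTorch, as in the hedged sketch below; it assumes the query and key encoders share the same architecture (here, placeholder linear layers).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(f_q: nn.Module, f_k: nn.Module, alpha=0.999):
    """Equ. 28: update the key encoder as an exponential moving average of the
    query encoder, gamma_k <- alpha * gamma_k + (1 - alpha) * gamma_q."""
    for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
        p_k.mul_(alpha).add_(p_q, alpha=1 - alpha)

# toy usage with two identically shaped placeholder encoders
f_q, f_k = nn.Linear(16, 8), nn.Linear(16, 8)
f_k.load_state_dict(f_q.state_dict())                   # usual initialization of the key branch
momentum_update(f_q, f_k)
```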

3.3.1.2 Context-Context Contrasting

Graph Contrastive Coding (GCC) [38] is a graph self-supervised pre-training framework that captures the universal graph topological properties across multiple graphs. Specifically, it first samples multiple subgraphs for each graph g𝒢g\in\mathcal{G} by random walk and collects them into a memory bank 𝒮\mathcal{S}. Then the query subgraph gq𝒮g_{q}\in\mathcal{S} and key subgraph gk𝒮g_{k}\in\mathcal{S} are encoded by two graph-level encoders fγq()f_{\gamma_{q}}(\cdot) and fγk()f_{\gamma_{k}}(\cdot) to obtain graph-level representations 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}}, respectively. If gqg_{q} and gkg_{k} are sampled from the same graph, they are considered as the positive pair, otherwise they are the negative pair. For the encoded query (gq,yq)(g_{q},y_{q}), where yqy_{q} is the index of the graph it is sampled from, its graph contrastive loss is calculated by

maxγqlogi=1|𝒮|𝕀yi=yqexp(𝐡gq𝐡gk(i)/τ)i=1|𝒮|exp(𝐡gq𝐡gk(i)/τ)\max_{\gamma_{q}}\log\frac{\sum_{i=1}^{|\mathcal{S}|}\mathbb{I}_{y_{i}=y_{q}}\cdot\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)}{\sum_{i=1}^{|\mathcal{S}|}\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)} (29)

where 𝕀yi=yq\mathbb{I}_{y_{i}=y_{q}} is an indicator function to determine whether the ii-th key graph gk(i)g_{k}^{(i)} in the memory bank and query graph gqg_{q} are sampled from the same graph. The parameter γk\gamma_{k} of fγk()f_{\gamma_{k}}(\cdot) follows a momentum-based updating as in Equ. 28.

3.3.1.3 Local-Local Contrasting

GRACE [26]. Rather than contrasting global-global views as GraphCL [25] and CSSL [34] do, GRACE focuses on contrasting views at the node level. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first generates two augmented graphs g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}). Then it applies a shared encoder fθ()f_{\theta}(\cdot) to generate their node embedding matrices 𝐇(1)=fθ(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Finally, the pairwise objective for each positive pair (𝐡i(1),𝐡i(2))(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i}) is defined as follows

(𝐡i(1),𝐡i(2))=loge𝒟(𝐡i(1),𝐡i(2))/τe𝒟(𝐡i(1),𝐡i(2))/τ+Neg\mathcal{L}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})=\log\frac{e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})/\tau}}{e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})/\tau}+Neg} (30)

where NegNeg is defined as

Neg=k=1N𝟏ki[e𝒟(𝐡i(1),𝐡k(1))/τ+e𝒟(𝐡i(1),𝐡k(2))/τ]Neg=\sum_{k=1}^{N}\mathbf{1}_{k\neq i}\left[e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(1)}_{k})/\tau}+e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{k})/\tau}\right] (31)

where e𝒟(𝐡i(1),𝐡k(1))/τe^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(1)}_{k})/\tau} is the intra-view negative pair and e𝒟(𝐡i(1),𝐡k(2))/τe^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{k})/\tau} is the inter-view negative pair. The overall objective to be maximized is then defined as,

maxθ12Ni=1N[(𝐡i(1),𝐡i(2))+(𝐡i(2),𝐡i(1))]\max_{\theta}\frac{1}{2N}\sum_{i=1}^{N}\left[\mathcal{L}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})+\mathcal{L}(\mathbf{h}^{(2)}_{i},\mathbf{h}^{(1)}_{i})\right] (32)
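
A hedged PyTorch sketch of Equs. 30-32 is given below, using cosine similarity as 𝒟(,)\mathcal{D}(\cdot,\cdot) and random matrices in place of the two views' encoder outputs; it illustrates the intra-view and inter-view negatives but omits the data augmentation and encoder.

```python
import torch
import torch.nn.functional as F

def grace_loss(h1, h2, tau=0.5):
    """Sketch of Equs. 30-32: symmetric node-level contrastive loss with both
    intra-view and inter-view negatives; h1, h2 are the two views' embeddings."""
    def half(a, b):
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        inter = torch.exp(a @ b.t() / tau)               # cross-view similarities
        intra = torch.exp(a @ a.t() / tau)               # same-view similarities
        pos = inter.diag()                               # positive pair (h_i^1, h_i^2)
        neg = inter.sum(1) - pos + intra.sum(1) - intra.diag()   # Equ. 31
        return -torch.log(pos / (pos + neg)).mean()
    return 0.5 * (half(h1, h2) + half(h2, h1))           # symmetric objective, Equ. 32

H1, H2 = torch.randn(100, 64), torch.randn(100, 64)      # stand-ins for f_theta on two views
loss = grace_loss(H1, H2)
```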

GCA [40] and GROC [43] adopt the same framework and objective as GRACE but with more flexible and adaptive data augmentation strategies. The framework proposed by SEPT [46] is similar to GRACE, but it is specifically designed for the specific downstream task (recommendation) by combining cross-view contrastive learning with semi-supervised tri-training. Technically, SEPT first augments the user data with the user social information, and then it builds three graph encoders upon the augmented views, with one for recommendation and the other two used to predict unlabeled users. Given a certain user, SEPT takes those nodes whose predicted labels are highly consistent with the target user as positive samples and then encourages the consistency between the target user and positive samples.

Cross-layer Contrasting (GMI) [16]. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), a graph encoder fθ()f_{\theta}(\cdot) is applied to obtain the node embedding matrix 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}). Then the Cross-layer Node Contrasting can be defined as

maxθ1Ni=1N(𝐡i,𝐱i)\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{x}_{i}\right) (33)

where the negative samples to contrast with 𝐡i\mathbf{h}_{i} are Neg(𝐡i)={𝐱jvj𝒩i}Neg(\mathbf{h}_{i})=\{\mathbf{x}_{j}\mid v_{j}\in\mathcal{N}_{i}\}. Similarly, the Cross-layer Edge Contrasting can be defined as

maxθ1Ni=1Nvj𝒩i(𝐰i,j,𝐀i,j)\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\sum_{v_{j}\in\mathcal{N}_{i}}\mathcal{MI}\left(\mathbf{w}_{i,j},\mathbf{A}_{i,j}\right) (34)

where 𝐰i,j=σ(𝐡i𝐡jT)\mathbf{w}_{i,j}=\sigma(\mathbf{h}_{i}\mathbf{h}_{j}^{T}), and the negative samples to contrast with 𝐰i,j\mathbf{w}_{i,j} are Neg(𝐰i,j)={𝐀i,kvk𝒩iandkj}Neg(\mathbf{w}_{i,j})=\{\mathbf{A}_{i,k}\mid v_{k}\in\mathcal{N}_{i}\quad\text{and}\quad k\neq j\}.

STDGI [28] extends the idea of mutual information maximization to spatial-temporal graphs. Specifically, given two graphs gt=(𝐀,𝐗(t))g_{t}=(\mathbf{A},\mathbf{X}^{(t)}) and gt+k=(𝐀,𝐗(t+k))g_{t+k}=(\mathbf{A},\mathbf{X}^{(t+k)}) at time steps tt and t+kt+k, a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain the node embedding matrix 𝐇(t)=fθ(𝐀,𝐗(t))\mathbf{H}^{(t)}=f_{\theta}(\mathbf{A},\mathbf{X}^{(t)}). Besides, it generates an augmented graph g~t+k=(𝐀,𝐗~(t+k))=𝒯(𝐀,𝐗(t+k))\widetilde{g}_{t+k}=(\mathbf{A},\widetilde{\mathbf{X}}^{(t+k)})=\mathcal{T}(\mathbf{A},\mathbf{X}^{(t+k)}) by randomly permuting the node features. Finally, the learning objective is defined as follows

maxθ1Ni=1N(𝐡i(t),𝐱i(t+k))\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\mathcal{MI}\left(\mathbf{h}^{(t)}_{i},\mathbf{x}^{(t+k)}_{i}\right) (35)

where the negative samples to contrast with 𝐡i(t)\mathbf{h}^{(t)}_{i} is Neg(𝐡i(t))=𝐱~i(t+k)Neg(\mathbf{h}^{(t)}_{i})=\widetilde{\mathbf{x}}^{(t+k)}_{i}.

BGRL [27]. Inspired by BYOL, BGRL proposes to perform self-supervised learning that does not require negative samples, thus getting rid of a potentially quadratic bottleneck. Specifically, given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first generates two augmented graph views g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}). Then it applies two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) to generate their node embedding matrices 𝐇(1)=fθ1(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ2(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Moreover, a node-level prediction head gω()g_{\omega}(\cdot) is used to output 𝐙(1)=gω(𝐇(1))\mathbf{Z}^{(1)}=g_{\omega}(\mathbf{H}^{(1)}). Finally, the learning objective is defined as follows

maxθ1,ω1Ni=1N𝐳i(1)(𝐡i(2))T𝐳i(1)𝐡i(2)\max_{\theta_{1},\omega}\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{z}^{(1)}_{i}(\mathbf{h}^{(2)}_{i})^{T}}{\|\mathbf{z}^{(1)}_{i}\|\|\mathbf{h}^{(2)}_{i}\|} (36)

where the parameter θ2\theta_{2} are updated as an exponential moving average (EMA) of parameters θ1\theta_{1}, as done in Equ. 28.
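
The following hedged PyTorch sketch shows one BGRL-style update step of Equ. 36 with linear placeholder encoders applied directly to raw features: the online branch (with the predictor gω()g_{\omega}(\cdot)) is trained by gradient descent, while the target branch is updated by the exponential moving average of Equ. 28; all modules, data, and constants are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of Equ. 36: the online encoder + predictor is trained to match the target
# encoder's embeddings via cosine similarity; no negative samples are needed.
online, target = nn.Linear(32, 16), nn.Linear(32, 16)    # placeholders for f_theta1, f_theta2
target.load_state_dict(online.state_dict())
predictor = nn.Linear(16, 16)                            # prediction head g_omega
opt = torch.optim.Adam(list(online.parameters()) + list(predictor.parameters()), lr=1e-3)

X1, X2 = torch.randn(50, 32), torch.randn(50, 32)        # node features of the two views

z1 = predictor(online(X1))                               # Z^{(1)} = g_omega(H^{(1)})
with torch.no_grad():
    h2 = target(X2)                                      # H^{(2)} from the target branch
loss = -F.cosine_similarity(z1, h2, dim=-1).mean()       # maximize Equ. 36
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                    # EMA update of theta_2 (Equ. 28)
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(0.99).add_(p_o, alpha=0.01)
```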

SelfGNN [35] differs from BGRL only in the definition of the objective function. Unlike Equ. 36, SelfGNN defines the implicit contrastive term directly in the form of MSE,

minθ1,ω1Ni=1N𝐳i(1)𝐡i(2)2\min_{\theta_{1},\omega}\frac{1}{N}\sum_{i=1}^{N}\|\mathbf{z}^{(1)}_{i}-\mathbf{h}^{(2)}_{i}\|^{2} (37)

HeCo [47]. Consider a meta-path Φk\Phi_{k} from the meta-path set {Φk}k=1K\{\Phi_{k}\}_{k=1}^{K}; if there exists a meta-path Φk\Phi_{k} between node viv_{i} and node vjv_{j}, then vjv_{j} can be considered to be in the meta-path-based neighborhood 𝒩iΦk\mathcal{N}_{i}^{\Phi_{k}} of node viv_{i}, which yields a meta-path-based adjacency matrix 𝐀Φk\mathbf{A}^{\Phi_{k}}. HeCo first applies two graph encoders fθ1sc()f_{\theta_{1}}^{sc}(\cdot) and fθ2mp()f_{\theta_{2}}^{mp}(\cdot) to obtain node embedding matrices 𝐇sc=fθ1sc(𝐀,𝐗)\mathbf{H}^{sc}=f_{\theta_{1}}^{sc}(\mathbf{A},\mathbf{X}) and 𝐇mp=fθ2mp({𝐀Φk}k=1K,𝐗)\mathbf{H}^{mp}=f_{\theta_{2}}^{mp}(\{\mathbf{A}^{\Phi_{k}}\}_{k=1}^{K},\mathbf{X}). To define positive and negative samples, HeCo first defines a function i(j)=k=1K𝕀(j𝒩iΦk)\mathbb{C}_{i}(j)=\sum_{k=1}^{K}\mathbb{I}\left(j\in\mathcal{N}_{i}^{\Phi_{k}}\right) to count the number of meta-paths connecting nodes viv_{i} and vjv_{j}. Then it constructs a set 𝒮i={jj𝒱 and i(j)0}\mathcal{S}_{i}=\left\{j\mid j\in\mathcal{V}\text{ and }\mathbb{C}_{i}(j)\neq 0\right\} and sorts it in descending order based on the value of i(j)\mathbb{C}_{i}(j). Next it selects the top TposT_{pos} nodes from 𝒮i\mathcal{S}_{i} as positive samples i\mathbb{P}_{i} and treats the rest as negative samples i\mathbb{N}_{i} directly. Finally, the learning objective can be defined as follows

maxθ1,θ21Ni=1Nlogvjie𝒟(𝐡isc,𝐡jmp)/τvk{ii}e𝒟(𝐡isc,𝐡kmp)/τ\max_{\theta_{1},\theta_{2}}\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_{v_{j}\in\mathbb{P}_{i}}e^{\mathcal{D}(\mathbf{h}^{sc}_{i},\mathbf{h}^{mp}_{j})/\tau}}{\sum_{v_{k}\in\{\mathbb{P}_{i}\cup\mathbb{N}_{i}\}}e^{\mathcal{D}(\mathbf{h}^{sc}_{i},\mathbf{h}^{mp}_{k})/\tau}} (38)

3.3.2 Contrasting with the cross-scale

Based on the different scales of the two views, we further refine cross-scale contrasting into three categories: local-global, local-context, and context-global contrasting.

3.3.2.1 Local-Global Contrasting

Deep Graph Infomax (DGI) [9] is proposed to contrast the patch representations with the corresponding high-level summary of the graph. First, it applies an augmentation transformation 𝒯()\mathcal{T}(\cdot) to obtain an augmented graph g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then it passes these two graphs through two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) to obtain node embedding matrices 𝐇~=fθ1(𝐀~,𝐗~)\widetilde{\mathbf{H}}=f_{\theta_{1}}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}) and 𝐇=fθ2(𝐀,𝐗)\mathbf{H}=f_{\theta_{2}}(\mathbf{A},\mathbf{X}), respectively. Besides, a READOUT\operatorname{READOUT} function is applied to obtain the graph-level representation 𝐡~g~=READOUT(𝐇~)\widetilde{\mathbf{h}}_{\widetilde{g}}=\operatorname{READOUT}(\widetilde{\mathbf{H}}). Finally, the learning objective is defined as follows

maxθ1,θ21Nvi𝒱(𝐡~g~,𝐡i)\max_{\theta_{1},\theta_{2}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\widetilde{\mathbf{h}}_{\widetilde{g}},\mathbf{h}_{i}\right) (39)

where 𝐡i\mathbf{h}_{i} is the node embedding of node viv_{i}, and the negative samples to contrast with 𝐡~g~\widetilde{\mathbf{h}}_{\widetilde{g}} are Neg(𝐡~g~)={𝐡j}vj𝒱,jiNeg(\widetilde{\mathbf{h}}_{\widetilde{g}})=\{\mathbf{h}_{j}\}_{v_{j}\in\mathcal{V},j\neq i}.

MVGRL [18] maximizes the mutual information between the cross-view representations of nodes and graphs. Given a g=(𝐀,𝐗)𝒢g=(\mathbf{A},\mathbf{X})\in\mathcal{G}, it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}) and then samples two subgraphs g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}) from it. Then two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) and a projection head gω1()g_{\omega_{1}}(\cdot) are applied to obtain node embedding matrices 𝐇(1)=gω1(fθ1(𝐀(1),𝐗(1)))\mathbf{H}^{(1)}=g_{\omega_{1}}(f_{\theta_{1}}(\mathbf{A}^{(1)},\mathbf{X}^{(1)})) and 𝐇(2)=gω1(fθ2(𝐀(2),𝐗(2)))\mathbf{H}^{(2)}=g_{\omega_{1}}(f_{\theta_{2}}(\mathbf{A}^{(2)},\mathbf{X}^{(2)})). In addition, a READOUT\operatorname{READOUT} function and another projection head gω2()g_{\omega_{2}}(\cdot) are used to obtain graph-level representations 𝐡g(1)=gω2(READOUT(𝐇(1)))\mathbf{h}_{g}^{(1)}=g_{\omega_{2}}(\operatorname{READOUT}(\mathbf{H}^{(1)})) and 𝐡g(2)=gω2(READOUT(𝐇(2)))\mathbf{h}_{g}^{(2)}=g_{\omega_{2}}(\operatorname{READOUT}(\mathbf{H}^{(2)})). The learning objective is defined as follows:

maxθ1,θ2,ω1,ω21Nvi𝒱[(𝐡g(1),𝐡i(2))+(𝐡g(2),𝐡i(1))]\displaystyle\max_{\theta_{1},\theta_{2},\omega_{1},\omega_{2}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left[\mathcal{MI}(\mathbf{h}_{g}^{(1)},\mathbf{h}_{i}^{(2)})+\mathcal{MI}(\mathbf{h}_{g}^{(2)},\mathbf{h}_{i}^{(1)})\right] (40)

where the negative samples to contrast with 𝐡g(1)\mathbf{h}_{g}^{(1)} are Neg(𝐡g(1))={𝐡j(2)}vj𝒱,jiNeg(\mathbf{h}_{g}^{(1)})=\{\mathbf{h}_{j}^{(2)}\}_{v_{j}\in\mathcal{V},j\neq i} and the negative samples to contrast with 𝐡g(2)\mathbf{h}_{g}^{(2)} are Neg(𝐡g(2))={𝐡j(1)}vj𝒱,jiNeg(\mathbf{h}_{g}^{(2)})=\{\mathbf{h}_{j}^{(1)}\}_{v_{j}\in\mathcal{V},j\neq i}.

3.3.2.2 Local-Context Contrasting

SUBG-CON [19] is proposed to utilize the strong correlation between central (anchor) nodes and their surrounding subgraphs to capture contextual structure information. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), SUBG-CON first picks an anchor node set 𝒮\mathcal{S} from 𝒱\mathcal{V} and then samples their context subgraphs {gi=(𝐀(i),𝐗(i))}i=1|𝒮|\{g_{i}=(\mathbf{A}^{(i)},\mathbf{X}^{(i)})\}_{i=1}^{|\mathcal{S}|} by an importance sampling strategy. Then a shared graph encoder fθ()f_{\theta}(\cdot) and a READOUT\operatorname{READOUT} function are applied to obtain node embedding matrices {𝐇(1),𝐇(2),⋯,𝐇(|𝒮|)}\{\mathbf{H}^{(1)},\mathbf{H}^{(2)},\cdots,\mathbf{H}^{(|\mathcal{S}|)}\} where 𝐇(i)=fθ(𝐀(i),𝐗(i))\mathbf{H}^{(i)}=f_{\theta}(\mathbf{A}^{(i)},\mathbf{X}^{(i)}) and graph-level representations {𝐡g1,𝐡g2,⋯,𝐡g|𝒮|}\{\mathbf{h}_{g_{1}},\mathbf{h}_{g_{2}},\cdots,\mathbf{h}_{g_{|\mathcal{S}|}}\} where 𝐡gi=READOUT(𝐇(i))\mathbf{h}_{g_{i}}=\operatorname{READOUT}(\mathbf{H}^{(i)}). Finally, the learning objective is defined as follows

maxθ1|𝒮|vi𝒮(𝐡i(i),𝐡gi)\max_{\theta}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\mathcal{MI}\left(\mathbf{h}^{(i)}_{i},\mathbf{h}_{g_{i}}\right) (41)

where 𝐡i(i)\mathbf{h}^{(i)}_{i} is the node representation of anchor node viv_{i} in the node embedding matrix 𝐇(i)\mathbf{H}^{(i)}. The negative samples to contrast with 𝐡i(i)\mathbf{h}^{(i)}_{i} are Neg(𝐡i(i))={𝐡gj}vj𝒮,jiNeg(\mathbf{h}^{(i)}_{i})=\{\mathbf{h}_{g_{j}}\}_{v_{j}\in\mathcal{S},j\neq i}.

Graph InfoClust (GIC) [48] relies on a framework similar to DGI [9]. However, in addition to contrasting local-global views, GIC also maximizes the MI between node representations and their corresponding cluster embeddings. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain node embedding matrices 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐇~=fθ(𝐀~,𝐗~)\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}). Furthermore, an unsupervised clustering algorithm is used to group nodes into KK clusters 𝒞={C1,C2,⋯,CK}\mathcal{C}=\{C_{1},C_{2},\cdots,C_{K}\}, and the cluster centers are obtained by 𝝁k=1|Ck|∑viCk𝐡i\bm{\mu}_{k}=\frac{1}{|C_{k}|}\sum_{v_{i}\in C_{k}}\mathbf{h}_{i} where 1kK1\leq k\leq K. To compute the cluster embedding 𝐳i\mathbf{z}_{i} for each node viv_{i}, it applies a weighted average of the summaries of the cluster centers to which node viv_{i} belongs, 𝐳i=σ(∑k=1Krik𝝁k)\mathbf{z}_{i}=\sigma\left(\sum_{k=1}^{K}r_{ik}\bm{\mu}_{k}\right), where rikr_{ik} is the probability that node viv_{i} is assigned to cluster kk, i.e., a soft-assignment value with ∑krik=1,∀i\sum_{k}r_{ik}=1,\forall i. For example, rikr_{ik} can be defined as rik=exp(𝐡i𝝁kT)∑j=1Kexp(𝐡i𝝁jT)r_{ik}=\frac{\exp(\mathbf{h}_{i}\bm{\mu}_{k}^{T})}{\sum_{j=1}^{K}\exp(\mathbf{h}_{i}\bm{\mu}_{j}^{T})}. Finally, the learning objective is defined as follows

maxθ1Nvi𝒱(𝐡i,𝐳i)\max_{\theta}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{z}_{i}\right) (42)

where the negative samples to contrast with 𝐡i\mathbf{h}_{i} are Neg(𝐡i)={𝐳j}vj𝒱,jiNeg(\mathbf{h}_{i})=\{\mathbf{z}_{j}\}_{v_{j}\in\mathcal{V},j\neq i}.
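The soft assignments and cluster embeddings used in Equ. 42 can be computed as in the following sketch, assuming σ is a sigmoid nonlinearity and the cluster centers μ are provided by an external clustering step.

import torch

def cluster_embeddings(h: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # h: [N, d] node embeddings, mu: [K, d] cluster centers.
    # r_{ik} = softmax_k(h_i . mu_k^T); z_i = sigma(sum_k r_{ik} mu_k)
    r = torch.softmax(h @ mu.t(), dim=1)   # [N, K] soft assignments, each row sums to 1
    z = torch.sigmoid(r @ mu)              # [N, d] cluster embeddings
    return z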

3.3.2.3 Context-Global Contrasting

MICRO-Graph [39]. The key challenge in conducting subgraph-level contrasting is to sample semantically informative subgraphs. For molecular graphs, the graph motifs, which are frequently-occurring subgraph patterns (e.g., functional groups), can be exploited for better subgraph sampling. Specifically, motif learning is formulated as a differentiable clustering problem, and EM-clustering is adopted to group significant subgraphs into several motifs, thus obtaining a motif table. Given two graphs g(1)=(𝐀(1),𝐗(1)),g(2)=(𝐀(2),𝐗(2))𝒢g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)}),g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})\in\mathcal{G}, it first applies a shared graph encoder fθ()f_{\theta}(\cdot) to learn their node embedding matrices 𝐇(1)=fθ(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Then it leverages the learned motif table to sample KK motif-like subgraphs from g(1)g^{(1)} and g(2)g^{(2)} and obtains their corresponding embedding matrices {𝐇1(1),𝐇2(1),⋯,𝐇K(1)}\{\mathbf{H}^{(1)}_{1},\mathbf{H}^{(1)}_{2},\cdots,\mathbf{H}^{(1)}_{K}\} and {𝐇1(2),𝐇2(2),⋯,𝐇K(2)}\{\mathbf{H}^{(2)}_{1},\mathbf{H}^{(2)}_{2},\cdots,\mathbf{H}^{(2)}_{K}\}. Then a READOUT\operatorname{READOUT} function is applied to obtain graph-level and subgraph-level representations, denoted as 𝐡g(1)\mathbf{h}^{(1)}_{g}, {𝐡1(1),𝐡2(1),⋯,𝐡K(1)}\{\mathbf{h}^{(1)}_{1},\mathbf{h}^{(1)}_{2},\cdots,\mathbf{h}^{(1)}_{K}\} and 𝐡g(2)\mathbf{h}^{(2)}_{g}, {𝐡1(2),𝐡2(2),⋯,𝐡K(2)}\{\mathbf{h}^{(2)}_{1},\mathbf{h}^{(2)}_{2},\cdots,\mathbf{h}^{(2)}_{K}\}. Finally, the objective is defined as

maxθ1|𝒢|g𝒢k=1K[(𝐡g(1),𝐡k(1))+(𝐡g(2),𝐡k(2))]\displaystyle\max_{\theta}\frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\sum_{k=1}^{K}\left[\mathcal{MI}(\mathbf{h}_{g}^{(1)},\mathbf{h}_{k}^{(1)})+\mathcal{MI}(\mathbf{h}_{g}^{(2)},\mathbf{h}_{k}^{(2)})\right] (43)

where the negative samples to contrast with 𝐡g(1)\mathbf{h}_{g}^{(1)} are Neg(𝐡g(1))={𝐡j(2)}j=1KNeg(\mathbf{h}_{g}^{(1)})=\{\mathbf{h}_{j}^{(2)}\}_{j=1}^{K} and the negative samples to contrast with 𝐡g(2)\mathbf{h}_{g}^{(2)} are Neg(𝐡g(2))={𝐡j(1)}j=1KNeg(\mathbf{h}_{g}^{(2)})=\{\mathbf{h}_{j}^{(1)}\}_{j=1}^{K}.

InfoGraph [49] aims to obtain embeddings at the whole-graph level for self-supervised learning. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared LL-layer graph encoder fθ()f_{\theta}(\cdot) is applied to learn the node embedding matrix sequences {𝐇(l)}l=1L\{\mathbf{H}^{(l)}\}_{l=1}^{L} and {𝐇~(l)}l=1L\{\widetilde{\mathbf{H}}^{(l)}\}_{l=1}^{L} obtained from each layer. Then it concatenates the representations learned from each layer, 𝐡i=CONCAT({𝐡i(l)}l=1L)\mathbf{h}_{i}=\text{CONCAT}(\{\mathbf{h}^{(l)}_{i}\}_{l=1}^{L}) and 𝐡~i=CONCAT({𝐡~i(l)}l=1L)\widetilde{\mathbf{h}}_{i}=\text{CONCAT}(\{\widetilde{\mathbf{h}}^{(l)}_{i}\}_{l=1}^{L}), where 𝐡i(l)\mathbf{h}^{(l)}_{i} is the embedding of node viv_{i} in the node embedding matrix 𝐇(l)\mathbf{H}^{(l)} obtained from the ll-th layer of the graph encoder. In addition, a READOUT\operatorname{READOUT} function is used to obtain the graph-level representation 𝐡g=READOUT({𝐡i}i=1N)\mathbf{h}_{g}=\operatorname{READOUT}(\{\mathbf{h}_{i}\}_{i=1}^{N}). Finally, the learning objective is defined as follows

maxθg𝒢1|g|vig(𝐡g,𝐡i)\max_{\theta}\sum_{g\in\mathcal{G}}\frac{1}{|g|}\sum_{v_{i}\in g}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i}\right) (44)

where the negative samples to contrast with 𝐡g\mathbf{h}_{g} are the node representations of the augmented graph, Neg(𝐡g)={𝐡~i}vi𝒱Neg(\mathbf{h}_{g})=\{\widetilde{\mathbf{h}}_{i}\}_{v_{i}\in\mathcal{V}}.

BiGi [37] is specifically designed for bipartite graphs, where the class label yi{0,1}y_{i}\in\{0,1\} of each node viv_{i} is already known. For a given g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies a structure-based augmentation to obtain g~=(𝐀~,𝐗)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\mathbf{X})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐇~=fθ(𝐀~,𝐗)\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\mathbf{X}). Besides, it can obtain the graph-level representation from 𝐇\mathbf{H} directly as follows

\mathbf{h}_{g}=\Big[\sigma\Big(\frac{1}{|\mathcal{V}^{(1)}|}\sum_{v_{i}\in\mathcal{V}^{(1)}}\mathbf{h}_{i}\Big)\Big\|\sigma\Big(\frac{1}{|\mathcal{V}^{(2)}|}\sum_{v_{i}\in\mathcal{V}^{(2)}}\mathbf{h}_{i}\Big)\Big] (45)

where 𝒱(1)={vi|vi𝒱,yi=0}\mathcal{V}^{(1)}=\{v_{i}|v_{i}\in\mathcal{V},y_{i}=0\} and 𝒱(2)={vi|vi𝒱,yi=1}\mathcal{V}^{(2)}=\{v_{i}|v_{i}\in\mathcal{V},y_{i}=1\}. For a given edge (vi,vj)(v_{i},v_{j})\in\mathcal{E}, it first performs ego-net sampling to obtain two subgraphs (centered at nodes viv_{i} and vjv_{j}), and then gets their node feature matrices 𝐇(i)\mathbf{H}^{(i)} and 𝐇(j)\mathbf{H}^{(j)} from 𝐇\mathbf{H} directly. Then a subgraph-level attention module (similar to GAT) is applied to obtain two subgraph-level representations 𝐡i=Attγ(𝐇(i))\mathbf{h}_{i}=Att_{\gamma}(\mathbf{H}^{(i)}) and 𝐡j=Attγ(𝐇(j))\mathbf{h}_{j}=Att_{\gamma}(\mathbf{H}^{(j)}). Finally, 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} are fused to obtain 𝐡i,j\mathbf{h}_{i,j} = [𝐡i𝐡j][\mathbf{h}_{i}\|\mathbf{h}_{j}]. Similarly, it can obtain the fused representation 𝐡~i,j\widetilde{\mathbf{h}}_{i,j} from 𝐇~\widetilde{\mathbf{H}}. Finally, the learning objective is defined as follows:

maxθ,γ1||(vi,vj)(𝐡g,𝐡i,j)\max_{\theta,\gamma}\frac{1}{|\mathcal{E}|}\sum_{(v_{i},v_{j})\in\mathcal{E}}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i,j}\right) (46)

where the negative sample is defined as Neg(𝐡g)=𝐡~i,jNeg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}_{i,j}.

3.4 Contrastive Objectives

The main way to optimize contrastive learning is to treat two representations (views) 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} as random variables and maximize their mutual information, given by

(𝐡i,𝐡j)=𝔼p(𝐡i,𝐡j)[logp(𝐡i,𝐡j)p(𝐡i)p(𝐡j)]\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j})=\mathbb{E}_{p(\mathbf{h}_{i},\mathbf{h}_{j})}\left[\log\frac{p(\mathbf{h}_{i},\mathbf{h}_{j})}{p(\mathbf{h}_{i})p(\mathbf{h}_{j})}\right] (47)

To computationally estimate the mutual information in contrastive learning, three lower-bound forms of the mutual information are derived, and the mutual information is then maximized indirectly by maximizing these lower bounds.

Donsker-Varadhan Estimator [50] is one lower bound of the mutual information, defined as

DV(𝐡i,𝐡j)=\displaystyle\mathcal{MI}_{DV}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)= 𝔼p(𝐡i,𝐡j)[𝒟(𝐡i,𝐡j)]\displaystyle\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\left[\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right] (48)
log𝔼p(𝐡i)p(𝐡j)[e𝒟(𝐡i,𝐡j)]\displaystyle-\log\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\left[e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\right]

where p(𝐡i,𝐡j)p(\mathbf{h}_{i},\mathbf{h}_{j}) denotes the joint distribution of two representations 𝐡i\mathbf{h}_{i}, 𝐡j\mathbf{h}_{j}, and p(𝐡i)p(𝐡j)p(\mathbf{h}_{i})p(\mathbf{h}_{j}) denotes the product of marginals. 𝒟:q×q\mathcal{D}:\mathbb{R}^{q}\times\mathbb{R}^{q}\rightarrow\mathbb{R} is a discriminator that maps two views 𝐡i\mathbf{h}_{i}, 𝐡j\mathbf{h}_{j} to an agreement score. Generally, the discriminator 𝒟\mathcal{D} can optionally apply an additional prediction head gω()g_{\omega}(\cdot) to map 𝐡i\mathbf{h}_{i} to 𝐳i=gω(𝐡i)\mathbf{z}_{i}=g_{\omega}(\mathbf{h}_{i}) before computing agreement scores, where gω()g_{\omega}(\cdot) can be a linear mapping, a nonlinear mapping (e.g., MLP), or even a non-parametric identity mapping (𝐳i\mathbf{z}_{i} = 𝐡i\mathbf{h}_{i}). The discriminator 𝒟\mathcal{D} can take various forms, e.g., the standard inner product 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\mathbf{z}_{i}^{T}\mathbf{z}_{j}, the inner product 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j/τ\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\mathbf{z}_{i}^{T}\mathbf{z}_{j}/\tau with temperature parameter τ\tau, the cosine similarity 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j𝐳i𝐳j\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\frac{\mathbf{z}_{i}^{T}\mathbf{z}_{j}}{||\mathbf{z}_{i}||||\mathbf{z}_{j}||}, or the Gaussian similarity 𝒟(𝐳i,𝐳j)=exp(𝐳i𝐳j222σ2)\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\exp\left(-\frac{||\mathbf{z}_{i}-\mathbf{z}_{j}||_{2}^{2}}{2\sigma^{2}}\right).
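For reference, the discriminator variants listed above can be written in a few lines of PyTorch; z_i and z_j are batched embedding tensors, and tau and sigma are hyperparameters.

import torch
import torch.nn.functional as F

def inner_product(z_i, z_j):
    return (z_i * z_j).sum(dim=-1)                 # z_i^T z_j

def scaled_inner_product(z_i, z_j, tau=0.5):
    return (z_i * z_j).sum(dim=-1) / tau           # z_i^T z_j / tau

def cosine(z_i, z_j):
    return F.cosine_similarity(z_i, z_j, dim=-1)   # z_i^T z_j / (||z_i|| ||z_j||)

def gaussian(z_i, z_j, sigma=1.0):
    return torch.exp(-((z_i - z_j) ** 2).sum(dim=-1) / (2 * sigma ** 2))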

Jensen-Shannon Estimator. Replacing the KL-divergence in Equ. 47 with the JS-divergence yields the Jensen-Shannon (JS) estimator [51], which can estimate and optimize the mutual information more efficiently. The JS estimator is defined as

\mathcal{MI}_{JS}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\bigg[\log\Big(\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg] (49)
+\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\bigg[\log\Big(1-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg]

Let 𝒟(𝐡i,𝐡j)=sigmoid(𝒟′(𝐡i,𝐡j))\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\text{sigmoid}\left(\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right); then Equ. 49 can be rewritten as a softplus (SP) version [18, 49], as follows

\mathcal{MI}_{SP}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\bigg[-sp\Big(-\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg] (50)
-\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\bigg[sp\Big(\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg]

where sp(x)=log(1+ex)sp(x)=\log\left(1+e^{x}\right).

InfoNCE Estimator. InfoNCE [52] is one of the most popular lower bounds of the mutual information, defined as

NCE\displaystyle\mathcal{MI}_{NCE} (𝐡i,𝐡j)=𝔼p(𝐡i,𝐡j)[𝒟(𝐡i,𝐡j)\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\Bigg{[}\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right) (51)
𝔼K𝒫N[log1N𝐡jKe𝒟(𝐡i,𝐡j)]]\displaystyle-\mathbb{E}_{K\sim\mathcal{P}^{N}}\Big{[}\log\frac{1}{N}\sum_{\mathbf{h}_{j}^{\prime}\in K}e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)}\Big{]}\Bigg{]}

where KK consists of NN random variables sampled identically and independently from the distribution 𝒫\mathcal{P}. The NT-Xent loss [53] is a special version of the InfoNCE loss, which defines the discriminator 𝒟\mathcal{D} as 𝒟(𝐡i,𝐡j)=𝐡iT𝐡j/τ\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})=\mathbf{h}_{i}^{T}\mathbf{h}_{j}/\tau with temperature parameter τ\tau. Taking graph classification as an example, fγ()f_{\gamma}(\cdot) is a graph encoder that maps a graph g=(𝐀,𝐗)𝒢g=(\mathbf{A},\mathbf{X})\in\mathcal{G} to a graph-level representation 𝐡g=fγ(𝐀,𝐗)\mathbf{h}_{g}=f_{\gamma}(\mathbf{A},\mathbf{X}). In practice, InfoNCE is computed on a mini-batch 𝑩\boldsymbol{B} of size N+1N+1; then Equ. 51 can be rewritten (with logN\log N discarded) as

NCE=1N+1(𝐀,𝐗)𝑩loge𝒟(𝐡i,𝐡j)(𝐀,𝐗)𝑩e𝒟(𝐡i,𝐡j)\displaystyle\mathcal{MI}_{NCE}=-\frac{1}{N+1}\sum_{(\mathbf{A},\mathbf{X})\in\boldsymbol{B}}\log\frac{e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}}{\sum_{(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\in\boldsymbol{B}}e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)}} (52)

where 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} are the positive pair of views that come from the same graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), and 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j}^{\prime} are the negative pair of views that are computed from g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}) and g=(𝐀,𝐗)g^{\prime}=(\mathbf{A}^{\prime},\mathbf{X}^{\prime}) identically and independently.
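A minimal in-batch implementation of the NT-Xent form of Equ. 52 is sketched below, assuming h1[k] and h2[k] are the two views of the k-th sample in the batch and a cosine-similarity discriminator with temperature tau; all other rows of the second view serve as negatives.

import torch
import torch.nn.functional as F

def nt_xent(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # h1, h2: [B, d] embeddings of two views; positives lie on the diagonal of the score matrix.
    z1, z2 = F.normalize(h1, dim=1), F.normalize(h2, dim=1)
    logits = z1 @ z2.t() / tau                       # D(h_i, h_j) = cosine similarity / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    # cross_entropy with diagonal labels realizes -log e^{D(i,i)} / sum_j e^{D(i,j)} as in Equ. 52
    return F.cross_entropy(logits, labels)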

Triplet Margin Loss. While all three MI estimators above estimate lower bounds of the mutual information, mutual information maximization has been shown not to be a must for contrastive learning [54]. For example, the triplet margin loss [55] can also be used to optimize contrastive learning, but it is not an MI-based contrastive objective, and optimizing it does not guarantee the maximization of mutual information. The triplet margin loss is defined as

triplet\displaystyle\mathcal{L}_{triplet} (𝐡i,𝐡j)=𝔼[(𝐀,𝐗),(𝐀,𝐗)]𝒢×𝒢[\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big{[} (53)
max{𝒟(𝐡i,𝐡j)𝒟(𝐡i,𝐡j)+ϵ,0}]\displaystyle\max\{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)+\epsilon,0\}\Big{]}

where the triplet margin loss does not directly minimize the agreement of the negative sample pair 𝒟(𝐡i,𝐡j)\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}), but only ensures that the agreement of the negative sample pair is smaller than that of the positive sample pair by a margin value ϵ\epsilon. The idea behind this is that when the negative samples are sufficiently far apart, i.e., the agreement between them is small enough, there is no need to further reduce their agreement, which helps to focus the training more on those samples that are hard to distinguish. The quadruplet loss [56] further considers imposing constraints on inter-class samples on top of the triplet margin loss, defined as:

\mathcal{L}_{Quadruplet}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathcal{L}_{triplet}+\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big[ (54)
\max\{\mathcal{D}\left(\mathbf{h}_{i}^{\prime},\mathbf{h}_{j}^{\prime}\right)-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)+\epsilon^{\prime},0\}\Big]

where ϵ\epsilon^{\prime} is a smaller margin value than ϵ\epsilon. The quadruplet loss differs from the triplet margin loss in that it not only uses an anchor-based sampling strategy but also samples negative samples in a more random way, which helps to learn more distinguishable inter-class boundaries.
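A minimal sketch of the triplet margin loss of Equ. 53 with a cosine-similarity discriminator is given below; h_anchor, h_pos, and h_neg are batched anchor, positive, and negative embeddings, and eps is the margin. The quadruplet variant would add an analogous hinge term over randomly sampled negative pairs with the smaller margin eps'.

import torch
import torch.nn.functional as F

def triplet_margin(h_anchor, h_pos, h_neg, eps: float = 0.5) -> torch.Tensor:
    # max{ D(anchor, neg) - D(anchor, pos) + eps, 0 }, with D taken as cosine similarity
    d_pos = F.cosine_similarity(h_anchor, h_pos, dim=-1)
    d_neg = F.cosine_similarity(h_anchor, h_neg, dim=-1)
    return torch.clamp(d_neg - d_pos + eps, min=0).mean()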

RankMI Loss. While both triplet margin loss and quadruplet loss ignore the lower bound of the mutual information, the RankMI Loss [57] seamlessly incorporates information-theoretic approaches into the representation learning and maximizes the mutual information among samples belonging to the same category, defined as:

RankMI\displaystyle\mathcal{MI}_{RankMI} (𝐡i,𝐡j)=𝔼[(𝐀,𝐗),(𝐀,𝐗)]𝒢×𝒢[\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big{[} (55)
𝒟(𝐡i,𝐡j)+log(2e𝒟(𝐡i,𝐡j))]\displaystyle\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})+\log(2-e^{\mathcal{D}(\mathbf{h}_{i}^{\prime},\mathbf{h}_{j}^{\prime})})\Big{]}

As RankMI can incorporate margins based on random positive and negative pairs, the quadruplet loss can be considered as a special case of RankMI with a fixed margin.

4 Generative Learning

Compared with contrastive methods, the generative methods shown in Fig. 1(b) are based on generative models and treat the rich information embedded in the data as natural self-supervision. In generative methods, the prediction head gω()g_{\omega}(\cdot) is usually called the graph decoder, which is used to perform graph reconstruction. Categorized by how the reconstruction is performed, we summarize generative methods into two categories: (1) graph autoencoding, which performs reconstruction in a once-for-all manner; (2) graph autoregressive, which performs reconstruction iteratively. In this section, due to space limitations, we present only some representative generative methods and place those relatively less important works in Appendix B.

4.1 Graph Autoencoding

Since the autoencoder [58] was proposed, it has been widely used as a basic architecture for a variety of image and text data. Given restricted access to the graph, the graph autoencoder is trained to reconstruct certain parts of the input data. Depending on which parts of the input graph are given or restricted, various pretext tasks have been proposed, which will be reviewed one by one next.

Graph Completion [12]. Motivated by the success of image inpainting, graph completion is proposed as a pretext task for graph data. It first masks one node by removing part of its features, and then aims to reconstruct the masked features from the unmasked node features in the neighborhood. For a given node viv_{i}, it randomly masks its features 𝐱i\mathbf{x}_{i} with 𝐱^i=𝐱i𝐦i\widehat{\mathbf{x}}_{i}=\mathbf{x}_{i}\odot\mathbf{m}_{i} to obtain a new node feature matrix 𝐗^\widehat{\mathbf{X}}, and then reconstructs the masked features. More formally,

ssl(θ,𝐀,𝐗^)=fθ(𝐀,𝐗^)vi𝐱i2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} (56)

Here, it just takes one node as an example, and the reconstruction of multiple nodes can be considered in practice. Note that only those unmasked neighborhood nodes can be used to reconstruct the target node for graph completion.

Node Attribute Masking [10] is similar to Graph Completion, but it reconstructs the features of multiple nodes simultaneously, and it no longer requires that the neighboring node features used for reconstruction must be unmasked.
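The following sketch illustrates the masking-and-reconstruction recipe shared by Graph Completion and Node Attribute Masking (Equ. 56); `encoder` is assumed to be a GNN whose output lives in the node feature space, and mask_rate is a hypothetical hyperparameter.

import torch

def masked_feature_loss(encoder, A: torch.Tensor, X: torch.Tensor, mask_rate: float = 0.15):
    # Randomly zero out the features of a subset of nodes and reconstruct them.
    n = X.size(0)
    masked = torch.rand(n) < mask_rate          # nodes whose features are masked
    X_hat = X.clone()
    X_hat[masked] = 0.0
    X_rec = encoder(A, X_hat)                   # f_theta(A, X_hat), same dimensionality as X
    return ((X_rec[masked] - X[masked]) ** 2).mean()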

Edge Attribute Masking [11]. This pretext task is specifically designed for graph data with known edge features, and it enables the GNN to learn more edge relation information. Similarly, it first randomly masks the features of an edge set e\mathcal{M}_{e}. Specifically, it obtains a masked edge feature matrix 𝐗^e\widehat{\mathbf{X}}^{e} where 𝐱^i,je=𝐱i,je𝐦i,j\widehat{\mathbf{x}}^{e}_{i,j}=\mathbf{x}^{e}_{i,j}\odot\mathbf{m}_{i,j} for (vi,vj)e(v_{i},v_{j})\in\mathcal{M}_{e}. More formally,

ssl(θ,𝐀,𝐗,𝐗^e)=1|e|(vi,vj)e𝐱¯i,je𝐱i,je2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X},\widehat{\mathbf{X}}^{e}\right)=\frac{1}{|\mathcal{M}_{e}|}\sum_{(v_{i},v_{j})\in\mathcal{M}_{e}}\left\|\overline{\mathbf{x}}^{e}_{i,j}-\mathbf{x}^{e}_{i,j}\right\|^{2} (57)

where 𝐱¯i,je=(𝐗¯e)i,j\overline{\mathbf{x}}^{e}_{i,j}=(\overline{\mathbf{X}}^{e})_{i,j} and 𝐗¯e=fθ(𝐀,𝐗,𝐗^e)\overline{\mathbf{X}}^{e}=f_{\theta}(\mathbf{A},\mathbf{X},\widehat{\mathbf{X}}^{e}).

Node Attribute Denoising [13]. Different from Node Attribute Masking, this pretext task adds noise to the node features to obtain a noisy node feature matrix 𝐗^=𝐗+N(𝟎,𝚺)\widehat{\mathbf{X}}=\mathbf{X}+N(\mathbf{0},\mathbf{\Sigma}), and then asks the model to reconstruct the clean node features 𝐗\mathbf{X}. More formally,

ssl(θ,𝐀,𝐗^)=1Nvi𝒱fθ(𝐀,𝐗^)vi𝐱i2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} (58)

Adding noise is only one means of corruption; in the image domain, others include blurring, graying, etc. Inspired by this, arbitrary corruption operations 𝒞()\mathcal{C}(\cdot) can be applied to obtain the corrupted features, and the model is then forced to reconstruct them. Different from Node Attribute Denoising, which reconstructs raw features from noisy inputs, Node Embedding Denoising aims to reconstruct clean node features 𝐗\mathbf{X} from noisy embeddings 𝐇^=𝐇+N(𝟎,𝚺)\widehat{\mathbf{H}}=\mathbf{H}+N(\mathbf{0},\mathbf{\Sigma}).

Adjacency Matrix Reconstruction [14]. The graph adjacency matrix is one of the most important sources of information in graph data, as it stores the graph structure and the relations between nodes. This pretext task randomly perturbs part of the edges in the adjacency matrix 𝐀\mathbf{A} to obtain 𝐀^\widehat{\mathbf{A}}, and then requires the model to reconstruct the adjacency matrix of the input graph. More formally,

ssl(θ,𝐀^,𝐗)=1N2i,j(𝐀¯i,j𝐀i,j)2\mathcal{L}_{ssl}\left(\theta,\widehat{\mathbf{A}},\mathbf{X}\right)=\frac{1}{N^{2}}\sum_{i,j}\left(\overline{\mathbf{A}}_{i,j}-\mathbf{A}_{i,j}\right)^{2} (59)

where 𝐀¯=fθ(𝐀^,𝐗)\overline{\mathbf{A}}=f_{\theta}(\widehat{\mathbf{A}},\mathbf{X}). During training, since the adjacency matrix 𝐀\mathbf{A} is usually sparse, cross-entropy can also be used instead of MSE as the loss in practice.
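A minimal sketch of adjacency reconstruction with a cross-entropy loss is given below; it uses a GAE-style inner-product decoder on the node embeddings, which is one common choice rather than necessarily the exact decoder of [14].

import torch
import torch.nn.functional as F

def adj_reconstruction_loss(encoder, A_hat: torch.Tensor, X: torch.Tensor, A: torch.Tensor):
    # Encode the perturbed graph, decode adjacency scores by an inner product, and
    # compare against the original adjacency matrix with binary cross-entropy.
    H = encoder(A_hat, X)                               # f_theta(A_hat, X): [N, d]
    logits = H @ H.t()                                  # reconstructed adjacency scores [N, N]
    return F.binary_cross_entropy_with_logits(logits, A.float())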

Figure 5: A comparison of predictive learning methods: (a) Node Property Prediction, (b) Context-based Prediction, (c) Self-training, and (d) Domain Knowledge-based Prediction. Categorized by how the labels are obtained, we summarize predictive methods for graph data into these four categories. Fig. (a): node property prediction pre-calculates node properties, such as node degree, and uses them as self-supervised labels. Fig. (b): for context-based prediction, the local or global contextual information in the graph, such as the shortest path length between nodes, can be extracted as labels to help with self-supervised learning. Fig. (c): the self-training method applies algorithms such as unsupervised clustering to obtain pseudo-labels and then updates the pseudo-label set of the previous stage based on the prediction results or losses. Fig. (d): for domain knowledge-based prediction, domain knowledge, such as expert knowledge or specialized tools, can be used in advance to obtain informative labels.

4.2 Graph Autoregressive

The classical autoregressive model is a regression model that represents a random variable at a later moment as a combination of random variables from previous moments.

GPT-GNN [32]. In recent years, the idea of GPT [6] has also been introduced into the GNN domain. For example, GPT-GNN proposes an autoregressive framework to perform node and edge reconstruction on a given graph iteratively. Given a graph gt=(𝐀t,𝐗t)g_{t}=(\mathbf{A}_{t},\mathbf{X}_{t}) with its nodes and edges randomly masked in iteration tt, GPT-GNN generates one masked node XiX_{i} and its connected edges EiE_{i} to obtain an updated graph gt+1=(𝐀t+1,𝐗t+1)g_{t+1}=(\mathbf{A}_{t+1},\mathbf{X}_{t+1}) and optimizes the likelihood of the node and edge generation in the next iteration t+1t+1, with the learning objective defined as

pθ(𝐗t+1,𝐀t+1𝐗t,𝐀t)\displaystyle p_{\theta}\left(\mathbf{X}_{t+1},\mathbf{A}_{t+1}\mid\mathbf{X}_{t},\mathbf{A}_{t}\right) (60)
=\displaystyle= opθ(Xi,Ei¬oEio,𝐗t,𝐀t)pθ(Eio𝐗t,𝐀t)\displaystyle\sum_{o}p_{\theta}\left(X_{i},E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)\cdot p_{\theta}\left(E_{i}^{o}\mid\mathbf{X}_{t},\mathbf{A}_{t}\right)
=\displaystyle= 𝔼o[pθ(Xi,Ei¬oEio,𝐗t,𝐀t)]\displaystyle\mathbb{E}_{o}\big{[}p_{\theta}\left(X_{i},E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)\big{]}
=\displaystyle= 𝔼o[pθ(𝐗t+1Eio,𝐗t,𝐀t)pθ(Ei¬oEio,𝐗t+1,𝐀t)]\displaystyle\mathbb{E}_{o}\left[p_{\theta}\big{(}\mathbf{X}_{t+1}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)p_{\theta}(E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t+1},\mathbf{A}_{t})\big{]}

where oo is a variable denoting the index vector of all edges within EiE_{i} in iteration tt. Thus, EioE_{i}^{o} denotes the observed edges in iteration tt, and Ei¬oE_{i}^{\neg o} denotes the masked edges (to be generated) in iteration t+1t+1. Finally, the graph generation process is factorized into a node attribute generation step pθ(𝐗t+1Eio,𝐗t,𝐀t)p_{\theta}\left(\mathbf{X}_{t+1}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right) and an edge generation step pθ(Ei¬oEio,𝐗t+1,𝐀t)p_{\theta}(E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t+1},\mathbf{A}_{t}). In practice, GPT-GNN performs node and edge generation iteratively.

5 Predictive Learning

The contrastive methods deal with the inter-data information (data-data pairs), the generative methods focus on the intra-data information, while the predictive methods aim to self-generate informative labels from the data as supervision and handle the data-label relationships. Categorized by how labels are obtained, we summarize predictive methods into four categories: (1) Node Property Prediction. The properties of nodes, such as node degree, are pre-calculated and used as self-supervised labels to perform prediction tasks. (2) Context-based Prediction. Local or global contextual information in the graph can be extracted as labels to aid self-supervised learning, e.g., by predicting the shortest path length between nodes, the model can capture long-distance dependencies, which is beneficial for downstream tasks such as link prediction. (3) Self-Training. Learning with the pseudo-labels obtained from the prediction or clustering in a previous stage or even randomly assigned. (4) Domain Knowledge-based Prediction. Expert knowledge or specialized tools are used in advance to analyze graph data (e.g., biological or chemical data) to obtain informative labels. A comparison of four predictive methods is shown in Fig. 5. In this section, due to space limitations, we present only some representative predictive methods and place those relatively less important works in Appendix C.

5.1 Node-Property Prediction (NP)

An effective way to perform predictive learning is to take advantage of the extensive implicit numerical properties within the graph, e.g., node properties such as node degree and local clustering coefficient. Specifically, it first defines a mapping Ω:𝒱𝒴\Omega:\mathcal{V}\rightarrow\mathcal{Y} to denote the extraction of statistical labels yi=Ω(𝐀,𝐗)viy_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}} for each node viv_{i} from graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}). The learning objective is then formulated as

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)viyi)2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-y_{i}\Big{)}^{2} (61)

where fθ(𝐀,𝐗)vif_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}} is the predicted label of node viv_{i}. With different node properties, the mapping function Ω()\Omega(\cdot) can have different designs. If node degree is used as a local node property for self-supervision, we have yi=Ω(𝐀,𝐗)vi=∑j=1N𝐀i,jy_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}}=\sum_{j=1}^{N}\mathbf{A}_{i,j}. For the local clustering coefficient, we have

y_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}}=\frac{2\big|\{(v_{m},v_{n})\in\mathcal{E}\mid v_{m}\in\mathcal{N}_{i},v_{n}\in\mathcal{N}_{i}\}\big|}{|\mathcal{N}_{i}|\left(|\mathcal{N}_{i}|-1\right)} (62)

where the local clustering coefficient describes the level of node aggregation around node viv_{i} in a graph. Beyond the above two properties, any other node property (or even a combination of them) can be used as statistical labels to perform the pretext task of Node Property Prediction.
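The statistical labels themselves require no learning; a minimal NumPy sketch for extracting node degrees and local clustering coefficients (Equ. 62) from a dense adjacency matrix is given below.

import numpy as np

def node_property_labels(A: np.ndarray):
    # A: dense {0,1} adjacency matrix of an undirected graph without self-loops.
    degree = A.sum(axis=1)                                  # y_i = sum_j A_ij
    triangles = np.diag(A @ A @ A) / 2.0                    # edges among the neighbors of each node
    denom = np.maximum(degree * (degree - 1) / 2.0, 1.0)    # guard against division by zero
    clustering = triangles / denom                          # 2 * edges / (|N_i| (|N_i| - 1))
    return degree, clustering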

5.2 Context-based Prediction (CP)

Apart from Node Property Prediction, the underlying graph structure information can be further explored to construct a variety of regression-based or classification-based pretext tasks and thus provide self-supervised signals. We refer to this branch of methods as context-based predictive methods because it generally explores contextual information.

S2GRL\text{S}^{2}\text{GRL} [59]. Motivated by the observation that two arbitrary nodes in a graph can interact with each other through paths of different lengths, S2GRL\text{S}^{2}\text{GRL} treats the contextual position of one node relative to the other as a source of free and effective supervisory signals. Specifically, it defines the kk-hop context of node viv_{i} as 𝒞ik={vj|d(vi,vj)=k}(k=1,2,,K)\mathcal{C}_{i}^{k}=\{v_{j}|d(v_{i},v_{j})=k\}(k=1,2,\cdots,K), where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and node vjv_{j}. In this way, for each target node viv_{i}, if a node vj𝒞ikv_{j}\in\mathcal{C}_{i}^{k}, then the hop count kk (relative contextual position) will be assigned to node vjv_{j} as pseudo-label yi,j=ky_{i,j}=k. The learning objective is defined as predicting the hop count between pairs of nodes, as follows

ssl(θ,ω,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right) =1NKvi𝒱k=1Kvj𝒞ik(fw(\displaystyle=\frac{1}{NK}\sum_{v_{i}\in\mathcal{V}}\sum_{k=1}^{K}\sum_{v_{j}\in\mathcal{C}_{i}^{k}}\ell\Big{(}f_{w}\big{(} (63)
fθ(𝐀,𝐗)vi,fθ(𝐀,𝐗)vj),k)\displaystyle f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}\big{)},k\Big{)}

where ()\ell(\cdot) denotes the cross entropy loss and fω()f_{\omega}(\cdot) linearly maps the input to a 1-dimension value. Compared with S2GRL\text{S}^{2}\text{GRL}, PairwiseDistance [10] truncates shortest paths longer than 4, mainly to avoid the excessive computational burden and to prevent very noisy ultra-long pairwise distances from dominating the optimization.
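Hop-count pseudo-labels can be precomputed by a breadth-first search truncated at k_max hops, as in the following sketch; adj_list is assumed to map each node to its list of neighbors.

import collections

def hop_count_labels(adj_list, source: int, k_max: int):
    # BFS from `source`; returns {node: hop count} for nodes within k_max hops,
    # usable as pseudo-labels y_{i,j} = d(v_i, v_j).
    dist = {source: 0}
    queue = collections.deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k_max:
            continue
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    dist.pop(source)
    return dist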

PairwiseAttrSim [10]. Due to the neighborhood-based message passing mechanism, the learned representations of two similar nodes in the graph are not necessarily similar, as opposed to two identical images that will yield the same representations in the image domain. Though we would like to utilize local neighborhoods in GNNs to enhance node feature transformation, we still wish to preserve the node pairwise similarity to some extent, rather than allowing a node's neighborhood to drastically change it. Thus, the pretext task of PairwiseAttrSim can be established to achieve node similarity preservation. Specifically, it first samples the node pairs with the KK highest and lowest similarities, 𝒮i,h\mathcal{S}_{i,h} and 𝒮i,l\mathcal{S}_{i,l}, for node viv_{i}, given by

𝒮i,h\displaystyle\mathcal{S}_{i,h} ={(vi,vj)sij in top-K of {sik}k=1,kiN}\displaystyle=\left\{\left(v_{i},v_{j}\right)\mid s_{ij}\text{ in top-K of }\left\{s_{ik}\right\}_{k=1,k\neq i}^{N}\right\} (64)
𝒮i,l\displaystyle\mathcal{S}_{i,l} ={(vi,vj)sij in bottom-K of {sik}k=1,kiN}\displaystyle=\left\{\left(v_{i},v_{j}\right)\mid s_{ij}\text{ in bottom-K of }\left\{s_{ik}\right\}_{k=1,k\neq i}^{N}\right\}

where sijs_{ij} measures the node feature similarity between node viv_{i} and node vjv_{j} (according to cosine similarity). Let 𝒮i=𝒮i,h𝒮i,l\mathcal{S}_{i}=\mathcal{S}_{i,h}\cup\mathcal{S}_{i,l}; the learning objective can then be formulated as a regression problem, as follows

ssl(θ,ω,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right) =12NKvi𝒱(vm,vn)𝒮i(fw(\displaystyle=\frac{1}{2NK}\sum_{v_{i}\in\mathcal{V}}\sum_{(v_{m},v_{n})\in\mathcal{S}_{i}}\Big{(}f_{w}\big{(} (65)
fθ(𝐀,𝐗)vm,fθ(𝐀,𝐗)vn)sm,n)2\displaystyle f_{\theta}(\mathbf{A},\mathbf{X})_{v_{m}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{n}}\big{)}-s_{m,n}\Big{)}^{2}

where fω()f_{\omega}(\cdot) linearly maps the input to a 1-dimension value.

Distance2Clusters [10]. PairwiseAttrSim applies a sampling strategy to reduce the time complexity, but still involves sorting the node similarities, which is a very time-consuming operation. Inspired by various unsupervised clustering algorithms [60, 61, 62, 63, 64, 65, 66, 67, 68], if a set of clusters can be pre-obtained, PairwiseAttrSim can be further simplified to predicting the shortest path from each node to the anchor nodes associated with cluster centers, resulting in a novel pretext task - Distance2Clusters. Specifically, it first partitions the graph into KK clusters {C1,C2,⋯,CK}\{C_{1},C_{2},\cdots,C_{K}\} by applying some classical unsupervised clustering algorithm. Inside each cluster CkC_{k}, the node with the highest degree is taken as the corresponding cluster center, denoted as ckc_{k} (1kK1\leq k\leq K). Then the distance 𝐝iK\mathbf{d}_{i}\in\mathbb{R}^{K} from node viv_{i} to the cluster centers {ck}k=1K\{c_{k}\}_{k=1}^{K} can be calculated. The learning objective of Distance2Clusters is defined as

ssl(θ,𝐀,𝐗)=1Nvi𝒱fθ(𝐀,𝐗)vi𝐝i2\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{\|}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{d}_{i}\Big{\|}^{2} (66)

Meta-path Prediction [69]. A meta-path of length ll is a sequence of nodes connected with heterogeneous edges, i.e., v1t1v2t2tlvlv_{1}\stackrel{{\scriptstyle t_{1}}}{{\longrightarrow}}v_{2}\stackrel{{\scriptstyle t_{2}}}{{\longrightarrow}}\ldots\stackrel{{\scriptstyle t_{l}}}{{\longrightarrow}}v_{l}, where tl𝒯et_{l}\in\mathcal{T}^{e} denotes the type of the ll-th edge in the meta-path. Given a set of node pairs 𝒮\mathcal{S} sampled from the heterogeneous graph and KK pre-defined meta-path types \mathcal{M}, this pretext task aims to predict whether two nodes (vi,vj)𝒮(v_{i},v_{j})\in\mathcal{S} are connected by one of the meta-path types mm\in\mathcal{M}. Finally, the predictions of the KK meta-paths are formulated as KK binary classification tasks, as follows

ssl(θ,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right) =1K|𝒮|m(vi,vj)𝒮\displaystyle=\frac{1}{K|\mathcal{S}|}\sum_{m\in\mathcal{M}}\sum_{(v_{i},v_{j})\in\mathcal{S}} (67)
(fw(fθ(𝐀,𝐗)vi,fθ(𝐀,𝐗)vj),𝐘i,jm)\displaystyle\ell\Big{(}f_{w}\big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}\big{)},\mathbf{Y}_{i,j}^{m}\Big{)}

where ()\ell(\cdot) denotes the cross entropy loss, and 𝐘i,jm\mathbf{Y}_{i,j}^{m} is the ground-truth label, where 𝐘i,jm=1\mathbf{Y}_{i,j}^{m}=1 if there exists a meta-path mm between node viv_{i} and node vjv_{j}, and 𝐘i,jm=0\mathbf{Y}_{i,j}^{m}=0 otherwise.

SLiCE [70]. Different from the pretext task of Meta-path Prediction [69], which requires pre-defined meta-paths, SLiCE automatically learns the composition of different meta-paths for a specific task. Specifically, it first samples a set of nodes 𝒮\mathcal{S} from the node set 𝒱\mathcal{V}. Given a node vi𝒮v_{i}\in\mathcal{S}, it generates a context subgraph gi=(𝐀i,𝐗i)g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i}) around viv_{i} and encodes the context as a low-dimensional embedding matrix 𝐇i\mathbf{H}_{i}. Then it randomly masks a node vimv_{i}^{m} in graph gig_{i} for prediction. Therefore, the pretext task aims to maximize the probability of observing this masked node vimv_{i}^{m} based on the context gig_{i},

ssl(θ,𝐀,𝐗)=vi𝒮vimgip(vim𝐇i,θ)\begin{split}\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\prod_{v_{i}\in\mathcal{S}}\prod_{v_{i}^{m}\in g_{i}}p\left(v_{i}^{m}\mid\mathbf{H}_{i},\theta\right)\end{split} (68)

where p(θ)p(\cdot\mid\theta) can in practice be approximated by a graph neural network model fθ()f_{\theta}(\cdot) parameterized by θ\theta.

Distance2Labeled [10]. Recent work provides deep insight into existing self-supervised pretext tasks that utilize only attribute and structure information and finds that they are not always beneficial in improving the performance of downstream tasks, possibly because the information mined by the pretext tasks may have been fully exploited during message passing by the GNN model. Thus, given partial information about downstream tasks, such as a small set of labeled nodes, we can explore label-specific self-supervised tasks. For example, we can directly modify the pretext task of Distance2Clusters by combining label information to create a new pretext task - Distance2Labeled. Specifically, it first calculates the average, minimum, and maximum shortest path length from node viv_{i} to all labeled nodes in the classes {Ck}k=1K\{C_{k}\}_{k=1}^{K}, resulting in a distance vector 𝐝i3K\mathbf{d}_{i}\in\mathbb{R}^{3K}. Finally, the learning objective of Distance2Labeled can be formulated as a distance regression problem, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱fθ(𝐀,𝐗)vi𝐝i2\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{\|}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{d}_{i}\Big{\|}^{2} (69)

Compared with Distance2Clusters, Distance2Labeled utilizes task-specific label information rather than additional unsupervised clustering algorithms to find cluster centers, showing advantages in both efficiency and performance.

Figure 6: A summary of graph self-supervised learning (SSL) methods. We categorize them into three branches: contrastive, generative, and predictive. Contrastive methods contrast different views and deal with data-data pair (inter-data) information, and we further categorize them from three aspects: augmentation strategy, pretext task, and objective. In terms of augmentation strategy, they can be divided into four major categories: feature-based augmentation, structure-based augmentation, sampling-based augmentation, and adaptive augmentation. From the perspective of pretext tasks, they can be divided into same-scale contrasting and cross-scale contrasting. The same-scale contrasting includes Local-Local (L-L), Context-Context (C-C), and Global-Global (G-G) methods, while the cross-scale contrasting includes Local-Global (L-G), Local-Context (L-C), and Context-Global (C-G) methods. Generative methods focus on the intra-data information and can be divided into Autoencoding and Autoregressive methods. Predictive methods handle the data-label relationship and can be further divided into four major categories: Node Property Prediction (NP), Context-based Prediction (CP), Self-Training (ST), and Domain Knowledge-based Prediction (DK).

5.3 Self-Training (ST)

For self-training methods, the prediction results from the previous stage can be used as labels to guide the training of the next stage, thus achieving self-training in an iterative way.

Multi-stage Self-training [71]. This pretext task is proposed to leverage the abundant unlabeled nodes to help training. Given both the labeled set 𝒟Lt\mathcal{D}_{L}^{t} and unlabeled set 𝒟Ut\mathcal{D}_{U}^{t} in the iteration step tt, the graph encoder fθ()f_{\theta}(\cdot) is first trained on the labeled set 𝒟Lt\mathcal{D}_{L}^{t}, as follows

node(θ,𝐀,𝐗,𝒟Lt)=(vi,yit)𝒟Lt(fθ(𝐀,𝐗)vi,yit)\mathcal{L}_{node}\left(\theta,\mathbf{A},\mathbf{X},\mathcal{D}_{L}^{t}\right)=\sum_{\left(v_{i},y_{i}^{t}\right)\in\mathcal{D}_{L}^{t}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}^{t}\Big{)} (70)

and then applied to make predictions 𝒴^t={y^itvi𝒟Ut}\widehat{\mathcal{Y}}^{t}=\{\widehat{y}_{i}^{t}\mid v_{i}\in\mathcal{D}_{U}^{t}\} on the unlabeled set 𝒟Ut\mathcal{D}_{U}^{t}. Then the predicted labels (as well as the corresponding nodes) with top-KK confidence

\mathcal{D}_{N}^{t}=\left\{(v_{i},\widehat{y}_{i}^{t})\mid\widehat{y}_{i}^{t}\text{ in top-K confidence of }\widehat{\mathcal{Y}}^{t}\right\} (71)

are considered as the pseudo-labels and moved to the labeled node set 𝒟Lt\mathcal{D}_{L}^{t} to obtain an updated labeled set 𝒟Lt+1=𝒟Lt𝒟Nt\mathcal{D}_{L}^{t+1}=\mathcal{D}_{L}^{t}\bigcup\mathcal{D}_{N}^{t} and an updated unlabeled set 𝒟Ut+1=𝒟Ut∖𝒟Nt\mathcal{D}_{U}^{t+1}=\mathcal{D}_{U}^{t}\setminus\mathcal{D}_{N}^{t}. Finally, a fresh graph encoder is trained on the updated labeled set 𝒟Lt+1\mathcal{D}_{L}^{t+1}, and the above operations are performed multiple times in an iterative manner.
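The multi-stage loop can be summarized by the following sketch; make_model and train_fn are hypothetical helpers that build a fresh encoder and train it on the current labeled set, and k is the number of pseudo-labels added per stage.

import torch

def multi_stage_self_training(make_model, train_fn, A, X, labeled, unlabeled, k, stages):
    # labeled: dict {node: label}; unlabeled: set of node indices without labels.
    model = None
    for _ in range(stages):
        model = make_model()                       # a fresh encoder each stage
        train_fn(model, A, X, labeled)             # supervised training on the current labeled set
        with torch.no_grad():
            probs = torch.softmax(model(A, X), dim=1)
        conf, pred = probs.max(dim=1)
        # move the top-k most confident unlabeled nodes into the labeled set as pseudo-labels
        cand = sorted(unlabeled, key=lambda v: conf[v].item(), reverse=True)[:k]
        for v in cand:
            labeled[v] = int(pred[v])
            unlabeled.discard(v)
    return model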

Node Clustering or Partitioning [12]. Compared to Multi-stage Self-training, the pretext task of Node Clustering pre-assigns a pseudo-label yiy_{i}, e.g., the cluster index, to each node viv_{i} by some unsupervised clustering algorithms. The learning objective of this pretext task is then formulated as a classification problem, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)vi,yi)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}\Big{)} (72)

When node attributes are not available, another choice to obtain pseudo-labels is based on the topology of the given graph structure or adjacency matrix. Specifically, graph partitioning [72, 73] partitions the nodes of a graph into roughly equal subsets, such that the number of edges connecting nodes across subsets is minimized. To absorb the advantages of both attributive- and structural-based clustering, CAGNN [74] combines node clustering and node partitioning to propose a new pretext task. Concretely, it first assigns cluster indices as pseudo-labels but follows a topology refining process that refines the clusters by minimizing the inter-cluster edges.

M3S [75]. Combining Multi-stage Self-training with Node Clustering, M3S applies DeepCluster [68] and an alignment mechanism as a self-checking mechanism, thus providing stronger self-supervision. Specifically, a KK-means clustering algorithm is performed on the node representations 𝐇(t)\mathbf{H}^{(t)} learned in iteration step tt (rather than on 𝐗\mathbf{X}), and the clustered pseudo-labels 𝒟Nt\mathcal{D}_{N}^{t} that match the prediction of the classifier in the last iteration step t1t-1 will be added to the labeled set to obtain an updated labeled set 𝒟Lt+1=𝒟Lt𝒟Nt\mathcal{D}_{L}^{t+1}=\mathcal{D}_{L}^{t}\bigcup\mathcal{D}_{N}^{t}. Finally, a fresh model will be trained on the labeled set 𝒟Lt+1\mathcal{D}_{L}^{t+1} with the objective defined in Equ. 70.

Cluster Preserving [17]. An important characteristic of real-world graphs is the cluster structure, so we can consider cluster preservation as a self-supervised pretext task. Unsupervised clustering algorithms are first applied to group the nodes in a graph into KK non-overlapping clusters {Ck}k=1K\{C_{k}\}_{k=1}^{K}; then the cluster prototypes can be obtained by ck=AGGREGATE({fθ(𝐀,𝐗)viviCk})c_{k}=\text{AGGREGATE}(\{f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}\mid v_{i}\in C_{k}\}). The mapping function gω()g_{\omega}(\cdot) is used to estimate the similarity of node viv_{i} with the cluster prototype ckc_{k}, e.g., the probability y^i,k\widehat{y}_{i,k} that node viv_{i} belongs to cluster CkC_{k} is defined as follows,

y^i,k=exp(gω(fθ(𝐀,𝐗)vi,ck))k=1Kexp(gω(fθ(𝐀,𝐗)vi,ck))\widehat{y}_{i,k}=\frac{\exp\Big{(}g_{\omega}(f_{\theta}\big{(}\mathbf{A},\mathbf{X})_{v_{i}},c_{k}\big{)}\Big{)}}{\sum_{k=1}^{K}\exp\Big{(}g_{\omega}(f_{\theta}\big{(}\mathbf{A},\mathbf{X})_{v_{i}},c_{k}\big{)}\Big{)}} (73)

Finally, the objective of Cluster Preserving is defined as

ssl(θ,𝐀,𝐗)=1Nvi𝒱k=1Kyi,klog(y^i,k)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=-\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\sum_{k=1}^{K}y_{i,k}\log(\widehat{y}_{i,k}) (74)

where the ground-truth label yi,k=1y_{i,k}=1 if node viv_{i} is grouped into cluster CkC_{k}, otherwise yi,k=0y_{i,k}=0.

5.4 Domain Knowledge-based Prediction (DK)

The formation of real-world graphs usually obeys specific rules, e.g., the links between atoms in molecular graphs are constrained by valence bond theory, while cross-cited papers in citation networks generally share the same topic or authors. Therefore, extensive expert knowledge can be incorporated as a prior into the design of pretext tasks.

Contextual Molecular Property Prediction [76] incorporates domain knowledge about biological macromolecules to design molecule-specific pretext tasks. Given a node viv_{i}, it samples its kk-hop neighborhood nodes and edges as a local subgraph and then extracts statistical properties of this subgraph. Specifically, it counts the number of occurrences of (node, edge) pairs around the center node viv_{i} and then lists all the node-edge count terms in alphabetical order, which constitutes the final property, e.g., C_\_N-DOUBLE1_\_O-SINGLE1 in Fig. 5 (d). With plenty of context-aware properties 𝒫={pk}k=1K\mathcal{P}=\{p_{k}\}_{k=1}^{K} pre-defined, contextual property prediction can be defined as a multi-class prediction problem with one class corresponding to one contextual property, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)vi,yi)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}\Big{)} (75)

where ()\ell(\cdot) denotes the cross entropy loss, and yi=ky_{i}=k if the molecular property of node viv_{i} is pkp_{k}.

Graph-level Motif Prediction [76]. Motifs are recurrent sub-graphs in the input graphs, which are prevalent in molecular graphs. One important class of motifs in molecules is functional groups, which encode the rich domain knowledge of molecules and can be easily detected by professional software, such as RDKit. Suppose we consider the presence of KK motifs {mk}k=1K\{m_{k}\}_{k=1}^{K} in the molecular data; then for one specific molecule graph gi=(𝐀i,𝐗i)𝒢g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}, it detects whether each motif shows up in gig_{i} and uses this as the label 𝐲iK\mathbf{y}_{i}\in\mathbb{R}^{K}. Specifically, if mkm_{k} shows up in gig_{i}, the kk-th element 𝐲i,k\mathbf{y}_{i,k} is set to 1, otherwise 0. Formally, the learning objective of the motif prediction task is formulated as a multi-label classification problem, as follows

ssl(γ,𝒢)=1|𝒢|gi𝒢(fγ(𝐀i,𝐗i),𝐲i)\mathcal{L}_{ssl}\left(\gamma,\mathcal{G}\right)=\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\ell\Big{(}f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}),\mathbf{y}_{i}\Big{)} (76)

where ()\ell(\cdot) denotes the binary cross entropy loss.
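As an illustration (not the exact tooling of [76]), functional-group motif labels for a molecule can be obtained with RDKit substructure matching, assuming the motifs are specified as SMARTS patterns.

# Requires RDKit (e.g., `pip install rdkit`).
from rdkit import Chem

def motif_labels(smiles: str, smarts_motifs):
    # smarts_motifs: list of K SMARTS patterns, one per functional-group motif m_k.
    mol = Chem.MolFromSmiles(smiles)
    return [int(mol.HasSubstructMatch(Chem.MolFromSmarts(p))) for p in smarts_motifs]

# e.g., motif_labels("CCO", ["[OX2H]", "C=O"]) -> [1, 0]  (hydroxyl present, no carbonyl)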

6 Summary of the Implementation

A summary of the surveyed works is presented in Fig. 6, and Appendix D lists their properties, including graph property, pretext task, augmentation, objective function, training strategy, and publication year. Furthermore, we show in Appendix E the implementation details of surveyed works, such as downstream tasks, evaluation metrics, and datasets.

6.1 Downstream Tasks

The graph SSL methods are generally evaluated on three levels of graph tasks: node-level, link-level, and graph-level. Among them, the graph-level tasks are usually performed on multiple graphs in the form of inductive learning. Commonly used graph-level tasks include graph classification and graph regression. The link-level tasks mainly focus on link prediction, that is, given two nodes, the objective is to predict whether a link (edge) exists between them. On the other hand, the node-level tasks are generally performed on a large graph in the form of transductive learning. Depending on whether labels are provided, they can be divided into three categories: node regression, node classification, and node clustering. Node classification and node regression are usually performed with partially known labels. Instead, node clustering is performed in a more challenging unsupervised manner and is adopted when the performance of node classification is not sufficiently distinguishable.

6.2 Evaluation Metrics

For graph classification tasks, the commonly used evaluation metrics include ROC-AUC and Accuracy (Acc); while for graph regression tasks, Mean Absolute Error (MAE) is used. In terms of link prediction tasks, ROC-AP, ROC-PR, and ROC-AUC are usually used as evaluation metrics. Besides, node regression tasks are usually evaluated by metrics including MAE, Mean Square Error (MSE), and Mean Absolute Percentage Error (MAPE). In addition to Accuracy, node classification tasks also adopt F1-score for single-label classification and Micro-F1 (or Macro-F1) for multi-label classification. Moreover, node clustering tasks often adopt the same metrics used to evaluate the unsupervised clustering, such as Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Accuracy, etc.

6.3 Datasets

The statistics of a total of 41 datasets are available in Appendix F. Commonly used datasets for graph self-supervised learning tasks can be divided into five categories: citation networks, social networks, protein networks, molecule graphs, and others. (1) Citation Networks. In citation networks, nodes usually denote papers, node attributes are some keywords in papers, edges denote cross-citation, and categories are topics of papers. Note that nodes in the citation networks may also sometimes indicate authors, institutions, etc. (2) Social Networks. The social network datasets consider entities (e.g., users or authors) as nodes, their interests and hobbies as attributes, and their social interactions as edges. The widely used social network datasets for self-supervised learning are mainly some classical graph datasets, such as Reddit [8], COLLAB [77]. (3) Molecule Graphs. In molecular graphs, nodes represent atoms in the molecule, the atom index is indicated by the node attributes, and edges represent bonds. Molecular graph datasets typically contain multiple graphs and are commonly used for tasks such as graph classification and graph regression, e.g., predicting molecular properties. (4) Protein Networks. The protein networks can be divided into two main categories - Protein Molecule Graph and Protein Interaction Network - based on the way they are modeled. The Protein Molecule Graph is a particular type of molecule graph, where nodes represent amino acids, and an edge indicates the two connected nodes are less than 6 angstroms apart. The commonly used datasets include PROTEINS [78] and D&D [78], used for chemical molecular property prediction. The other branch is Protein Interaction Networks, where nodes denote protein molecules, and edges indicate their interactions. The commonly used dataset is PPI [79], used for graph biological function prediction. (5) Other Graphs. In addition to the four types of datasets mentioned above, there are some datasets that are less common or difficult to categorize, such as image, traffic, and co-purchase datasets.

6.4 Codes in Github

The open-source codes are beneficial to the development of the deep learning community. A summary of the open-source codes of 71 surveyed works is presented in Appendix G, where we provide hyperlinks to their open-source codes. Most of these methods are implemented on GPUs based on Pytorch or Tensorflow libraries. Moreover, we have created a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning to summarize the latest advances in graph SSL, which will be updated in real-time as more papers and their codes become available.

6.5 Experimental Study

To make a fair comparison, we select two important downstream tasks, node classification and graph classification, and provide the classification performance of various classical algorithms on 15 commonly used graph datasets in Appendix H. Due to space limitations, please refer to the appendix for more experimental results and analysis.

7 Discussion

We begin by discussing and summarizing the connections and development among various methods. To present a clearer picture of the development lineage of graph SSL methods, we provide a complete timeline in Appendix I, listing the publication dates of some key milestones. Besides, we provide the inheritance connections between methods to show how they were developed. Furthermore, we provide short descriptions of the contributions of several seminal works to highlight their importance.

DGI [9] is a pioneering work for graph contrastive learning, originally designed specifically for node classification on attribute graphs and later extended to other types of graphs, resulting in new variants such as HDGI [31] for heterogeneous graphs, STDGI [28] for spatial-temporal graphs, DMGI [80] for multiplex graphs, and BiGI [37] for bipartite graphs. Besides, InfoGraph [49] extends DGI to global-context contrasting, achieving state-of-the-art performance on multiple graph classification datasets. With the focus shifting from local nodes and global graphs to subgraphs, GCC [38] proposes the first subgraph-level context-context contrasting framework, where subgraphs sampled from the same graph are considered a positive pair.

Different from DGI, GRACE [26] focuses on contrasting views at the node level by generating multiple augmented graphs through handcrafted augmentations and then encouraging consistency between the same nodes in different views. GCA [40] adopts a similar framework to GRACE but focuses on designing an adaptive augmentation strategy. Similarly, GROC [43] argues that gradient information can be used to guide data augmentation and proposes a gradient-based graph topology augmentation that further improves the performance of GRACE and GCA. BGRL [27] adopts the same local-local contrasting as GRACE but, inspired by BYOL [81], explores for the first time whether negative samples are indispensable for graph contrastive learning.

The focus on three different levels (nodes, edges, and structures) has led to different generative methods such as Graph Completion [12], Edge Feature Masking [11], and Adjacency Matrix Reconstruction [14], respectively. Moreover, due to their simplicity and effectiveness, these methods have been widely used in algorithms such as GPT-GNN [32] and in recommendation applications such as Pretrain-Recsys [82]. In terms of predictive methods, the basic difference between methods lies in how the pseudo-labels are obtained, and there are three main approaches: (1) numerical statistics, such as Node Property Prediction and PairwiseAttrSim [10]; (2) prediction results from the previous training stage, such as CAGNN [74] and M3S [75]; (3) domain knowledge, such as Molecular Property Prediction and Global Motif Prediction [76].

Discussion on Pros and Cons. Next, we discuss the advantages and disadvantages of some classical algorithms from four aspects: innovation, accessibility, effectiveness (performance), and efficiency, based on which we divide existing algorithms into four categories. (1) Pioneering work. Representative works include DGI [9], InfoGraph [49], HDGI [31], STDGI [28], etc. These methods are the first to apply contrastive learning to a novel downstream task or graph type, showing promising innovations and inspiring much follow-up research. However, as early attempts, they often perform relatively poorly on downstream tasks and incur high computational complexity compared to subsequent works. (2) Knowledge-based work. Some works combine self-supervised learning techniques with prior knowledge to obtain excellent performance on downstream tasks. For example, Molecular Property Prediction [76] incorporates domain knowledge, i.e., molecular properties, while LCC [45] introduces label information into the computation of supervision signals. These knowledge-based methods usually achieve fairly good performance thanks to the additional information, but this also limits their applicability to other tasks and graph types. (3) Performance-oriented work. Some works build on existing work and pursue state-of-the-art performance on a variety of datasets; representative works include KS2L [83], BGRL [27], and SUGAR [84]. While these works achieve state-of-the-art performance on a variety of downstream tasks, they are largely incremental contributions to previous seminal work (e.g., DGI) and are relatively weak in innovation and accessibility. (4) Generative and predictive work. Most generative and predictive methods are less effective than contrastive methods, but they are generally very simple to implement, easy to combine with existing frameworks, have lower computational complexity, and exhibit better applicability and efficiency.

Pretext Tasks for Complex Types of Graphs. Most existing work focuses on the design of pretext tasks for attribute graphs, with little effort devoted to more complex graph types, such as spatial-temporal and heterogeneous graphs. Moreover, these pretext tasks usually utilize only node-level or graph-level features, limiting their ability to exploit richer information, such as temporal information in spatial-temporal graphs and relation information in heterogeneous graphs. As a result, the design of more suitable pretext tasks can be considered from three aspects: (1) designing graph type-specific pretext tasks that adaptively pick the most suitable task depending on the type of graph; (2) incorporating temporal or heterogeneous information (in the form of prior knowledge) into the pretext task design; (3) taking the automated design of pretext tasks as a new research topic from the perspective of automated machine learning.

Lack of Theoretical Foundation. Despite the great success of graph SSL on various tasks, most methods simply draw on the successful experience of SSL in the CV and NLP domains. In other words, most existing graph SSL methods are designed by intuition, and their performance gains are evaluated only by empirical experiments. The lack of a sufficient theoretical foundation behind these designs has led to both performance bottlenecks and poor explainability. Therefore, we believe that building a solid theoretical foundation for graph SSL from a graph-theory perspective and minimizing the gap between theoretical foundations and empirical designs is a promising future direction. For example, an important question for graph SSL is whether mutual information maximization is the only means to achieve graph contrastive learning. Such questions have been explored in [54] for image data, but a corresponding analysis for the graph domain is not yet available. In Sec. 3.4, in addition to MI estimators such as InfoNCE, we introduced some contrastive objectives that are not based on mutual information, such as the triplet margin and quadruplet losses; however, the theoretical connection between these losses and mutual information remains to be explored.

Insufficient Augmentation Strategies. Recent advances in visual representation learning [2, 3] are mainly attributed to a variety of data augmentation strategies, such as resizing, rotation, coloring, etc. However, due to the inherent non-Euclidean nature of graph data, it is difficult to directly apply existing image-based augmentations to graphs. Moreover, most augmentation strategies on graphs are limited to adding/removing nodes and edges, or combinations thereof, to achieve the claimed state-of-the-art performance. To further improve the performance of SSL on graphs, designing more effective augmentation strategies is a promising direction. More importantly, the design of an augmentation strategy should follow well-founded guidelines rather than relying entirely on subjective intuition. In summary, we argue that the design of graph augmentations should be based on the following four guidelines: (1) Applicability: graph augmentation should ideally be a plug-and-play module that can be easily combined with existing self-supervised learning frameworks. (2) Adaptability: some works [10, 27] have pointed out that different datasets and task types may require different augmentations, so designing data-specific and task-specific augmentation strategies is a potential research topic. (3) Efficiency: data augmentation should be a lightweight module that does not impose a heavy computational burden on the original implementation. (4) Dynamics: the augmentation strategy should be able to be updated dynamically as training proceeds.
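As a minimal illustration of the two simplest strategies mentioned above, the sketch below applies random edge dropping and attribute masking to a toy dense adjacency matrix; the function names and drop rates are illustrative choices rather than the recipe of any particular paper.

```python
# Illustrative sketch of two common graph augmentations: edge dropping and attribute masking.
import torch

def drop_edges(adj: torch.Tensor, drop_rate: float = 0.2) -> torch.Tensor:
    """Randomly remove a fraction of existing (undirected) edges from a dense adjacency matrix."""
    mask = (torch.rand_like(adj) > drop_rate).float()
    mask = torch.triu(mask, diagonal=1)              # decide each undirected edge once
    mask = mask + mask.t() + torch.eye(adj.size(0))  # symmetrize and keep the diagonal
    return adj * mask

def mask_attributes(x: torch.Tensor, mask_rate: float = 0.2) -> torch.Tensor:
    """Randomly zero out whole feature dimensions, as in attribute-masking augmentations."""
    keep = (torch.rand(x.size(1)) > mask_rate).float()
    return x * keep.unsqueeze(0)

# Hypothetical toy graph: 4 nodes, 5-dimensional attributes.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 1],
                    [1, 1, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
x = torch.randn(4, 5)
adj_aug, x_aug = drop_edges(adj), mask_attributes(x)
```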

Inefficient Negative Sampling Strategies. The selection of high-quality negative samples is a crucial issue. The most common strategy is uniform sampling, which has been shown to yield largely uninformative negatives [85, 86, 87]. The problem of how to better obtain negative samples has been well explored in computer vision. For example, [85] presents debiased contrastive learning, which directly corrects the sampling bias of negative samples. Besides, [86] takes advantage of mixup techniques to directly synthesize hard negative samples in the embedding space. Moreover, [87] develops a family of unsupervised sampling strategies for user-controllable negative sample selection. Despite their success, these methods are specifically designed for image data and may be difficult to apply directly to non-Euclidean graph data. More importantly, accurately identifying hard negative samples becomes more difficult when label information is unavailable. Therefore, how to reduce the gap between ideal and practical contrastive learning through a decent negative sampling strategy requires more exploration.
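The sketch below contrasts plain uniform negative sampling with a mixup-style synthesis of harder negatives in the embedding space, loosely inspired by the idea in [86]; it is a simplified illustration on random embeddings, not a faithful reimplementation of any of the cited methods.

```python
# Simplified sketch: uniform negative sampling vs. mixup-style hard-negative synthesis
# in the embedding space (an illustration, not the authors' implementation).
import torch

def uniform_negatives(embeddings: torch.Tensor, anchor_idx: int, k: int) -> torch.Tensor:
    """Sample k negatives uniformly from all nodes other than the anchor."""
    candidates = torch.tensor([i for i in range(embeddings.size(0)) if i != anchor_idx])
    idx = candidates[torch.randperm(candidates.size(0))[:k]]
    return embeddings[idx]

def mixup_hard_negatives(negatives: torch.Tensor, num_synthetic: int = 4) -> torch.Tensor:
    """Synthesize extra negatives by convexly mixing random pairs of existing negatives."""
    i = torch.randint(0, negatives.size(0), (num_synthetic,))
    j = torch.randint(0, negatives.size(0), (num_synthetic,))
    lam = torch.rand(num_synthetic, 1)
    return lam * negatives[i] + (1.0 - lam) * negatives[j]

emb = torch.nn.functional.normalize(torch.randn(32, 16), dim=1)  # hypothetical node embeddings
neg = uniform_negatives(emb, anchor_idx=0, k=8)
neg_all = torch.cat([neg, mixup_hard_negatives(neg)], dim=0)     # enlarged negative set
```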

Lack of Explainability. Though existing graph SSL methods have achieved excellent results on various downstream tasks, we still do not know exactly what has been learned from self-supervised pretext tasks. Which of the feature patterns, significant structures, or feature-structure relationships has been learned by self-supervision? Is this learning explicit or implicit? Is it possible to find interpretable correspondences on the input data? These are important issues for understanding and interpreting model behavior but are missing in current graph SSL works. Therefore, we need to explore the interpretability of graph SSL and perform a deep analysis of model behavior to improve the generalization and robustness of existing methods for security- or privacy-related downstream tasks.

Gap between Pre-training and Downstream Tasks. Pre-training with self-supervised tasks and then using the pre-trained model for specific downstream tasks, either by fine-tuning or by freezing the weights, is a common training strategy in graph SSL [82, 88, 89]. However, how should we transfer the pre-trained knowledge to downstream tasks? Although numerous strategies have been proposed to address this problem in the CV and NLP domains [90], they are difficult to apply directly to graphs due to their inherent non-Euclidean structure. Therefore, designing graph-specific techniques to minimize the gap between pre-training and downstream tasks is an important open issue.

8 Conclusion

This paper presents a comprehensive survey of the literature on graph self-supervised learning techniques. We develop a unified mathematical framework for graph SSL. Moreover, we summarize the implementation details of each work and show their similarities and differences. More importantly, this is the first survey to provide a detailed experimental study of graph self-supervised learning, setting the stage for the future development of graph SSL. Finally, we point out the technical limitations of current research and provide promising directions for future work on graph SSL. We hope this survey will inspire follow-up researchers to pay attention to other important but easily overlooked aspects, such as theoretical foundations and explainability, in addition to model performance on downstream tasks.

A. Contrastive Methods

In this section, we continue with additional contrastive methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works. These include (1) methods that are essentially the same as frameworks already presented in the main paper, with only minor differences; (2) methods that have not been peer-reviewed and are only publicly available on platforms such as arXiv or OpenReview; (3) application-oriented methods, where graph SSL is only one of the adopted techniques and is not the focus of the work; (4) methods with only minor performance improvements on benchmark datasets.

A.1 Global-Global Contrasting

Iterative Graph Self-Distillation (IGSD) [33] is a graph self-supervised learning paradigm that iteratively performs teacher-student distillation with graph augmentations. Specifically, given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it performs a series of augmentation transformations $\mathcal{T}(\cdot)$ to generate an augmented graph $\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i})$. Then two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$, together with a READOUT function, are applied to obtain the graph-level representations $\mathbf{h}_{g_{i}}=\text{READOUT}\left(f_{\theta_{1}}(\mathbf{A}_{i},\mathbf{X}_{i})\right)$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}=\text{READOUT}(f_{\theta_{2}}(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i}))$, respectively. Moreover, a prediction head $g_{\omega}(\cdot)$ is used in the student network to obtain $g_{\omega}(\mathbf{h}_{g_{i}})$ and $g_{\omega}(\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})$. To contrast the representations $\mathbf{h}_{g_{i}}$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}$, the symmetric consistency loss is defined as

$$\mathcal{D}\left(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right)=\left\|g_{\omega}(\mathbf{h}_{g_{i}})-\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right\|^{2}+\left\|g_{\omega}(\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})-\mathbf{h}_{g_{i}}\right\|^{2} \quad (77)$$

Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\omega}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\log\frac{\exp\left(\mathcal{D}(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})\right)}{\sum_{j=1}^{N}\exp\left(\mathcal{D}(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{j}})\right)} \quad (78)$$

where the negative samples to contrast with $\mathbf{h}_{g_{i}}$ are $Neg(\mathbf{h}_{g_{i}})=\{\widetilde{\mathbf{h}}_{\widetilde{g}_{j}}\}_{g_{j}\in\mathcal{G},j\neq i}$. The parameters $\theta_{2}$ of the teacher network $f_{\theta_{2}}(\cdot)$ are updated as an exponential moving average (EMA) of the student network parameters $\theta_{1}$, as follows

$$\theta_{2}\longleftarrow\alpha\theta_{2}+(1-\alpha)\theta_{1} \quad (79)$$

where $\alpha\in[0,1)$ is the momentum weight to control the evolving speed of $\theta_{2}$.
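A minimal sketch of the symmetric consistency loss in Eq. (77) and the EMA update in Eq. (79) is given below; the linear layers stand in for the actual graph encoders plus READOUT, so the code only illustrates the training mechanics, not the IGSD architecture itself.

```python
# Sketch of the symmetric consistency loss (Eq. 77) and the EMA teacher update (Eq. 79);
# the encoders and predictor here are toy stand-ins for f_{theta_1}, f_{theta_2}, g_omega.
import torch
import torch.nn as nn

student = nn.Linear(16, 16)                      # stands in for f_{theta_1} + READOUT
teacher = nn.Linear(16, 16)                      # stands in for f_{theta_2} + READOUT
predictor = nn.Linear(16, 16)                    # prediction head g_omega
teacher.load_state_dict(student.state_dict())

def consistency_loss(h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
    # D(h, h~) = ||g(h) - h~||^2 + ||g(h~) - h||^2   (Eq. 77), averaged over the batch
    return ((predictor(h_student) - h_teacher.detach()) ** 2).sum(dim=-1).mean() + \
           ((predictor(h_teacher.detach()) - h_student) ** 2).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(alpha: float = 0.99) -> None:
    # theta_2 <- alpha * theta_2 + (1 - alpha) * theta_1   (Eq. 79)
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_((1.0 - alpha) * p_s)

g = torch.randn(8, 16)                           # hypothetical batch of graph-level inputs
loss = consistency_loss(student(g), teacher(g))
loss.backward()
ema_update()
```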

Domain-Agnostic Contrastive Learning (DACL) [91] uses Mixup noise to create positive and negative examples by mixing data samples either in the input or the embedding space. Specifically, given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it applies a graph encoder $f_{\theta}(\cdot)$ and a READOUT function to obtain the graph-level representation $\mathbf{h}_{g_{i}}=\text{READOUT}(f_{\theta}(\mathbf{A}_{i},\mathbf{X}_{i}))$. It then performs mixup transformations to generate two mixed views that form a positive pair, e.g., $\mathbf{h}^{(1)}_{g_{i}}=\lambda_{1}\mathbf{h}_{g_{i}}+(1-\lambda_{1})\mathbf{h}_{g_{m}}$ and $\mathbf{h}^{(2)}_{g_{i}}=\lambda_{2}\mathbf{h}_{g_{i}}+(1-\lambda_{2})\mathbf{h}_{g_{n}}$, where $\lambda_{1}$ and $\lambda_{2}$ are sampled from a Gaussian distribution and $\mathbf{h}_{g_{m}}$ and $\mathbf{h}_{g_{n}}$ are randomly sampled from $\{\mathbf{h}_{g_{k}}\}_{k=1,k\neq i}^{N}$. Finally, the learning objective is defined as

$$\max_{\theta,\omega}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(g_{\omega}(\mathbf{h}^{(1)}_{g_{i}}),g_{\omega}(\mathbf{h}^{(2)}_{g_{i}})\right) \quad (80)$$

where $g_{\omega}(\cdot)$ is a nonlinear prediction head, and the negative samples to contrast with $g_{\omega}(\mathbf{h}^{(1)}_{g_{i}})$ are $Neg\left(g_{\omega}(\mathbf{h}^{(1)}_{g_{i}})\right)=\{g_{\omega}(\mathbf{h}^{(c)}_{g_{j}})\}_{j\neq i;\,c=1,2}$.
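The embedding-space mixup can be sketched as follows; for simplicity the mixing coefficients are drawn uniformly rather than from the Gaussian used in the paper, and random permutations stand in for the sampled graphs $\mathbf{h}_{g_{m}}$ and $\mathbf{h}_{g_{n}}$.

```python
# Sketch of DACL-style mixup in the embedding space: two mixed views of the same graph
# embedding serve as a positive pair (simplified illustration).
import torch

def mixup_views(h: torch.Tensor):
    """Create two mixed views h^(1), h^(2) for every graph embedding in the batch."""
    n = h.size(0)
    lam1, lam2 = torch.rand(n, 1), torch.rand(n, 1)
    perm1, perm2 = torch.randperm(n), torch.randperm(n)
    view1 = lam1 * h + (1 - lam1) * h[perm1]     # h^(1)_gi = lam1*h_gi + (1-lam1)*h_gm
    view2 = lam2 * h + (1 - lam2) * h[perm2]     # h^(2)_gi = lam2*h_gi + (1-lam2)*h_gn
    return view1, view2

h_graphs = torch.randn(16, 32)                   # hypothetical batch of graph-level embeddings
v1, v2 = mixup_views(h_graphs)                   # positive pair fed to the contrastive objective
```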

A.2 Local-Local Contrasting

KS2L [83] is a novel self-supervised knowledge distillation framework with two complementary intra- and cross-model knowledge distillation modules. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies two linear mapping functions to obtain $\mathbf{Z}^{(1)}=g_{\omega_{1}}(\mathbf{X})$ and $\mathbf{Z}^{(2)}=g_{\omega_{2}}(\mathbf{X})$ and then uses two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$ to obtain the node embedding matrices $\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A},\mathbf{Z}^{(1)})$ and $\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A},\mathbf{Z}^{(2)})$. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2},\omega_{1},\omega_{2}}\frac{1}{2|\mathcal{E}|}\sum_{e_{i,j}\in\mathcal{E}}\left[\mathcal{MI}\left(\mathbf{h}_{i}^{(1)},\mathbf{z}_{j}^{(1)}\right)+\mathcal{MI}\left(\mathbf{h}_{i}^{(1)},\mathbf{z}_{j}^{(2)}\right)\right] \quad (81)$$

where the first term is used for intra-model knowledge distillation, and the negative samples to contrast with $\mathbf{h}_{i}^{(1)}$ are $Neg(\mathbf{h}_{i}^{(1)})=\{\mathbf{z}_{k}^{(1)}\}_{e_{i,k}\notin\mathcal{E}}$. The second term is used for cross-model knowledge distillation, and the negative samples to contrast with $\mathbf{h}_{i}^{(1)}$ are $Neg(\mathbf{h}_{i}^{(1)})=\{\mathbf{z}_{k}^{(2)}\}_{e_{i,k}\notin\mathcal{E}}$.

$\text{CG}^{3}$ [92] adopts a similar framework to GRACE [26] but performs the same label-level contrasting as LCC [45]. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies a localized graph convolution encoder $f_{\theta_{1}}(\cdot)$ and a hierarchical graph convolution encoder $f_{\theta_{2}}(\cdot)$ to generate the node embedding matrices $\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A},\mathbf{X})$ and $\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A},\mathbf{X})$. To incorporate the scarce yet valuable label information for training, it uses a supervised label contrastive loss, as follows:

$$\max_{\theta}\frac{1}{2N}\sum_{i=1}^{N}\left[\mathcal{L}^{(1)}_{lc}+\mathcal{L}^{(2)}_{lc}\right] \quad (82)$$

where $\mathcal{L}^{(1)}_{lc}$ and $\mathcal{L}^{(2)}_{lc}$ are defined as

$$\mathcal{L}^{(1)}_{lc}=\log\frac{\sum_{k=1}^{m}\mathbb{I}_{y_{k}=y_{i}}\cdot\exp\left(\mathbf{h}_{i}^{(1)}\cdot\mathbf{h}_{k}^{(2)}\right)}{\sum_{j=1}^{m}\exp\left(\mathbf{h}_{i}^{(1)}\cdot\mathbf{h}_{j}^{(2)}\right)} \quad (83)$$

$$\mathcal{L}^{(2)}_{lc}=\log\frac{\sum_{k=1}^{m}\mathbb{I}_{y_{k}=y_{i}}\cdot\exp\left(\mathbf{h}_{i}^{(2)}\cdot\mathbf{h}_{k}^{(1)}\right)}{\sum_{j=1}^{m}\exp\left(\mathbf{h}_{i}^{(2)}\cdot\mathbf{h}_{j}^{(1)}\right)} \quad (84)$$

where $m$ is the number of labeled nodes, and $\mathbb{I}_{y_{k}=y_{i}}$ is an indicator function that determines whether the label of node $v_{k}$ is the same as that of node $v_{i}$.
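A minimal sketch of the label-contrastive terms in Eqs. (83)-(84) is given below, assuming the embeddings of the $m$ labeled nodes from the two encoders are already available; the maximization objective is written as a loss by negation.

```python
# Minimal sketch of the supervised label-contrastive terms in Eqs. (83)-(84): for a labeled
# node i, embeddings of nodes sharing its label (from the other view) act as positives.
import torch

def label_contrastive_loss(h1: torch.Tensor, h2: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """h1, h2: [m, d] embeddings of the m labeled nodes from the two encoders; y: [m] labels."""
    sim12 = torch.exp(h1 @ h2.t())               # pairwise exp(h_i^(1) . h_j^(2))
    sim21 = torch.exp(h2 @ h1.t())
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    l1 = torch.log((sim12 * same).sum(dim=1) / sim12.sum(dim=1))   # Eq. (83)
    l2 = torch.log((sim21 * same).sum(dim=1) / sim21.sum(dim=1))   # Eq. (84)
    return -(l1 + l2).mean() / 2                 # maximization in Eq. (82), written as a loss

h1, h2 = torch.randn(10, 8), torch.randn(10, 8)  # hypothetical embeddings of 10 labeled nodes
y = torch.randint(0, 3, (10,))
loss = label_contrastive_loss(h1, h2, y)
```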

PT-DGNN [89] is proposed to perform pre-training on a dynamic graph $g=(\mathbf{A}^{(t)},\mathbf{X})$, where $\mathbf{A}^{(t)}_{i,j}=1$ denotes an interaction between nodes $v_{i}$ and $v_{j}$ at time $t$, that is, $t$ is the timestamp of edge $e_{i,j}$. It first applies dynamic subgraph sampling (DySS) to obtain a subgraph $g_{S}=(\mathbf{A}_{s},\mathbf{X}_{s})$ for pre-training. DySS mainly includes three steps: 1) randomly select $s$ nodes as the initial sampling points; 2) put the first-order neighbors of these nodes into the candidate pool and save their timestamps as weights to calculate the sampling probability; 3) select the final $s$ nodes according to the sampling probability. Furthermore, it generates an augmented graph $g_{a}=(\mathbf{A}_{a},\mathbf{X}_{a})=\mathcal{T}(\mathbf{A}_{s},\mathbf{X}_{s})$ by performing attribute masking and edge perturbation, where the masked node and edge sets are denoted as $\mathcal{S}$ and $\mathcal{M}$, respectively. Then a shared graph encoder $f_{\theta}(\cdot)$ is applied to obtain the node embedding matrix $\mathbf{H}=f_{\theta}(\mathbf{A}_{a},\mathbf{X}_{a})$. Finally, the objective of dynamic edge generation is defined as

$$\max_{\theta}\frac{1}{|\mathcal{V}_{a}|}\sum_{v_{i}\in\mathcal{V}_{a}}\sum_{e_{i,j}\in\mathcal{M}}\log\frac{\exp\left(\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})\right)}{\sum_{v_{k}\in\mathcal{V}_{a}}\exp\left(\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{k})\right)} \quad (85)$$

where $\mathcal{V}_{a}$ is the node set of graph $g_{a}$. The learning objective of node attribute generation is defined as follows

$$\min_{\theta,\omega}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\big\|g_{\omega}(\mathbf{h}_{i})-\mathbf{x}_{i}\big\|^{2} \quad (86)$$

where $g_{\omega}(\cdot)$ is a node-level nonlinear prediction head.

COAD [93] applies graph SSL to expert linking, which aims at linking any external information of persons to experts in AMiner. Specifically, it first pre-trains the model by local-local contrastive learning with the triplet margin loss and then fine-tunes the model by adversarial learning to improve the model transferability.

Contrast-Reg [29] is a lightweight local-local contrastive regularization term that adopts the InfoNCE loss to contrast the representation similarities of semantically similar (positive) node pairs against those of negative pairs. Extensive theoretical analysis demonstrates that Contrast-Reg avoids high node representation norms and high variance among them, thereby improving generalization performance.

C-SWM [94] models structured environments, such as multi-object systems, as graphs and then utilizes a local-local contrastive approach to perform representation learning from environment interactions without supervision.

A.3 Local-Global Contrasting

High-order Deep Multiplex Infomax (HDMI) [30] is proposed to achieve higher-order mutual information maximization. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies an augmentation transformation $\mathcal{T}(\cdot)$ to obtain $\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X})$. Then a shared graph encoder $f_{\theta}(\cdot)$ is applied to obtain the node embedding matrices $\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})$ and $\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})$. Besides, a $\operatorname{READOUT}(\cdot)$ function is used to obtain the graph-level representation $\mathbf{h}_{g}=\operatorname{READOUT}(\mathbf{H})$. Finally, the learning objective is defined as follows

$$\max_{\theta,\lambda}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left[\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i}\right)+\mathcal{MI}\left(\mathbf{x}_{i},\mathbf{h}_{i}\right)+\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{m}_{i}\right)\right] \quad (87)$$

where $\mathbf{m}_{i}=\sigma(\mathbf{W}_{m}[\sigma(\mathbf{W}_{h}\mathbf{x}_{i});\sigma(\mathbf{W}_{g}\mathbf{h}_{g})])$, and $\lambda=(\mathbf{W}_{m},\mathbf{W}_{h},\mathbf{W}_{g})$ are the weight parameters. The negative sample to contrast with $\mathbf{h}_{g}$ in the first term is $Neg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}_{i}$; the negative sample to contrast with $\mathbf{x}_{i}$ in the second term is $Neg(\mathbf{x}_{i})=\widetilde{\mathbf{h}}_{i}$; and the negative sample to contrast with $\mathbf{h}_{i}$ in the third term is $Neg(\mathbf{h}_{i})=\widetilde{\mathbf{m}}_{i}=\sigma(\mathbf{W}_{m}[\sigma(\mathbf{W}_{h}\widetilde{\mathbf{x}}_{i});\sigma(\mathbf{W}_{g}\mathbf{h}_{g})])$.

Deep Multiplex Graph Infomax (DMGI) [80] extends the idea of DGI to multiplex graphs, where nodes are connected by multiple types of relations. For each relation type $r\in\mathcal{R}$ (corresponding to a relation graph $g^{(r)}=(\mathbf{A}^{(r)},\mathbf{X})$), a relation-specific graph encoder $f_{\theta_{r}}(\cdot)$ is applied to obtain the relation-specific node embedding matrix $\mathbf{H}^{(r)}=f_{\theta_{r}}(\mathbf{A}^{(r)},\mathbf{X})$. Besides, it employs a $\operatorname{READOUT}$ function to obtain the graph-level representation $\mathbf{h}_{g^{(r)}}=\operatorname{READOUT}(\mathbf{H}^{(r)})$. Finally, it independently maximizes the mutual information between the node embeddings $\mathbf{H}^{(r)}=\left\{\mathbf{h}_{1}^{(r)},\mathbf{h}_{2}^{(r)},\ldots,\mathbf{h}_{N}^{(r)}\right\}$ and the graph-level summary $\mathbf{h}_{g^{(r)}}$ pertaining to each graph $g^{(r)}$, defined as

$$\max_{\theta_{r}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{g^{(r)}},\mathbf{h}^{(r)}_{i}\right),\quad r\in\mathcal{R} \quad (88)$$

where the negative samples to contrast with $\mathbf{h}_{g^{(r)}}$ are $Neg(\mathbf{h}_{g^{(r)}})=\{\mathbf{h}_{j}^{(r)}\}_{v_{j}\in\mathcal{V},j\neq i}$. Similarly, it can learn another embedding matrix from the augmented graph $\widetilde{g}^{(r)}=(\widetilde{\mathbf{A}}^{(r)},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A}^{(r)},\mathbf{X})$ and also maximize the MI between the node embeddings $\widetilde{\mathbf{H}}^{(r)}=\left\{\widetilde{\mathbf{h}}_{1}^{(r)},\widetilde{\mathbf{h}}_{2}^{(r)},\ldots,\widetilde{\mathbf{h}}^{(r)}_{N}\right\}$ and the graph-level summary $\widetilde{\mathbf{h}}_{\widetilde{g}^{(r)}}$. However, as each $\mathbf{H}^{(r)}$ is trained independently for each $r\in\mathcal{R}$, these embedding matrices may fail to take advantage of the multiplexity of the network. Therefore, DMGI learns an additional consensus embedding matrix $\mathbf{Z}$ on which the relation-specific node embedding matrices $\mathbf{H}^{(r)}$ and $\widetilde{\mathbf{H}}^{(r)}$ can agree with each other by optimizing the following learning objective

$$\ell_{\mathrm{cs}}=\Big[\mathbf{Z}-\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\mathbf{H}^{(r)}\Big]^{2}-\Big[\mathbf{Z}-\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\widetilde{\mathbf{H}}^{(r)}\Big]^{2} \quad (89)$$

where $\mathbf{Z}$ is defined as a set of trainable parameters.

HDGI [31]. A meta-path of length $l$ is a sequence of nodes connected by heterogeneous edges, i.e., $\Phi:v_{1}\stackrel{t_{1}}{\longrightarrow}v_{2}\stackrel{t_{2}}{\longrightarrow}\ldots\stackrel{t_{l}}{\longrightarrow}v_{l}$, where $t_{l}\in\mathcal{T}^{e}$ denotes the type of the $l$-th edge in the meta-path. Given a meta-path $\Phi$, if there exists a meta-path $\Phi$ between node $v_{i}$ and node $v_{j}$, then $v_{i}$ and $v_{j}$ are defined as connected neighbors based on the meta-path $\Phi$, thus yielding a meta-path based adjacency matrix $\mathbf{A}^{\Phi}$. Given a meta-path set $\{\Phi_{k}\}_{k=1}^{K}$, we can obtain $K$ meta-path based adjacency matrices $\left\{\mathbf{A}^{\Phi_{k}}\right\}_{k=1}^{K}$. HDGI first applies an augmentation to obtain $\widetilde{\mathbf{X}},\{\widetilde{\mathbf{A}}^{\Phi_{k}}\}_{k=1}^{K}=\mathcal{T}(\mathbf{X},\{\mathbf{A}^{\Phi_{k}}\}_{k=1}^{K})$. Then $K$ graph encoders are applied to obtain the node embedding matrices $\mathbf{H}^{(k)}=f_{\theta_{k}}(\mathbf{A}^{\Phi_{k}},\mathbf{X})$ and $\widetilde{\mathbf{H}}^{(k)}=f_{\theta_{k}}(\widetilde{\mathbf{A}}^{\Phi_{k}},\widetilde{\mathbf{X}})$ $(1\leq k\leq K)$. To obtain more general node representations, an attention mechanism is applied to fuse these representations as $\mathbf{H}^{fuse}=\sum_{k=1}^{K}s_{k}\mathbf{H}^{(k)}$, where the attention scores $\{s_{k}\}_{k=1}^{K}$ are defined as follows

$$s_{k}=\frac{\exp(e^{(k)})}{\sum_{k^{\prime}=1}^{K}\exp(e^{(k^{\prime})})},\quad e^{(k)}=\frac{1}{N}\sum_{i=1}^{N}\tanh(\mathbf{q}^{T}[\mathbf{W}\mathbf{h}^{(k)}_{i}+\mathbf{b}]) \quad (90)$$

where $1\leq k\leq K$, $\mathbf{W}$ and $\mathbf{b}$ are the shared weight matrix and shared bias vector, and $\mathbf{q}$ is a shared attention vector. Similarly, it can obtain the fused representation $\widetilde{\mathbf{H}}^{fuse}=\sum_{k=1}^{K}\widetilde{s}_{k}\widetilde{\mathbf{H}}^{(k)}$. Besides, a $\operatorname{READOUT}(\cdot)$ function is applied to obtain the global-level representation $\mathbf{h}_{g}=\operatorname{READOUT}(\mathbf{H}^{fuse})$. The objective is defined as:

$$\max_{\theta_{1},\theta_{2},\cdots,\theta_{K},\gamma}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}^{fuse}_{i}\right) \quad (91)$$

where $\gamma=(\mathbf{W},\mathbf{b},\mathbf{q})$ are the weight parameters, and the negative sample to contrast with $\mathbf{h}_{g}$ is $Neg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}^{fuse}_{i}$.
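The attention-based fusion of Eq. (90) can be sketched as follows, with toy dimensions and random tensors standing in for the learned parameters $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{q}$.

```python
# Sketch of the meta-path attention fusion in Eq. (90): per-meta-path embeddings are scored
# with a shared attention vector and combined into H^{fuse}.
import torch

def fuse_metapath_embeddings(H_list, W, b, q):
    """H_list: list of K [N, d] node-embedding matrices; W: [d, d]; b: [d]; q: [d]."""
    scores = []
    for H in H_list:
        e_k = torch.tanh(H @ W + b) @ q              # q^T [W h_i^(k) + b] for every node i
        scores.append(e_k.mean())                    # e^(k): average over the N nodes
    s = torch.softmax(torch.stack(scores), dim=0)    # s_k: softmax over the K meta-paths
    return sum(s[k] * H_list[k] for k in range(len(H_list)))

N, d, K = 6, 8, 3
H_list = [torch.randn(N, d) for _ in range(K)]       # hypothetical per-meta-path embeddings
W, b, q = torch.randn(d, d), torch.randn(d), torch.randn(d)
H_fuse = fuse_metapath_embeddings(H_list, W, b, q)   # fused [N, d] representation
```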

DITNet [95] proposes an end-to-end model to predict drug-target interactions on heterogeneous graphs. Specifically, it learns high-quality representations for downstream tasks by performing local-global and context-global contrasting.

A.4 Local-Context Contrasting

Context Prediction [11] is proposed to map nodes appearing in similar structural contexts to similar embeddings. It first picks an anchor node set $\mathcal{S}$ from $\mathcal{V}$. Given an anchor node $v_{i}\in\mathcal{S}$, it defines its $K$-hop neighborhood graph $g_{i}^{(1)}=(\mathbf{A}_{i}^{(1)},\mathbf{X}_{i}^{(1)})$ as all nodes and edges that are at most $K$ hops away from $v_{i}$. The context graph is a subgraph $g_{i}^{(2)}=(\mathbf{A}_{i}^{(2)},\mathbf{X}_{i}^{(2)})$ that lies between $r_{1}$ hops (with $r_{1}<K$) and $r_{2}$ hops away from $v_{i}$ (i.e., it is a ring of width $r_{2}-r_{1}$). Two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$ are then applied to encode the two graphs into node embedding matrices $\mathbf{H}_{i}^{(1)}=f_{\theta_{1}}(\mathbf{A}_{i}^{(1)},\mathbf{X}_{i}^{(1)})$ and $\mathbf{H}_{i}^{(2)}=f_{\theta_{2}}(\mathbf{A}_{i}^{(2)},\mathbf{X}_{i}^{(2)})$. Besides, a $\operatorname{READOUT}(\cdot)$ function is further applied to obtain the subgraph-level representation, as follows

$$\mathbf{h}^{(g)}_{i}=\frac{1}{|\mathcal{M}_{i}|}\sum_{v_{j}\in\mathcal{M}_{i}}(\mathbf{H}_{i}^{(2)})_{v_{j}} \quad (92)$$

where $\mathcal{M}_{i}$ is the set of nodes shared between the neighborhood graph $g_{i}^{(1)}$ and the context graph $g_{i}^{(2)}$, referred to as context anchor nodes. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2}}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\mathcal{MI}\left((\mathbf{H}^{(1)}_{i})_{v_{i}},\mathbf{h}_{i}^{(g)}\right) \quad (93)$$

where $(\mathbf{H}^{(1)}_{i})_{v_{i}}$ is the representation of the anchor node $v_{i}$ in the node embedding matrix $\mathbf{H}^{(1)}_{i}$. The negative samples to contrast with $(\mathbf{H}^{(1)}_{i})_{v_{i}}$ are $Neg((\mathbf{H}^{(1)}_{i})_{v_{i}})=\{\mathbf{h}_{j}^{(g)}\}_{v_{j}\in\mathcal{S},j\neq i}$.

GraphLoG [96] is a unified graph SSL framework. In addition to local-local contrasting (similar to GRACE [26]) and global-global contrasting (similar to GraphCL [25]), GraphLoG leverages RPCL [97]-based clustering, e.g., $K$-means clustering, to learn hierarchical prototypes of graph data and then performs local-context contrasting (similar to GIC [48]) for each hierarchical layer.

MHCN [98] adopts a DGI-like auxiliary task to enhance social recommendation by leveraging high-order user relations. However, considering that the DGI-like local-global contrasting [9] stays at a coarse level and there is no guarantee that the encoder can distill sufficient information from the input data, MHCN extends the DGI to a fine-grained local-context contrasting by exploiting the hierarchical structure in the input hypergraphs.

EGI [36] considers transfer learning on graphs, i.e., the pre-training of GNNs. Specifically, unlike DGI, which models local-global mutual information, EGI samples a set of ego-subgraphs and then directly maximizes the local-context MI between the structural input and output of the GNN, with a particular focus on structural information.

A.5 Context-Global Contrasting

SUGAR [84] is a novel hierarchical subgraph-level framework that encourages subgraph embeddings to be mindful of global structural properties by maximizing their mutual information. For a given $g=(\mathbf{A},\mathbf{X})\in\mathcal{G}$, it first applies an augmentation transformation to obtain $\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X})$. Then it samples subgraphs from each graph and collects them into a subgraph candidate pool. Besides, a shared graph encoder $f_{\theta}(\cdot)$ and a READOUT function are applied to encode these subgraphs into subgraph-level representations $\{\mathbf{h}_{1},\mathbf{h}_{2},\cdots\}$ and $\{\widetilde{\mathbf{h}}_{1},\widetilde{\mathbf{h}}_{2},\cdots\}$. Next, $K$ striking subgraphs are selected from the candidate pool by a reinforcement learning module and pooled into a sketched graph, where each node $v_{k}$ $(1\leq k\leq K)$ corresponds to a subgraph with embedding $\mathbf{h}_{k}$ for graph $g$ ($\widetilde{\mathbf{h}}_{k}$ for graph $\widetilde{g}$). Finally, the learning objective is defined as follows

$$\max_{\theta}\frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\sum_{k=1}^{K}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{k}\right) \quad (94)$$

where $\mathbf{h}_{g}=\text{READOUT}(\{\mathbf{h}_{k}\}_{k=1}^{K})$ and the negative samples to contrast with $\mathbf{h}_{g}$ are $Neg(\mathbf{h}_{g})=\{\widetilde{\mathbf{h}}_{j}\}_{j=1}^{K}$.

Head-Tail Contrastive (HTC) [99] is proposed to enhance the graph-level representations learned by GNNs. Given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it first applies an augmentation transformation $\mathcal{T}(\cdot)$ to obtain $\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i})$. Then a shared graph encoder $f_{\theta_{1}}(\cdot)$ is applied to $g_{i}$ and $\widetilde{g}_{i}$ to obtain the node embedding matrices $\mathbf{H}_{i}$ and $\widetilde{\mathbf{H}}_{i}$, respectively. Besides, a $\text{READOUT}(\cdot)$ function is used to obtain the graph-level representations $\mathbf{h}_{g_{i}}$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}$. Moreover, HTC samples $S$ subsets $\{\mathbf{H}_{i}^{(1)},\mathbf{H}_{i}^{(2)},\cdots,\mathbf{H}_{i}^{(S)}\}$ from $\mathbf{H}_{i}$, which correspond to the subgraphs $\{g_{i}^{(1)},g_{i}^{(2)},\cdots,g_{i}^{(S)}\}$. Then HTC calculates a subgraph-level representation by

$$\overline{\mathbf{h}}_{g_{i}}=f_{\theta_{2}}\Big(\big[\mathbf{h}_{g_{i}^{(1)}};\mathbf{h}_{g_{i}^{(2)}};\cdots;\mathbf{h}_{g_{i}^{(S)}}\big]\Big) \quad (95)$$

where $\mathbf{h}_{g_{i}^{(s)}}=\operatorname{READOUT}(\mathbf{H}_{i}^{(s)})$ $(1\leq s\leq S)$ and $f_{\theta_{2}}(\cdot)$ is a convolution kernel of size $(S,1)$. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2}}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(\mathbf{h}_{g_{i}},\overline{\mathbf{h}}_{g_{i}}\right) \quad (96)$$

where the negative samples to contrast with $\mathbf{h}_{g_{i}}$ are $Neg(\mathbf{h}_{g_{i}})=\{\mathbf{h}_{g_{j}}\}_{g_{j}\in\mathcal{G},j\neq i}\cup\{\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\}$.

B. Generative Methods

In this section, we continue with additional generative methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works.

B.1 Graph Autoencoding

Node Attribute Masking [10]. This task is similar to Graph Completion [12], but it masks and reconstructs the features of multiple nodes simultaneously, and it no longer requires that the neighboring node features used for message passing be unmasked. It first randomly masks (i.e., sets equal to zero) the features of a node set $\mathcal{S}$. Specifically, it obtains a masked node feature matrix $\widehat{\mathbf{X}}$ where $\widehat{\mathbf{x}}_{i}=\mathbf{x}_{i}\odot\mathbf{m}_{i}$ for $v_{i}\in\mathcal{S}$, and then asks the model to reconstruct these masked features. More formally,

$$\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} \quad (97)$$
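A minimal sketch of this pretext task is given below; the two-layer mean-aggregation encoder is a toy stand-in for $f_{\theta}$, and only the masked rows enter the reconstruction loss of Eq. (97).

```python
# Sketch of the node-attribute-masking pretext task (Eq. 97): mask the features of a random
# node subset S, encode the masked graph, and regress the original features.
import torch
import torch.nn as nn

def gcn_layer(adj, x, weight):
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return (adj @ x) / deg @ weight              # simple mean-aggregation message passing

N, d = 6, 8
adj = (torch.rand(N, N) > 0.5).float(); adj = ((adj + adj.t()) > 0).float()  # toy symmetric graph
x = torch.randn(N, d)

mask_idx = torch.randperm(N)[: N // 3]           # the masked node set S
x_masked = x.clone(); x_masked[mask_idx] = 0.0   # x_hat_i = x_i masked to zero

W1, W2 = nn.Parameter(torch.randn(d, 16)), nn.Parameter(torch.randn(16, d))
h = torch.relu(gcn_layer(adj, x_masked, W1))     # toy stand-in for f_theta
x_rec = gcn_layer(adj, h, W2)

loss = ((x_rec[mask_idx] - x[mask_idx]) ** 2).sum(dim=1).mean()   # Eq. (97)
loss.backward()
```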

G-BERT [88] combines the power of GNNs and BERT to perform representation learning for medication recommendation. Specifically, G-BERT uses two BERT-like self-supervised pretext tasks: self-prediction and dual-prediction. Self-prediction takes the learned embedding as input to reconstruct the masked medical codes of the same type, while dual-prediction takes one type of embedding as input and tries to reconstruct the other type.

SLAPS [100] is a graph learning framework that takes node attribute reconstruction as the self-supervised pretext task to infer a task-specific latent structure and then applies a graph neural network to the inferred graph structure.

To tackle the cold-start problem for recommendation tasks, Pretrain-Recsys [82] pre-trains the model by simulating the cold-start scenarios. Specifically, it applies an attention-based meta aggregator to reduce the impact from the cold-start neighbors and takes the target embedding reconstruction as the self-supervised pretext task.

Graph-Bert [15] is a novel graph neural network that pre-trains the model with the self-supervised node attribute reconstruction and structure recovery tasks, and then transfers the model to other downstream tasks directly or with necessary fine-tuning if any supervised label information is available.

C. Predictive Methods

In this section, we continue with additional predictive methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works.

C.1 Context-based Prediction (CP)

PairwiseDistance [10]. The PairwiseDistance pretext task aims to guide the model to preserve global topology information by predicting the shortest path length between node pairs. More specifically, it first randomly samples a certain number of node pairs $\mathcal{S}$ from all node pairs $\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in\mathcal{V}\}$ and calculates the pairwise shortest path length $d_{i,j}=d(v_{i},v_{j})$ for each node pair $(v_{i},v_{j})\in\mathcal{S}$. Furthermore, it groups the shortest path lengths into four categories: $C_{i,j}=0$, $C_{i,j}=1$, $C_{i,j}=2$, and $C_{i,j}=3$, corresponding to $d_{i,j}=1$, $d_{i,j}=2$, $d_{i,j}=3$, and $d_{i,j}\geq 4$, respectively. The learning objective can then be formulated as a multi-class classification problem, as follows

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=\frac{1}{|\mathcal{S}|}\sum_{(v_{i},v_{j})\in\mathcal{S}}\ell\Big(f_{\omega}\big(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|\big),C_{i,j}\Big) \quad (98)$$

where $\ell(\cdot)$ denotes the cross-entropy loss and $f_{\omega}(\cdot)$ linearly maps the input to the prediction over the four distance classes. Compared with $\text{S}^{2}\text{GRL}$ [59], PairwiseDistance truncates shortest paths longer than 4, mainly to avoid excessive computational burden and to prevent very noisy ultra-long pairwise distances from dominating the optimization.
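The sketch below illustrates the PairwiseDistance pipeline on a small synthetic graph (using networkx for shortest paths); the random embeddings and the linear classifier are toy stand-ins for $f_{\theta}$ and $f_{\omega}$.

```python
# Sketch of PairwiseDistance: bucket shortest-path lengths between sampled node pairs into
# four classes and train a classifier on |h_i - h_j| (toy embeddings and classifier).
import networkx as nx
import torch
import torch.nn as nn

g = nx.connected_watts_strogatz_graph(20, 4, 0.3, seed=0)  # hypothetical connected graph
pairs = [(0, 5), (1, 7), (2, 9), (3, 11)]                  # sampled node pairs S

def distance_class(u, v):
    d = nx.shortest_path_length(g, u, v)
    return min(d, 4) - 1                                   # d=1,2,3 -> class 0,1,2; d>=4 -> class 3

labels = torch.tensor([distance_class(u, v) for (u, v) in pairs])
emb = torch.randn(20, 16, requires_grad=True)              # stands in for f_theta(A, X)
clf = nn.Linear(16, 4)                                     # f_omega: |h_i - h_j| -> 4 class logits

feats = torch.stack([(emb[u] - emb[v]).abs() for (u, v) in pairs])
loss = nn.functional.cross_entropy(clf(feats), labels)     # Eq. (98)
loss.backward()
```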

EdgeMask [10]. Apart from masking edge features, we can also mask edge connections; instead of reconstructing the entire adjacency matrix directly, EdgeMask takes link prediction as the pretext task. More specifically, it first masks $m$ edges $\mathcal{M}\subset\mathcal{E}$ and samples $m$ node pairs $\overline{\mathcal{M}}\subset\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in\mathcal{V}\ \text{and}\ (v_{i},v_{j})\notin\mathcal{E}\}$. Then, the learning objective of EdgeMask is to predict whether there exists a link between a given node pair. More formally,

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=\frac{1}{2m}\Big(\sum_{(v_{i},v_{j})\in\mathcal{M}}\ell\big(f_{\omega}(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|),1\big)+\sum_{(v_{i},v_{j})\in\overline{\mathcal{M}}}\ell\big(f_{\omega}(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|),0\big)\Big) \quad (99)$$

where $\ell(\cdot)$ denotes the cross-entropy loss and $f_{\omega}(\cdot)$ linearly maps the input to a 1-dimensional value. The EdgeMask task helps the GNN learn more local structural information.
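A corresponding sketch of EdgeMask, with toy embeddings standing in for $f_{\theta}(\mathbf{A},\mathbf{X})$ on the edge-masked graph, is given below.

```python
# Sketch of the EdgeMask pretext task (Eq. 99): hide m existing edges, sample m non-edges,
# and classify whether a link exists from |h_i - h_j|.
import torch
import torch.nn as nn

masked_edges = [(0, 1), (2, 3)]                  # M: edges removed from the graph
non_edges = [(0, 3), (1, 4)]                     # M_bar: sampled disconnected pairs

emb = torch.randn(5, 16, requires_grad=True)     # stands in for f_theta(A, X) on the masked graph
clf = nn.Linear(16, 1)                           # f_omega: |h_i - h_j| -> link logit

def pair_logits(pairs):
    return clf(torch.stack([(emb[i] - emb[j]).abs() for i, j in pairs])).squeeze(-1)

logits = torch.cat([pair_logits(masked_edges), pair_logits(non_edges)])
targets = torch.cat([torch.ones(len(masked_edges)), torch.zeros(len(non_edges))])
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
```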

TopoTER [101] aims to maximize the mutual information between topology transformations and node representations before and after the transformations, which can be relaxed to minimizing the cross entropy between the applied topology transformation types and their estimations from node representations. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first randomly samples a subset $\mathcal{S}_{1}$ of $M$ connected edges and a subset $\mathcal{S}_{2}$ of $M$ disconnected node pairs. Then it randomly removes $r\cdot M$ edges from $\mathcal{S}_{1}$ and adds $r\cdot M$ edges to $\mathcal{S}_{2}$, where $r$ is the edge perturbation rate, to obtain a topology-perturbed graph $\widetilde{g}=(\widetilde{\mathbf{A}},\mathbf{X})$. Next, it applies a shared graph encoder $f_{\theta}(\cdot)$ to obtain the node embedding matrices $\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})$ and $\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\mathbf{X})$. Finally, a mapping function $g_{\omega}(\cdot)$ is applied to the difference $\Delta\mathbf{H}=\mathbf{H}-\widetilde{\mathbf{H}}$ to predict the transformation type $\widehat{\mathbf{Y}}_{i,j}=g_{\omega}(\Delta\mathbf{H}_{i},\Delta\mathbf{H}_{j})\in\mathbb{R}^{4}$ for any $e_{i,j}\in\mathcal{S}_{1}\bigcup\mathcal{S}_{2}$. The learning objective of TopoTER is then defined as

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=-\sum_{e_{i,j}\in\mathcal{S}_{1}\bigcup\mathcal{S}_{2}}\sum_{c=0}^{3}\mathbf{Y}^{(c)}_{i,j}\log(\widehat{\mathbf{Y}}^{(c)}_{i,j}) \quad (100)$$

where $c$ indexes the four transformation types: $c=0$, add an edge to a disconnected node pair in $\mathcal{S}_{2}$; $c=1$, remove an edge from a connected node pair in $\mathcal{S}_{1}$; $c=2$, keep the connection between node pairs in $\mathcal{S}_{1}$; $c=3$, keep the disconnection between node pairs in $\mathcal{S}_{2}$. $\mathbf{Y}^{(c)}_{i,j}$ is the ground-truth binary indicator (0 or 1) for each transformation type.

Centrality Score Ranking [17]. Node centrality is an important metric for graphs, which measures the importance of nodes based on their structural roles in the whole graph. However, unlike node properties, centrality scores are not comparable across graphs of different scales. Therefore, this task instead ranks the relative orders between nodes and uses them as pseudo-labels. Specifically, four different centrality scores $\mathcal{S}=\{$eigencentrality, betweenness, closeness, subgraph centrality$\}$ are used. For a given centrality score $s\in\mathcal{S}$ and a node pair $(v_{i},v_{j})$ with relative order $\mathbf{Y}_{i,j}^{s}=\mathbb{I}(s_{i}>s_{j})$, a mapping function $f_{\theta^{(s)}}(\cdot)$ is applied to estimate the rank scores $\mathbf{r}^{s}=f_{\theta^{(s)}}(\mathbf{A},\mathbf{X})$, where $\mathbf{r}^{s}_{i}$ denotes the rank score of node $v_{i}$. The probability of the estimated rank order is defined as

$$\widehat{\mathbf{Y}}_{i,j}^{s}=\frac{\exp(\mathbf{r}^{s}_{i}-\mathbf{r}^{s}_{j})}{1+\exp(\mathbf{r}^{s}_{i}-\mathbf{r}^{s}_{j})} \quad (101)$$

The learning objective of this pretext task is defined as

$$\mathcal{L}_{ssl}\left(\{\theta^{(s)}\}_{s=1}^{4},\mathbf{A},\mathbf{X}\right)=-\frac{1}{4N^{2}}\sum_{s\in\mathcal{S}}\sum_{v_{i},v_{j}\in\mathcal{V}}\Big[\mathbf{Y}^{s}_{i,j}\log(\widehat{\mathbf{Y}}^{s}_{i,j})+(1-\mathbf{Y}^{s}_{i,j})\log(1-\widehat{\mathbf{Y}}^{s}_{i,j})\Big] \quad (102)$$
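The sketch below illustrates the ranking objective of Eqs. (101)-(102) for a single centrality score on the Karate Club graph; the random embeddings and the linear rank head are toy stand-ins for $f_{\theta^{(s)}}$.

```python
# Sketch of the centrality-ranking pretext task (Eqs. 101-102) for one centrality score:
# predict the relative order of node pairs from estimated rank scores.
import networkx as nx
import torch
import torch.nn as nn

g = nx.karate_club_graph()
scores = nx.betweenness_centrality(g)            # one of the four centrality measures
N = g.number_of_nodes()

emb = torch.randn(N, 16, requires_grad=True)     # stands in for intermediate node representations
rank_head = nn.Linear(16, 1)                     # produces the rank score r^s_i
r = rank_head(emb).squeeze(-1)

pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
y = torch.tensor([1.0 if scores[i] > scores[j] else 0.0 for i, j in pairs])
p = torch.sigmoid(torch.stack([r[i] - r[j] for i, j in pairs]))   # Eq. (101)
loss = nn.functional.binary_cross_entropy(p, y)  # Eq. (102), restricted to one centrality score
loss.backward()
```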

ContextLabel [10]. Different from Distance2Labeled, which uses relative position as the self-supervised signal, the ContextLabel pretext task constructs a local label distribution for each node and then asks the model to regress these distributions. It first assigns labels to unlabeled nodes with the Label Propagation (LP) algorithm [102] and then defines the label distribution $\mathbf{y}_{i}$ for node $v_{i}$ within its $k$-hop neighborhood, where the $c$-th element of the label distribution vector, $\mathbf{y}_{i,c}$, is defined as

$$\mathbf{y}_{i,c}=\frac{\left|\mathcal{N}_{i}^{\mathcal{V}_{L}}(c)\right|+\left|\mathcal{N}_{i}^{\mathcal{V}_{U}}(c)\right|}{\left|\mathcal{N}_{i}^{\mathcal{V}_{L}}\right|+\left|\mathcal{N}_{i}^{\mathcal{V}_{U}}\right|},\quad c=1,2,\cdots,C \quad (103)$$

where $\mathcal{N}_{i}^{\mathcal{V}_{L}}$ and $\mathcal{N}_{i}^{\mathcal{V}_{U}}$ are the labeled and unlabeled nodes within the $k$-hop neighborhood of node $v_{i}$, respectively. $\mathcal{N}_{i}^{\mathcal{V}_{L}}(c)$ denotes the subset of the labeled neighborhood with ground-truth label $c$, and $\mathcal{N}_{i}^{\mathcal{V}_{U}}(c)$ denotes the subset of the unlabeled neighborhood assigned label $c$ by the Label Propagation algorithm. The learning objective is defined as

$$\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big\|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{y}_{i}\Big\|^{2} \quad (104)$$

In addition to Label Propagation, there are numerous algorithms that can be used to compute pseudo-labels, such as the Iterative Classification Algorithm (ICA) [103], and even a combination of LP and ICA for better performance.
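The construction of the context label distribution in Eq. (103) can be sketched as follows; the random labels stand in for the ground-truth labels and the pseudo-labels that Label Propagation would produce.

```python
# Sketch of the ContextLabel distribution (Eq. 103): for each node, build the label
# distribution over its k-hop neighborhood (pseudo-labels are random placeholders, not LP).
import networkx as nx
import numpy as np

g = nx.karate_club_graph()
N, C, k = g.number_of_nodes(), 3, 2
rng = np.random.default_rng(0)
labels = rng.integers(0, C, size=N)              # stands in for ground-truth + propagated labels

def context_label_distribution(node):
    hop_dict = nx.single_source_shortest_path_length(g, node, cutoff=k)
    neigh = [v for v, d in hop_dict.items() if d > 0]        # k-hop neighborhood of the node
    counts = np.bincount(labels[neigh], minlength=C).astype(float)
    return counts / max(len(neigh), 1)                       # y_{i,c} in Eq. (103)

targets = np.stack([context_label_distribution(v) for v in g.nodes()])  # regression targets
```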

Motivated by the observation that anomalous nodes differ from normal nodes in both structure and attributes, the Hop-Count based Model (HCM) [106] proposes to use global context prediction as a self-supervised task to model both local and global contextual information, thereby achieving better anomaly detection on attributed networks.

C.2 Self-Training (ST)

SEF [104] presents a framework that first trains a graph attention network model with pseudo labels obtained from unsupervised Louvain clustering [72] and then uses the learned edge attention coefficients as self-supervised edge features. Besides, it encodes the learned edge features via a Set Transformer and combines them with node features for node classification in an end-to-end training manner.

C.3 Domain Knowledge-based Prediction (DK)

DrRepair [105] considers the problem of learning to repair programs from diagnostic feedback by building a program feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback. Besides, it leverages unlabeled programs available online to create a large amount of extra program repair examples, which are used to pre-train a graph neural network for program repair.

D. Summary of Surveyed Works

A summary of all the surveyed works (a total of 71 methods) is presented in Table. A1 and Table. A2, including graph property, pretext task type, augmentation strategy, objective function, training strategy, and the year of publication.

E. Summary of Implementation Details

The implementation details of all the surveyed works are presented in Table. A3 and Table. A4, such as the (node/link/graph) level of downstream tasks, evaluation metrics with specific tasks, and the commonly used datasets.

F. Summary of Common Datasets

A summary and statistics of common graph datasets is shown in Table. A5, including category, graph number, node number per graph, edge number per graph, dimensionality of node attributes, class number and citation papers.

G. Summary of Open-source Codes

A summary of the open-source codes of all the surveyed works is presented in Table. A6, where we provide hyperlinks to their open-source codes, and those for which no open-source code is found are indicated by “N.A.”. Moreover, we have created a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning to summarize the latest advances in graph SSL, which will be updated in real-time as more papers and their codes become available.

TABLE A1: A summary of the surveyed papers. The summary is organized based on the underlying motivation behind pretext task design.
Methods Graph Property Pretext Task Data Augmentation Objective Function Training Strategy Year
Graph Completion [12] Attributed Generative/AE Attribute Masking MAE P&F/JL 2020
Node Attribute
Masking [10]
Attributed Generative/AE Attribute Masking MAE P&F/JL 2020
Edge Attribute
Masking [11]
Attributed Generative/AE Attribute Masking MAE P&F 2019
Node Attribute
Denoising [13]
Attributed Generative/AE Attribute Masking MAE JL 2020
Adjacency Matrix
Reconstruction [14]
Attributed Generative/AE
Attribute Masking
Edge Perturbation
MAE JL 2020
Graph Bert [15] Attributed Generative/AE
Attribute Masking
Edge Perturbation
MAE P&F 2020
Pretrain-Recsys [82] Attributed Generative/AE Edge Perturbation MAE P&F 2021
GPT-GNN [32] Heterogeneous Generative/AR
Attribute Masking
Edge Perturbation
MAE/InfoNCE P&F 2020
GraphCL [25] Attributed Contrastive/G-G
Attribute Masking
Edge Perturbation
Random Walk Sampling
InfoNCE URL 2020
IGSD [33] Attributed Contrastive/G-G
Edge Perturbation
Edge Diffusion
InfoNCE JL/URL 2020
DACL [91] Attributed Contrastive/G-G Mixup InfoNCE URL 2020
LCC [45] Attributed Contrastive/G-G None InfoNCE JL 2021
CSSL [34] Attributed Contrastive/G-G
Node Insertion
Edge Perturbation
Uniform Sampling
InfoNCE P&F/JL/URL 2020
GCC [38] Unattributed Contrastive/C-C Random Walk Sampling InfoNCE P&F/URL 2020
GRACE [26] Attributed Contrastive/L-L
Attribute Masking
Edge Perturbation
InfoNCE URL 2020
GCA [40] Attributed Contrastive/L-L Attention-based InfoNCE URL 2020
GROC [43] Attributed Contrastive/L-L Gradient-based InfoNCE URL 2021
SEPT [46] Attributed Contrastive/L-L Edge Perturbation InfoNCE JL 2021
STDGI [28] Spatial-Temporal Contrastive/L-L Attribute Shuffling JS Estimator URL 2019
GMI [16] Attributed Contrastive/L-L None SP Estimator URL 2020
KS2L [83] Attributed Contrastive/L-L None InfoNCE URL 2020
$\text{CG}^{3}$ [92] Attributed Contrastive/L-L None InfoNCE JL 2020
BGRL [27] Attributed Contrastive/L-L
Attribute Masking
Edge Perturbation
Inner Product URL 2021
TABLE A2: A summary of the surveyed papers. The summary is organized based on the underlying motivation behind pretext task design.
Methods Graph Property Pretext Task Data Augmentation Objective Function Training Strategy Year
SelfGNN [35] Attributed Contrastive/L-L
Attribute Masking
Edge Diffusion
MSE URL 2021
HeCo [47] Heterogeneous Contrastive/L-L None InfoNCE URL 2021
PT-DGNN [89] Dynamic Contrastive/L-L
Attribute Masking
Edge Perturbation
InfoNCE P&F 2021
COAD [93] Attributed Contrastive/L-L None Triplet Margin Loss P&F 2020
Contrast-Reg [29] Attributed Contrastive/L-L Attribute Shuffling InfoNCE JL 2021
DGI [9] Attributed Contrastive/L-G Arbitrary JS Estimator URL 2019
HDMI [30] Attributed Contrastive/L-G Attribute Shuffling JS Estimator URL 2021
DMGI [80] Heterogeneous Contrastive/L-G Attribute Shuffling
JS Estimator
MAE
URL 2020
MVGRL [18] Attributed Contrastive/L-G
Attribute Masking
Edge Perturbation
Edge Diffusion
Random Walk Sampling
DV Estimator
JS Estimator
NT-Xent
InfoNCE
URL 2020
HDGI [31] Heterogeneous Contrastive/L-G Attribute Shuffling JS Estimator URL 2019
Subg-Con [19] Attributed Contrastive/L-C Importance Sampling Triplet Margin Loss URL 2020
Context Prediction [11] Attributed Contrastive/L-C Ego-nets Sampling Cross Entropy P&F 2019
GIC [48] Attributed Contrastive/L-C Arbitrary JS Estimator URL 2020
GraphLoG [96] Attributed
Contrastive/L-L
Contrastive/G-G
Contrastive/L-C
Attribute Masking InfoNCE URL 2021
MHCN [98] Heterogeneous Contrastive/L-C Attribute Shuffling InfoNCE JL 2021
EGI [36] Attributed Contrastive/L-C Ego-nets Sampling SP Estimator P&F 2020
MICRO-Graph [39] Attributed Contrastive/C-G Knowledge Sampling InfoNCE URL 2020
InfoGraph [49] Attributed Contrastive/C-G None SP Estimator URL 2019
SUGAR [84] Attributed Contrastive/C-G BFS Sampling JS Estimator JL 2021
BiGI [37] Heterogeneous Contrastive/C-G
Edge Perturbation
Ego-nets Sampling
JS Estimator JL 2021
HTC [99] Attributed Contrastive/C-G Attribute Shuffling
SP Estimator
DV Estimator
URL 2021
Node Property
Prediction [10]
Attributed Predictive/NP None MAE P&F/JL 2020
$\text{S}^{2}\text{GRL}$ [59]
Attributed Predictive/CP None Cross Entropy URL 2020
PairwiseDistance [10] Attributed Predictive/CP None Cross Entropy P&F/JL 2020
PairwiseAttrSim [10] Attributed Predictive/CP None MAE P&F/JL 2020
Distance2Cluster [10] Attributed Predictive/CP None MAE P&F/JL 2020
EdgeMask [10] Attributed Predictive/CP None Cross Entropy P&F/JL 2020
TopoTER [101] Attributed Predictive/CP Edge Perturbation Cross Entropy URL 2021
Centrality Score
Ranking [17]
Attributed Predictive/CP None Cross Entropy P&F 2019
Meta-path Prediction [69] Heterogeneous Predictive/CP None Cross Entropy JL 2020
SLiCE [70] Heterogeneous Predictive/CP None Cross Entropy P&F 2020
Distance2Labeled [10] Attributed Predictive/CP None MAE P&F/JL 2020
ContextLabel [10] Attributed Predictive/CP None MAE P&F/JL 2020
HCM [106] Attributed Predictive/CP Edge Perturbation Bayesian Inference URL 2021
Contextual Molecular
Property Prediction [76]
Attributed Predictive/DK None Cross Entropy P&F 2020
Graph-level
Motif Prediction [76]
Attributed Predictive/DK None Cross Entropy P&F 2020
Multi-stage
Self-training [71]
Attributed Predictive/ST None None JL 2018
Node Clustering [12] Attributed Predictive/ST None Clustering P&F/JL 2020
Graph Partitioning [12] Attributed Predictive/ST None Graph Partitioning P&F/JL 2020
CAGNN [74] Attributed Predictive/ST None Clustering URL 2020
M3S [75] Attributed Predictive/ST None Clustering JL 2020
Cluster Preserving [17] Attributed Predictive/ST None Cross Entropy P&F 2019

H. Experimental Study Results

Two downstream tasks, namely node classification and graph classification, and one evaluation metric, namely accuracy, are selected to conduct the experimental study. For node classification, we provide the classification performance of 23 classical methods on 8 commonly used datasets; for graph classification, we select 7 datasets and report the comparison results of 11 representative methods. The results of node classification and graph classification are shown in Table. A7 and Table. A8, respectively, and the best result on each dataset is marked in bold. The metrics are taken directly from the original papers or obtained by retraining with the open-source code. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash "-". Moreover, fixed dataset splits are adopted for all methods to make a fair comparison: (1) the data splits of Cora, Citeseer, and Pubmed are consistent with those in [22]; (2) the data splits of Wiki-CS, Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics are consistent with those in [27]; (3) the data splits of MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, and IMDB-M are consistent with those in [49]; and (4) the data splits of NCI1 are consistent with those in [25].

After analyzing the results in Table A7 and Table A8, we have the following observations: (1) No single method achieves the best performance on all datasets, and the state-of-the-art method on one dataset may be suboptimal on another. This suggests that existing graph self-supervised algorithms are somewhat dependent on the properties of the graph and are not sufficiently general. (2) Algorithms that achieve excellent performance on small-scale datasets (e.g., Cora and Citeseer), such as Graph-Bert [15] and Subg-Con [19], may perform suboptimally on large-scale datasets (e.g., Coauthor-CS and Coauthor-Physics). This encourages follow-up researchers to evaluate model performance on datasets of different scales, as experimental results on small-scale datasets may be biased and therefore less reliable. (3) Most generative and predictive learning methods are specifically designed for node-level tasks such as node classification, yet they are still not comparable to the state-of-the-art contrastive methods. This suggests that the potential of generative and predictive learning needs to be further explored, and that domain knowledge-based learning may be a promising direction. (4) Partly due to the lack of comparison with contemporaneous work, many state-of-the-art methods achieve roughly comparable performance on some datasets, such as Amazon-Computers and Coauthor-Physics, with differences smaller than one percent. The purpose of this experimental study is to provide a solid foundation for follow-up researchers and to facilitate fairer experimental comparisons for graph self-supervised learning.
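
As a complement, the sketch below shows one common way graph-level accuracies of this kind are obtained in the literature: a downstream SVM is fit on frozen graph embeddings and scored with 10-fold cross-validation (an InfoGraph-style protocol). This is a hedged illustration under that assumption, not the protocol of any specific surveyed method; graph_embeddings and graph_labels are hypothetical placeholders.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def svm_cv_accuracy(graph_embeddings, graph_labels, n_splits=10, seed=0):
    # 10-fold stratified cross-validation of a linear SVM trained on frozen
    # graph embeddings; the mean fold accuracy is the kind of number reported
    # in graph classification tables such as Table A8.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = SVC(C=1.0, kernel="linear")
    scores = cross_val_score(clf, graph_embeddings, graph_labels, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()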

I. Timeline

A complete timeline is provided in Fig. A1 to present a clear development lineage of the various graph SSL methods, listing the publication dates (based on their arXiv submission dates) of key milestones. Besides, we provide inheritance connections between methods to show how they were developed. More importantly, we provide short descriptions of the contributions of some pioneering methods to highlight their importance.

TABLE A3: A summary of implementation details about the surveyed papers, including downstream task level, evaluation metric, and dataset.
Methods Task Level Evaluation Metric Dataset
Graph Completion [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Node Attribute Masking [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Reddit
Edge Attribute Masking [11] Graph Graph Classification (ROC-AUC) MUTAG, PTC, PPI, BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
Node Attribute Denoising [13] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Adjacency Matrix Reconstruction [14] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Graph Bert [15] Node Node Classification (Acc), Node Clustering (NMI) Cora, Citeseer, Pubmed
Pretrain-Recsys [82] Node/Link - ML-1M, MOOCs, Last-FM
GPT-GNN [32] Node/Link Node Classification (F1-score), Link Prediction (ROC-AUC) OAG, Amazon, Reddit
GraphCL [25] Graph Graph Classification (Acc, ROC-AUC) NCI1, PROTEINS, D&D, COLLAB, RDT-B, RDT-M5K, GITHUB, MNIST, CIFAR10, MUTAG, IMDB-B, BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE, PPI
IGSD [33] Graph Graph Classification (Acc) MUTAG, PTC, NCI1, IMDB-B, QM9, COLLAB, IMDB-M
DACL [91] Graph Graph Classification (Acc) MUTAG, PTC, IMDB-B, IMDB-M, RDT-B, RDT-M5K
LCC [45] Graph Graph Classification (Acc) IMDB-B, IMDB-M, COLLAB, MUTAG, PROTEINS, PTC, NCI1, D&D
CSSL [34] Graph Graph Classification (Acc) PROTEINS, D&D, NCI1, NCI109, Mutagenicity
GCC [38] Node/Graph Node Classification (Acc), Graph Classification (Acc) US-Airport, H-index, COLLAB, IMDB-B, IMDB-M, RDT-B, RDT-M5K
GRACE [26] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, DBLP, Reddit, PPI
GCA [40] Node Node Classification (Acc) Amazon-Computers, Amazon-Photo, Coauthor-Physics, Coauthor-CS, Wiki-CS
GROC [43] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Photo, Wiki-CS
SEPT [46] Node/Link - Last-FM, Douban, Yelp
STDGI [28] Node Node Regression (MAE, RMSE, MAPE) METR-LA
GMI [16] Node/Link Node Classification (Acc, Micro-F1), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Reddit, PPI, BlogCatalog, Flickr
KS2L [83] Node/Link Node Classification (Acc), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS
CG3 [92] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS
BGRL [27] Node Node Classification (Acc, Micro-F1) Wiki-CS, Amazon-Computers, Amazon-Photo, PPI, Coauthor-CS, Coauthor-Physics, ogbn-arxiv
SelfGNN [35] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics
HeCo [47] Node Node Classification (ROC-AUC, Micro-F1, Macro-F1), Node Clustering (NMI, ARI) ACM, DBLP, Freebase, AMiner
PT-DGNN [89] Link Link Prediction (ROC-AUC) HepPh, Math Overflow, Super User
COAD [93] Node/Link Node Clustering (Precision, Recall, F1-score), Link Prediction (HitRatio@K, MRR) AMiner, News, LinkedIn
Contrast-Reg [29] Node/Link Node Classification (Acc), Node Clustering (NMI, Acc, Macro-F1), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Reddit, ogbn-arxiv, Wikipedia, ogbn-products, Amazon-Computers, Amazon-Photo
DGI [9] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, Reddit, PPI
HDMI [30] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (NMI) ACM, IMDB, DBLP, Amazon
DMGI [80] Node Node Clustering (NMI), Node Classification (Acc) ACM, IMDB, DBLP, Amazon
MVGRL [18] Node/Graph Node Classification (Acc), Node Clustering (NMI, ARI), Graph Classification (Acc) Cora, Citeseer, Pubmed, MUTAG, PTC, IMDB-B, IMDB-M, RDT-B
TABLE A4: A summary of implementation details about the surveyed papers, including downstream task level, evaluation metric, and dataset.
Methods Task Level Evaluation Metric Dataset
HDGI [31] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (NMI, ARI) ACM, DBLP, IMDB
Subg-Con [19] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, PPI, Flickr, Reddit
Context Prediction [11] Graph Graph Classification (ROC-AUC) MUTAG, PTC, PPI, BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
GIC [48] Node/Link Node Classification (Acc), Node Clustering (Acc, NMI, ARI), Link Prediction (ROC-AUC, ROC-AP) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics
GraphLoG [96] Graph Graph Classification (ROC-AUC) BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
MHCN [98] Node/Link - Last-FM, Douban, Yelp
EGI [36] Node/Link Node Classification (Acc), Link Prediction (ROC-AUC, MRR) YAGO, Airport
MICRO-Graph [39] Graph Graph Classification (ROC-AUC) BBBP, Tox21, ToxCast, ClinTox, HIV, SIDER, BACE
InfoGraph [49] Graph Graph Classification (Acc) MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, QM9, IMDB-M
SUGAR [84] Graph Graph Classification (Acc) MUTAG, PTC, PROTEINS, D&D, NCI1, NCI109
BiGI [37] Link Link Prediction (AUC-ROC, AUC-PR) DBLP, ML-100K, ML-1M, Wikipedia
HTC [99] Graph Graph Classification (Acc) MUTAG, PTC, IMDB-B, IMDB-M, RDT-B, QM9, RDT-M5K
Node Property Prediction [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
S2GRL [59] Node/Link Node Classification (Acc, Micro-F1), Node Clustering (NMI), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, PPI, Flickr, BlogCatalog, Reddit
PairwiseDistance [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
PairwiseAttrSim [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Distance2Cluster [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
EdgeMask [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
TopoTER [101] Node/Graph Node Classification (Acc), Graph Classification (Acc) Cora, Citeseer, Pubmed, MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, IMDB-M
Centrality Score Ranking [17] Node/Link/Graph Node Classification (Micro-F1), Link Prediction (Micro-F1), Graph Classification (Micro-F1) Cora, Pubmed, ML-100K, ML-1M, IMDB-M, IMDB-B
Meta-path Prediction [69] Node/Link Node Classification (F1-score), Link Prediction (ROC-AUC) ACM, IMDB, Last-FM, Book-Crossing
SLiCE [70] Link Link Prediction (ROC-AUC, Micro-F1) Amazon, DBLP, Freebase, Twitter, Healthcare
Distance2Labeled [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
ContextLabel [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Reddit
HCM [106] Node Node Classification (ROC-AUC) ACM, Amazon, Enron, BlogCatalog, Flickr
Contextual Molecular Property Prediction [76] Graph Graph Classification (Acc), Graph Regression (MAE) BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast, ESOL, FreeSolv, Lipo, QM7, QM8
Graph-level Motif Prediction [76] Graph Graph Classification (Acc), Graph Regression (MAE) BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast, ESOL, FreeSolv, Lipo, QM7, QM8
Multi-stage Self-training [71] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Node Clustering [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Graph Partitioning [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
CAGNN [74] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (Micro-F1, Macro-F1, NMI) Cora, Citeseer, Pubmed
M3S [75] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Cluster Preserving [17] Node/Link/Graph Node Classification (Micro-F1), Link Prediction (Micro-F1), Graph Classification (Micro-F1) Cora, Pubmed, ML-100K, ML-1M, IMDB-M, IMDB-B
TABLE A5: A summary and statistics of common graph datasets for graph self-supervised learning.
Dataset Category #Graph #Node (Avg.) #Edge (Avg.) #Feature #Class Citation
Cora [103] Citation Network 1 2708 5429 1433 7 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [17], [71], [74], [75], [29], [35], [83], [101]
Citeseer [107] Citation Network 1 3327 4732 3703 6 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [71], [74], [75], [29], [35], [83], [101]
Pubmed [108] Citation Network 1 19717 44338 500 3 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [17], [71], [74], [75], [29], [35], [83], [101]
Wiki-CS [109] Citation Network 1 11701 216123 300 10 [40], [43], [27]
Coauthor-CS [110] Citation Network 1 18333 81894 6805 15 [40], [92], [48], [27], [35], [83]
Coauthor-Physics [110] Citation Network 1 34493 247962 8415 5 [40], [48], [27], [35]
DBLP (v12) Citation Network 1 4894081 45564149 - - [26], [80], [70], [37], [31], [30], [47]
ogbn-arxiv [111] Citation Network 1 169343 1166243 128 40 [27], [29]
Reddit [8] Social Network 1 232965 11606919 602 41 [10], [32], [26], [16], [9], [19], [59], [29]
BlogCatalog [112] Social Network 1 5196 171743 8189 6 [16], [59], [106]
Flickr [112] Social Network 1 7575 239738 12047 9 [16], [19], [59], [106]
COLLAB [77] Social Networks 5000 74.49 2457.78 - 2 [25], [33], [45], [38]
RDT-B [77] Social Networks 2000 429.63 497.75 - 2 [25], [38], [18], [49], [99], [101], [91]
RDT-M5K [77] Social Networks 4999 508.52 594.87 - 5 [25], [38], [49], [99], [101], [91]
IMDB-B [77] Social Networks 1000 19.77 96.53 - 2 [25], [33], [45], [38], [18], [49], [17], [80], [69], [31], [99], [30], [101], [91]
IMDB-M [77] Social Networks 1500 13.00 65.94 - 3 [33], [45], [38], [18], [49], [17], [80], [69], [31], [99], [30], [101], [91]
ML-100K [113] Social Networks 1 2625 100000 - 5 [17], [37]
ML-1M [113] Social Networks 1 9940 1000209 - 5 [17], [37], [82]
PPI [79] Protein Networks 24 56944 818716 50 121 [11], [25], [26], [16], [9], [19], [59], [27]
D&D [78, 114] Protein Networks 1178 284.32 715.65 82 2 [25], [45], [34], [84]
PROTEINS [78, 115] Protein Networks 1113 39.06 72.81 4 2 [25], [45], [34], [84]
NCI1 [116, 77] Molecule Graphs 4110 29.87 32.30 37 2 [25], [33], [45], [34], [84]
MUTAG [117, 118] Molecule Graphs 188 17.93 19.79 7 2 [11], [25], [33], [45], [18], [49], [84], [99], [101], [91]
QM9 (QM7, QM8) [119] Molecule Graphs 133885 - - - - [33], [49], [10], [99]
BBBP [120, 111] Molecule Graphs 2039 24.05 25.94 - 2 [11], [25], [39], [76], [96]
Tox21 [121, 122, 111] Molecule Graphs 7831 18.51 25.94 - 12 [11], [25], [39], [76], [96]
ToxCast [123, 111] Molecule Graphs 8575 18.78 19.26 - 167 [11], [25], [39], [76], [96]
ClinTox [124] Molecule Graphs 1478 26.13 27.86 - 2 [11], [25], [39], [76], [96]
MUV [125] Molecule Graphs 93087 24.23 26.28 - 17 [11], [25], [96]
HIV [111] Molecule Graphs 41127 25.53 27.48 - 2 [11], [25], [39], [96]
SIDER [126] Molecule Graphs 1427 33.64 35.36 - 27 [11], [25], [39], [76], [96]
BACE [127] Molecule Graphs 1513 34.12 36.89 - 2 [11], [25], [39], [76], [96]
PTC [114] Molecule Graphs 344 14.29 14.69 19 2 [33], [18], [49], [99], [101], [91], [11], [45], [84]
NCI109 [116, 77] Molecule Graphs 4127 29.68 32.13 - 2 [34], [84]
Mutagenicity [128, 129] Molecule Graphs 4337 30.32 30.77 - 2 [34]
MNIST [130] Others (Image) - 70000 - 784 10 [25]
CIFAR10 [131] Others (Image) - 60000 - 1024 10 [25]
METR-LA [132] Others (Traffic) 1 207 1515 2 - [28]
Amazon-Computers [110] Others (Purchase) 1 13752 245861 767 10 [40], [92], [48], [32], [70], [27], [106], [30], [29], [35], [83]
Amazon-Photo [110] Others (Purchase) 1 7650 119081 745 8 [40], [43], [92], [48], [32], [70], [27], [106], [30], [29], [35], [83]
ogbn-products [111] Others (Purchase) 1 2449029 61859140 100 47 [29]
TABLE A6: A summary of open-source codes of the surveyed papers.
Methods Github
Graph Completion [12] https://github.com/Shen-Lab/SS-GCNs
Node Attribute Masking [10] https://github.com/ChandlerBang/SelfTask-GNN
Edge Attribute Masking [11] http://snap.stanford.edu/gnn-pretrain
Node Attribute Denoising [13] N.A.
Adjacency Matrix Reconstruction [14] N.A.
Graph Bert [15] https://github.com/anonymous-sourcecode/Graph-Bert
Pretrain-Recsys [82] https://github.com/jerryhao66/Pretrain-Recsys
GPT-GNN [32] https://github.com/acbull/GPT-GNN
GraphCL [25] https://github.com/Shen-Lab/GraphCL
IGSD [33] N.A.
DACL [91] N.A.
LCC [45] https://github.com/YuxiangRen
CSSL [34] N.A.
GCC [38] https://github.com/THUDM/GCC
GRACE [26] https://github.com/CRIPAC-DIG/GRACE
GCA [40] https://github.com/CRIPAC-DIG/GCA
GROC [43] N.A.
SEPT [46] https://github.com/Coder-Yu/QRec
STDGI [28] N.A.
GMI [16] https://github.com/zpeng27/GMI
KS2L [83] N.A.
CG3 [92] N.A.
BGRL [27] N.A.
SelfGNN [35] https://github.com/zekarias-tilahun/SelfGNN
HeCo [47] https://github.com/liun-online/HeCo
PT-DGNN [89] https://github.com/Mobzhang/PT-DGNN
COAD [93] https://github.com/allanchen95/Expert-Linking
Contrast-Reg [29] N.A.
DGI [9] https://github.com/PetarV-/DGI
HDMI [30] N.A.
DMGI [80] https://github.com/pcy1302/DMGI
MVGRL [18] https://github.com/kavehhassani/mvgrl
HDGI [31] https://github.com/YuxiangRen/Heterogeneous-Deep-Graph-Infomax
Subg-Con [19] https://github.com/yzjiao/Subg-Con
Context Prediction [11] http://snap.stanford.edu/gnn-pretrain
GIC [48] https://github.com/cmavro/Graph-InfoClust-GIC
GraphLoG [96] https://openreview.net/forum?id=DAaaaqPv9-q
MHCN [98] https://github.com/Coder-Yu/RecQ
EGI [36] https://openreview.net/forum?id=J_pvI6ap5Mn
MICRO-Graph [39] https://drive.google.com/file/d/1b751rpnV-SDmUJvKZZI-AvpfEa9eHxo9
InfoGraph [49] https://github.com/fanyun-sun/InfoGraph
SUGAR [84] https://github.com/RingBDStack/SUGAR
BiGI [37] https://github.com/clhchtcjj/BiNE
HTC [99] N.A.
Node Property Prediction [10] https://github.com/ChandlerBang/SelfTask-GNN
S2GRL [59] N.A.
PairwiseDistance [10] https://github.com/ChandlerBang/SelfTask-GNN
PairwiseAttrSim [10] https://github.com/ChandlerBang/SelfTask-GNN
Distance2Cluster [10] https://github.com/ChandlerBang/SelfTask-GNN
EdgeMask [10] https://github.com/ChandlerBang/SelfTask-GNN
TopoTER [101] N.A.
Centrality Score Ranking [17] N.A.
Meta-path Prediction [69] https://github.com/mlvlab/SELAR
SLiCE [70] https://github.com/pnnl/SLICE
Distance2Labeled [10] https://github.com/ChandlerBang/SelfTask-GNN
ContextLabel [10] https://github.com/ChandlerBang/SelfTask-GNN
HCM [106] N.A.
Contextual Molecular Property Prediction [76] https://github.com/tencent-ailab/grover
Graph-level Motif Prediction [76] https://github.com/tencent-ailab/grover
Multi-stage Self-training [71] https://github.com/Davidham3/deeper_insights_into_GCNs
Node Clustering [12] https://github.com/Shen-Lab/SS-GCNs
Graph Partitioning [12] https://github.com/Shen-Lab/SS-GCNs
CAGNN [74] N.A.
M3S [75] https://github.com/datake/M3S
Cluster Preserving [17] N.A.
TABLE A7: A summary of node classification accuracy (%) of different methods on 8 commonly used datasets. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash “-”.
Methods Cora Citeseer Pubmed Wiki-CS Amazon-Computers Amazon-Photo Coauthor-CS Coauthor-Physics
GRACE [26] 80.00 71.70 79.50 78.19 87.46 92.15 92.93 95.26
GMI [16] 83.00 72.40 79.90 74.85 82.21 90.68 - -
Adjacency Matrix Reconstruction [14] 83.80 72.95 81.23 - - - - -
Graph Bert [15] 84.30 71.20 79.30 - 87.57 - 92.99 95.41
CG3 [92] 83.40 73.60 80.20 - 88.06 92.21 92.30 -
S2GRL [59] 83.70 72.10 82.40 78.58 88.49 - 92.66 95.24
PairwiseDistance [10] 83.11 71.90 80.05 - - - - -
PairwiseAttrSim [10] 83.05 71.67 79.45 - - - - -
Contrast-Reg [29] 82.65 72.98 80.10 77.13 84.93 91.09 - 94.38
Distance2Cluster [10] 83.55 71.44 79.88 - - - - -
TopoTER [101] 83.70 71.70 79.10 - - - 92.87 95.22
DGI [9] 82.30 71.80 76.80 75.35 83.95 91.61 92.15 94.51
MVGRL [18] 82.90 72.60 79.40 77.52 87.52 91.74 92.11 95.33
Subg-Con [19] 83.50 73.20 81.00 78.69 88.65 92.42 - -
Distance2Labeled [10] 83.39 71.64 79.51 - - - - -
ContextLabel [10] 82.76 72.59 82.31 - - - - -
GIC [48] 81.70 71.90 77.30 - 84.89 92.11 92.51 94.70
Node Property Prediction [10] 81.94 71.60 79.44 - - - - -
GCA [40] - - - 78.35 88.94 92.53 93.10 95.75
M3S [75] 81.60 71.94 79.28 - - - - -
KS2L [83] 84.60 74.20 83.80 - 86.80 92.40 93.30 -
BGRL [27] - - - 79.36 89.68 92.87 93.21 95.56
SelfGNN [35] 85.30 72.30 82.70 - 88.80 93.80 92.90 95.50
TABLE A8: A summary of graph classification accuracy (%) of different methods on 7 commonly used datasets. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash “-”.
Methods MUTAG PTC NCI1 RDT-B RDT-M5K IMDB-B IMDB-M
GraphCL [25] 86.80 - 77.87 89.53 55.99 71.14 51.83
IGSD [33] 90.20 61.40 75.40 - - 74.70 51.50
DACL [91] 87.51 63.59 - 85.11 53.20 73.98 50.78
LCC [45] 90.50 65.90 82.90 - - 76.10 52.40
CSSL [34] 88.95 - 80.09 84.21 53.76 - -
GCC [38] - - - 87.80 53.00 75.60 50.90
MVGRL [18] 89.70 62.50 - 84.50 54.03 74.20 51.20
InfoGraph [49] 89.01 61.65 - 82.50 53.46 73.03 49.69
SUGAR [84] 96.74 77.53 84.39 - - - -
HTC [99] 91.80 65.50 - 91.10 55.70 73.30 50.60
TopoTER [101] 89.25 64.59 - 84.93 55.52 73.46 49.68
Figure A1: Key milestones in the development of graph self-supervised learning. The generative, contrastive, and predictive learning methods are marked with red, orange, and green blocks, respectively. Best viewed in color.

References

  • [1] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” arXiv preprint arXiv:2006.08218, vol. 1, no. 2, 2020.
  • [2] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [4] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
  • [5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” arXiv preprint arXiv:1706.02216, 2017.
  • [9] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax.” in ICLR (Poster), 2019.
  • [10] W. Jin, T. Derr, H. Liu, Y. Wang, S. Wang, Z. Liu, and J. Tang, “Self-supervised learning on graphs: Deep insights and new direction,” arXiv preprint arXiv:2006.10141, 2020.
  • [11] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec, “Strategies for pre-training graph neural networks,” arXiv preprint arXiv:1905.12265, 2019.
  • [12] Y. You, T. Chen, Z. Wang, and Y. Shen, “When does self-supervision help graph convolutional networks?” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 871–10 880.
  • [13] F. Manessi and A. Rozza, “Graph-based neural network models with multiple self-supervised auxiliary tasks,” arXiv preprint arXiv:2011.07267, 2020.
  • [14] Q. Zhu, B. Du, and P. Yan, “Self-supervised training of graph convolutional networks,” arXiv preprint arXiv:2006.02380, 2020.
  • [15] J. Zhang, H. Zhang, C. Xia, and L. Sun, “Graph-bert: Only attention is needed for learning graph representations,” arXiv preprint arXiv:2001.05140, 2020.
  • [16] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, “Graph representation learning via graphical mutual information maximization,” in Proceedings of The Web Conference 2020, 2020, pp. 259–270.
  • [17] Z. Hu, C. Fan, T. Chen, K.-W. Chang, and Y. Sun, “Pre-training graph neural networks for generic structural feature extraction,” arXiv preprint arXiv:1905.13728, 2019.
  • [18] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in International Conference on Machine Learning.   PMLR, 2020, pp. 4116–4126.
  • [19] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu, “Sub-graph contrast for scalable self-supervised graph representation learning,” arXiv preprint arXiv:2009.10273, 2020.
  • [20] Y. Xie, Z. Xu, Z. Wang, and S. Ji, “Self-supervised learning of graph neural networks: A unified review,” arXiv preprint arXiv:2102.10757, 2021.
  • [21] Y. Liu, S. Pan, M. Jin, C. Zhou, F. Xia, and P. S. Yu, “Graph self-supervised learning: A survey,” arXiv preprint arXiv:2103.00111, 2021.
  • [22] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [24] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, 2020.
  • [25] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [26] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Deep graph contrastive representation learning,” arXiv preprint arXiv:2006.04131, 2020.
  • [27] S. Thakoor, C. Tallec, M. G. Azar, R. Munos, P. Veličković, and M. Valko, “Bootstrapped representation learning on graphs,” arXiv preprint arXiv:2102.06514, 2021.
  • [28] F. L. Opolka, A. Solomon, C. Cangea, P. Veličković, P. Liò, and R. D. Hjelm, “Spatio-temporal deep graph infomax,” arXiv preprint arXiv:1904.06316, 2019.
  • [29] K. Ma, H. Yang, H. Yang, T. Jin, P. Chen, Y. Chen, B. F. Kamhoua, and J. Cheng, “Improving graph representation learning by contrastive regularization,” arXiv preprint arXiv:2101.11525, 2021.
  • [30] B. Jing, C. Park, and H. Tong, “Hdmi: High-order deep multiplex infomax,” arXiv preprint arXiv:2102.07810, 2021.
  • [31] Y. Ren, B. Liu, C. Huang, P. Dai, L. Bo, and J. Zhang, “Heterogeneous deep graph infomax,” arXiv preprint arXiv:1911.08538, 2019.
  • [32] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “Gpt-gnn: Generative pre-training of graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1857–1867.
  • [33] H. Zhang, S. Lin, W. Liu, P. Zhou, J. Tang, X. Liang, and E. P. Xing, “Iterative graph self-distillation,” arXiv preprint arXiv:2010.12609, 2020.
  • [34] J. Zeng and P. Xie, “Contrastive self-supervised learning for graph classification,” arXiv preprint arXiv:2009.05923, 2020.
  • [35] Z. T. Kefato and S. Girdzijauskas, “Self-supervised graph neural networks without explicit negative sampling,” arXiv preprint arXiv:2103.14958, 2021.
  • [36] Q. Zhu, Y. Xu, H. Wang, C. Zhang, J. Han, and C. Yang, “Transfer learning of graph neural networks with ego-graph information maximization,” arXiv preprint arXiv:2009.05204, 2020.
  • [37] J. Cao, X. Lin, S. Guo, L. Liu, T. Liu, and B. Wang, “Bipartite graph embedding via mutual information maximization,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 635–643.
  • [38] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, “Gcc: Graph contrastive coding for graph neural network pre-training,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1150–1160.
  • [39] S. Zhang, Z. Hu, A. Subramonian, and Y. Sun, “Motif-driven contrastive learning of graph representations,” arXiv preprint arXiv:2012.12533, 2020.
  • [40] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Graph contrastive learning with adaptive augmentation,” arXiv preprint arXiv:2010.14945, 2020.
  • [41] P. Bonacich, “Power and centrality: A family of measures,” American journal of sociology, vol. 92, no. 5, pp. 1170–1182, 1987.
  • [42] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
  • [43] N. Jovanović, Z. Meng, L. Faber, and R. Wattenhofer, “Towards robust graph contrastive learning,” arXiv preprint arXiv:2102.13085, 2021.
  • [44] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
  • [45] Y. Ren, J. Bai, and J. Zhang, “Label contrastive coding based graph neural network for graph classification,” arXiv preprint arXiv:2101.05486, 2021.
  • [46] J. Yu, H. Yin, M. Gao, X. Xia, X. Zhang, and N. Q. V. Hung, “Socially-aware self-supervised tri-training for recommendation,” arXiv preprint arXiv:2106.03569, 2021.
  • [47] X. Wang, N. Liu, H. Han, and C. Shi, “Self-supervised heterogeneous graph neural network with co-contrastive learning,” arXiv preprint arXiv:2105.09111, 2021.
  • [48] C. Mavromatis and G. Karypis, “Graph infoclust: Leveraging cluster-level node information for unsupervised graph representation learning,” arXiv preprint arXiv:2009.06946, 2020.
  • [49] F.-Y. Sun, J. Hoffmann, V. Verma, and J. Tang, “Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization,” arXiv preprint arXiv:1908.01000, 2019.
  • [50] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in International Conference on Machine Learning.   PMLR, 2018, pp. 531–540.
  • [51] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” arXiv preprint arXiv:1606.00709, 2016.
  • [52] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
  • [53] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1857–1865.
  • [54] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” arXiv preprint arXiv:1907.13625, 2019.
  • [55] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [56] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 403–412.
  • [57] M. Kemertas, L. Pishdad, K. G. Derpanis, and A. Fazly, “Rankmi: A mutual information maximizing ranking loss,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 362–14 371.
  • [58] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [59] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng, “Self-supervised graph representation learning via global context prediction,” arXiv preprint arXiv:2003.01604, 2020.
  • [60] J. Macqueen, “Some methods for classification and analysis of multivariate observations,” in In 5-th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
  • [61] Z. Gao, H. Lin, S. Li et al., “Clustering based on graph of density topology,” arXiv preprint arXiv:2009.11612, 2020.
  • [62] L. Wu, Z. Liu, Z. Zang, J. Xia, S. Li, S. Li et al., “Deep clustering and representation learning that preserves geometric structures,” arXiv preprint arXiv:2009.09590, 2020.
  • [63] S. Z. Li, L. Wu, and Z. Zang, “Consistent representation learning for high dimensional data analysis,” arXiv preprint arXiv:2012.00481, 2020.
  • [64] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu, “Deep spectral clustering using dual autoencoder network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4066–4075.
  • [65] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5147–5156.
  • [66] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in International conference on machine learning, 2016, pp. 478–487.
  • [67] R. McConville, R. Santos-Rodriguez, R. J. Piechocki, and I. Craddock, “N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding,” arXiv preprint arXiv:1908.05968, 2019.
  • [68] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.
  • [69] D. Hwang, J. Park, S. Kwon, K.-M. Kim, J.-W. Ha, and H. J. Kim, “Self-supervised auxiliary learning with meta-paths for heterogeneous graphs,” arXiv preprint arXiv:2007.08294, 2020.
  • [70] P. Wang, K. Agarwal, C. Ham, S. Choudhury, and C. K. Reddy, “Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks,” arXiv preprint arXiv:2007.11192, 2020.
  • [71] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
  • [72] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of statistical mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
  • [73] V. A. Traag, L. Waltman, and N. J. Van Eck, “From louvain to leiden: guaranteeing well-connected communities,” Scientific reports, vol. 9, no. 1, pp. 1–12, 2019.
  • [74] Y. Zhu, Y. Xu, F. Yu, S. Wu, and L. Wang, “Cagnn: Cluster-aware graph neural networks for unsupervised graph representation learning,” arXiv preprint arXiv:2009.01674, 2020.
  • [75] K. Sun, Z. Lin, and Z. Zhu, “Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5892–5899.
  • [76] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [77] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [78] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures from non-enzymes without alignments,” Journal of molecular biology, vol. 330, no. 4, pp. 771–783, 2003.
  • [79] M. Zitnik and J. Leskovec, “Predicting multicellular function through multi-layer tissue networks,” Bioinformatics, vol. 33, no. 14, pp. i190–i198, 2017.
  • [80] C. Park, D. Kim, J. Han, and H. Yu, “Unsupervised attributed multiplex network embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5371–5378.
  • [81] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” arXiv preprint arXiv:2103.03230, 2021.
  • [82] B. Hao, J. Zhang, H. Yin, C. Li, and H. Chen, “Pre-training graph neural networks for cold-start users and items representation,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 265–273.
  • [83] L. Yu, S. Pei, C. Zhang, L. Ding, J. Zhou, L. Li, and X. Zhang, “Self-supervised smoothing graph neural networks,” arXiv preprint arXiv:2009.00934, 2020.
  • [84] Q. Sun, H. Peng, J. Li, J. Wu, Y. Ning, P. S. Yu, and L. He, “Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism,” arXiv preprint arXiv:2101.08170, 2021.
  • [85] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [86] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” arXiv preprint arXiv:2010.01028, 2020.
  • [87] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” arXiv preprint arXiv:2010.04592, 2020.
  • [88] J. Shang, T. Ma, C. Xiao, and J. Sun, “Pre-training of graph augmented transformers for medication recommendation,” arXiv preprint arXiv:1906.00346, 2019.
  • [89] J. Zhang, K. Chen, and Y. Wang, “Pre-training on dynamic graph neural networks,” arXiv preprint arXiv:2102.12380, 2021.
  • [90] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
  • [91] V. Verma, M.-T. Luong, K. Kawaguchi, H. Pham, and Q. V. Le, “Towards domain-agnostic contrastive learning,” arXiv preprint arXiv:2011.04419, 2020.
  • [92] S. Wan, S. Pan, J. Yang, and C. Gong, “Contrastive and generative graph convolutional networks for graph-based semi-supervised learning,” arXiv preprint arXiv:2009.07111, 2020.
  • [93] B. Chen, J. Zhang, X. Zhang, X. Tang, L. Cai, H. Chen, C. Li, P. Zhang, and J. Tang, “Coad: Contrastive pre-training with adversarial fine-tuning for zero-shot expert linking,” arXiv preprint arXiv:2012.11336, 2020.
  • [94] T. Kipf, E. van der Pol, and M. Welling, “Contrastive learning of structured world models,” arXiv preprint arXiv:1911.12247, 2019.
  • [95] S. Cheng, L. Zhang, B. Jin, Q. Zhang, and X. Lu, “Drug target prediction using graph representation learning via substructures contrast,” 2021.
  • [96] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang, “Self-supervised graph-level representation learning with local and global structure,” 2021. [Online]. Available: https://openreview.net/forum?id=DAaaaqPv9-q
  • [97] L. Xu, A. Krzyzak, and E. Oja, “Rival penalized competitive learning for clustering analysis, rbf net, and curve detection,” IEEE Transactions on Neural networks, vol. 4, no. 4, pp. 636–649, 1993.
  • [98] J. Yu, H. Yin, J. Li, Q. Wang, N. Q. V. Hung, and X. Zhang, “Self-supervised multi-channel hypergraph convolutional network for social recommendation,” arXiv preprint arXiv:2101.06448, 2021.
  • [99] C. Wang and Z. Liu, “Graph representation learning by ensemble aggregating subgraphs via mutual information maximization,” arXiv preprint arXiv:2103.13125, 2021.
  • [100] B. Fatemi, L. E. Asri, and S. M. Kazemi, “Slaps: Self-supervision improves structure learning for graph neural networks,” arXiv preprint arXiv:2102.05034, 2021.
  • [101] X. Gao, W. Hu, and G.-J. Qi, “TopoTER: Unsupervised learning of topology transformation equivariant representations,” 2021. [Online]. Available: https://openreview.net/forum?id=9az9VKjOx00
  • [102] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp. 912–919.
  • [103] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, pp. 93–93, 2008.
  • [104] A. Sehanobish, N. G. Ravindra, and D. van Dijk, “Self-supervised edge features for improved graph neural network training,” arXiv preprint arXiv:2007.04777, 2020.
  • [105] M. Yasunaga and P. Liang, “Graph-based, self-supervised program repair from diagnostic feedback,” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 799–10 808.
  • [106] T. Huang, Y. Pei, V. Menkovski, and M. Pechenizkiy, “Hop-count based self-supervised anomaly detection on attributed networks,” arXiv preprint arXiv:2104.07917, 2021.
  • [107] C. L. Giles, K. D. Bollacker, and S. Lawrence, “Citeseer: An automatic citation indexing system,” in Proceedings of the third ACM conference on Digital libraries, 1998, pp. 89–98.
  • [108] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automating the construction of internet portals with machine learning,” Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
  • [109] P. Mernyei and C. Cangea, “Wiki-cs: A wikipedia-based benchmark for graph neural networks,” arXiv preprint arXiv:2007.02901, 2020.
  • [110] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls of graph neural network evaluation,” arXiv preprint arXiv:1811.05868, 2018.
  • [111] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” arXiv preprint arXiv:2005.00687, 2020.
  • [112] J. Li, X. Hu, J. Tang, and H. Liu, “Unsupervised streaming feature selection in social media,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015, pp. 1041–1050.
  • [113] F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” Acm transactions on interactive intelligent systems (tiis), vol. 5, no. 4, pp. 1–19, 2015.
  • [114] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels.” Journal of Machine Learning Research, vol. 12, no. 9, 2011.
  • [115] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel, “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. suppl_1, pp. i47–i56, 2005.
  • [116] N. Wale, I. A. Watson, and G. Karypis, “Comparison of descriptor spaces for chemical compound retrieval and classification,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
  • [117] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity,” Journal of medicinal chemistry, vol. 34, no. 2, pp. 786–797, 1991.
  • [118] N. Kriege and P. Mutzel, “Subgraph matching kernels for attributed graphs,” arXiv preprint arXiv:1206.6483, 2012.
  • [119] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific data, vol. 1, no. 1, pp. 1–7, 2014.
  • [120] I. F. Martins, A. L. Teixeira, L. Pinheiro, and A. O. Falcao, “A bayesian approach to in silico blood-brain barrier penetration modeling,” Journal of chemical information and modeling, vol. 52, no. 6, pp. 1686–1697, 2012.
  • [121] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “Deeptox: toxicity prediction using deep learning,” Frontiers in Environmental Science, vol. 3, p. 80, 2016.
  • [122] R. Huang, M. Xia, D.-T. Nguyen, T. Zhao, S. Sakamuru, J. Zhao, S. A. Shahane, A. Rossoshek, and A. Simeonov, “Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs,” Frontiers in Environmental Science, vol. 3, p. 85, 2016.
  • [123] A. M. Richard, R. S. Judson, K. A. Houck, C. M. Grulke, P. Volarath, I. Thillainadarajah, C. Yang, J. Rathman, M. T. Martin, J. F. Wambaugh et al., “Toxcast chemical landscape: paving the road to 21st century toxicology,” Chemical research in toxicology, vol. 29, no. 8, pp. 1225–1251, 2016.
  • [124] P. A. Novick, O. F. Ortiz, J. Poelman, A. Y. Abdulhay, and V. S. Pande, “Sweetlead: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery,” PloS one, vol. 8, no. 11, p. e79568, 2013.
  • [125] E. J. Gardiner, J. D. Holliday, C. O’Dowd, and P. Willett, “Effectiveness of 2d fingerprints for scaffold hopping,” Future medicinal chemistry, vol. 3, no. 4, pp. 405–414, 2011.
  • [126] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork, “The sider database of drugs and side effects,” Nucleic acids research, vol. 44, no. D1, pp. D1075–D1079, 2016.
  • [127] G. Subramanian, B. Ramsundar, V. Pande, and R. A. Denny, “Computational modeling of β-secretase 1 (bace-1) inhibitors using ligand based approaches,” Journal of chemical information and modeling, vol. 56, no. 10, pp. 1936–1949, 2016.
  • [128] K. Riesen and H. Bunke, “Iam graph database repository for graph based pattern recognition and machine learning,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).   Springer, 2008, pp. 287–297.
  • [129] J. Kazius, R. McGuire, and R. Bursi, “Derivation and validation of toxicophores for mutagenicity prediction,” Journal of medicinal chemistry, vol. 48, no. 1, pp. 312–320, 2005.
  • [130] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [131] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [132] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Communications of the ACM, vol. 57, no. 7, pp. 86–94, 2014.