
Self-supervised Learning on Graphs: Contrastive, Generative, or Predictive

Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z. Li
Abstract

Deep learning on graphs has recently achieved remarkable success on a variety of tasks, while such success relies heavily on the massive and carefully labeled data. However, precise annotations are generally very expensive and time-consuming. To address this problem, self-supervised learning (SSL) is emerging as a new paradigm for extracting informative knowledge through well-designed pretext tasks without relying on manual labels. In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive. More importantly, unlike other surveys that only provide a high-level description of published research, we present an additional mathematical summary of existing works in a unified framework. Furthermore, to facilitate methodological development and empirical comparisons, we also summarize the commonly used datasets, evaluation metrics, downstream tasks, open-source implementations, and experimental study of various algorithms. Finally, we discuss the technical challenges and potential future directions for improving graph self-supervised learning. Latest advances in graph SSL are summarized in a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning.

Index Terms:
Deep Learning, Self-supervised Learning, Graph Neural Networks, Unsupervised Learning, Survey.

1 Introduction

In recent years, deep learning on graphs has emerged as a popular research topic, due to its ability to capture both graph structures and node/edge features. However, most of the works have focused on supervised or semi-supervised learning settings, where the model is trained by specific downstream tasks with abundant labeled data, which are often limited, expensive, and inaccessible. Due to the heavy reliance on the number and quality of labels, these methods are hardly applicable to real-world scenarios, especially those requiring expert knowledge for annotation, such as medicine, meteorology, transportation, etc. More importantly, these methods are prone to suffer from problems of over-fitting, poor generalization, and weak robustness [1].

Recent advances in SSL [2, 3, 4, 5, 6, 7] have provided novel insights into reducing the dependency on annotated labels and enable the training on massive unlabeled data. The primary goal of SSL is to learn transferable knowledge from abundant unlabeled data with well-designed pretext tasks and then generalize the learned knowledge to downstream tasks with specific supervision signals. Recently, SSL has shown its promising capability in the field of computer vision (CV) [2, 3, 4] and natural language processing (NLP) [5, 6, 7]. For example, Moco [2] and SimCLR [3] augment image data by various means and then train Convolutional Neural Networks (CNNs) to capture dependencies between different augmentations. Besides, BERT [5] pre-trains the model with Masked LM and Next Sentence Prediction as pretext tasks, achieving state-of-the-art performance on up to 11 tasks compared to those conventional methods. Inspired by the success of SSL in CV and NLP domains, it is a promising direction to apply SSL to the graph domain to fully exploit graph structure information and rich unlabeled data.

Compared with image and language sequence data, applying SSL to graph data is very important and has promising research prospects. Firstly, along with node features, graph data also contains structure, where extensive pretext tasks can be designed to capture the intrinsic relations of nodes. Secondly, real-world graphs are usually formed by specific rules, e.g., links between atoms in molecular graphs are constrained by valence bond theory. Thus, extensive expert knowledge can be incorporated as priors into the design of pretext tasks. Finally, graph data generally supports transductive learning [8], such as node classification, where the features of Train/Val/Test samples are all available during the training process, which makes it possible to design more feature-related pretext tasks.

(a) Contrastive Method
(b) Generative Method
(c) Predictive Method
Figure 1: A comparison among the contrastive, generative, and predictive methods. (a): the contrastive method contrasts the views generated from different augmentations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot). The information about the differences and sameness between (inter-data) data-data pairs is used as the self-supervision signal. (b): the generative method focuses on the (intra-data) information embedded in the graph, generally based on pretext tasks such as reconstruction, which exploit the attributes and structures of graph data as self-supervision signals. (c): the predictive method generally self-generates labels by simple statistical analysis or expert knowledge, and designs prediction-based pretext tasks based on the self-generated labels to handle the data-label relationship.

Though some works have been proposed recently to apply SSL to graph data and have achieved remarkable success [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], it is still very challenging to deal with the inherent differences between grid-like and graph-structured data. Firstly, the topology of an image is a fixed grid and text is a simple sequence, while graphs are not restricted to these rigid structures. Secondly, unlike the assumption of independent and identically distributed samples for image and text data, nodes in a graph are correlated with each other rather than completely independent. This requires us to design pretext tasks by considering both node attributes and graph structures. Finally, there exists a gap between self-supervised pretext tasks and downstream tasks due to the divergence of their optimization objectives. Inevitably, such divergence will significantly hurt the generalization of learned models. Therefore, it is crucial to reconsider the objectives of pretext tasks to better match those of downstream tasks and make them consistent with each other.

In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of the existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive, as shown in Fig. 1. The core contributions of this survey are as follows:

  • Present comprehensive and up-to-date reviews on existing graph SSL methods and divide them into three categories: contrastive, generative, and predictive, to improve their clarity and accessibility.

  • Summarize the core mathematical ideas of recent research in graph SSL within a unified framework.

  • Summarize the commonly used datasets, evaluation metrics, downstream tasks, open-source codes, and experimental study of surveyed methods, setting the stage for developments of future works.

  • Point out the technical limitations of current research and discuss promising directions on graph SSL.

Compared to the existing surveys on SSL [1], we focus purely on SSL for graph data and present a more mathematical review of the recent progress from 2019 to 2021. Though there have been two surveys on graph SSL, we argue that they are immature works with various flaws and shortcomings. For example, [20] clumsily describes each method in 1-2 sentences, lacking deep insight into the mathematical ideas and implementation details behind them. Moreover, the number of reviewed methods in [21] is fewer than half of ours, as it devotes too much space to less important content but ignores the core of graph SSL, i.e., the design of the pretext task. Compared with [20, 21], our advantages are as follows: (1) more mathematical details, striving to summarize each method with one equation; (2) more implementation details, including statistics of 41 datasets (vs 20 datasets in [20]), evaluation metrics, and open-source code; (3) a more thorough experimental study for fair comparison; (4) a more fine-grained, clarified, and rational taxonomy; (5) more surveyed works, 71 methods (vs 47 methods in [21] and 18 methods in [20]); (6) a more up-to-date review, with almost all surveyed works published after 2019.

2 Problem Statement

2.1 Notions and Definitions

Unless particularly specified, the notations used in this survey are illustrated in Table I.

Definition 1 (Graph): We use g=(𝒱,)g=(\mathcal{V},\mathcal{E}) to denote a graph where 𝒱\mathcal{V} is the set of NN nodes and \mathcal{E} is the set of MM edges. Let vi𝒱v_{i}\in\mathcal{V} denote a node and ei,je_{i,j} denote an edge between node viv_{i} and vjv_{j}. The ll-hop neighborhood of a node viv_{i} is denoted as 𝒩i(l)={vj𝒱|d(vi,vj)l}\mathcal{N}_{i}^{(l)}=\{v_{j}\in\mathcal{V}|d(v_{i},v_{j})\leq l\} where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and vjv_{j}. In particular, the 11-hop neighborhood of a node viv_{i} is denoted as 𝒩i=𝒩i(1)={vj𝒱|ei,j}\mathcal{N}_{i}=\mathcal{N}_{i}^{(1)}=\{v_{j}\in\mathcal{V}|e_{i,j}\in\mathcal{E}\}. The graph structure can also be represented by an adjacency matrix 𝐀[0,1]N×N\mathbf{A}\in[0,1]^{N\times N} with 𝐀i,j=1\mathbf{A}_{i,j}=1 if ei,je_{i,j}\in\mathcal{E} and 𝐀i,j=0\mathbf{A}_{i,j}=0 if ei,je_{i,j}\notin\mathcal{E}.

Definition 2 (Attributed Graph): An attributed graph, as opposed to an unattributed one, refers to a graph whose nodes or edges are associated with their own features (a.k.a. attributes). For example, each node viv_{i} in graph gg may be associated with a feature vector 𝐱id0\mathbf{x}_{i}\in\mathbb{R}^{d_{0}}; such a graph is referred to as an attributed graph g=(𝒱,,𝐗)g=(\mathcal{V},\mathcal{E},\mathbf{X}) or g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), where 𝐗=[𝐱1,𝐱2,,𝐱N]\mathbf{X}=\left[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\right] is the node feature matrix. Meanwhile, an attributed graph g=(𝒱,,𝐗e)g=(\mathcal{V},\mathcal{E},\mathbf{X}^{e}) may have edge attributes 𝐗e\mathbf{X}^{e}, where 𝐗eM×b0\mathbf{X}^{e}\in\mathbb{R}^{M\times b_{0}} is an edge feature matrix with 𝐱i,jeb0\mathbf{x}^{e}_{i,j}\in\mathbb{R}^{b_{0}} being the feature vector of edge ei,je_{i,j}.

Definition 3 (Dynamic Graph): A dynamic graph is a special attributed graph where the node set, graph structure and node attributes may change dynamically over time. The dynamic graph can be formalized as g=(𝒱(t),(t),𝐗(t))g=(\mathcal{V}^{(t)},\mathcal{E}^{(t)},\mathbf{X}^{(t)}) or g=(𝐀(t),𝐗(t))g=(\mathbf{A}^{(t)},\mathbf{X}^{(t)}), where (t)\mathcal{E}^{(t)} represents the edge set at the time step tt and 𝐀i,j(t)=1\mathbf{A}^{(t)}_{i,j}=1 denotes an interaction between node viv_{i} and vjv_{j} at the time step tt (1tT1\leq t\leq T).

Definition 4 (Heterogeneous Graph): Consider a graph g=(𝒱,)g=(\mathcal{V},\mathcal{E}) with a node type mapping function fv:𝒱𝒴vf_{v}:\mathcal{V}\rightarrow\mathcal{Y}^{v} and an edge type mapping function fe:𝒴ef_{e}:\mathcal{E}\rightarrow\mathcal{Y}^{e}, where 𝒴v\mathcal{Y}^{v} is the set of node types and 𝒴e\mathcal{Y}^{e} is the set of edge types. For a graph with more than one type of node or edge, e.g., |𝒴v|>1\left|\mathcal{Y}^{v}\right|>1 or |𝒴e|>1\left|\mathcal{Y}^{e}\right|>1, we define it as a heterogeneous graph, otherwise, it is a homogeneous graph. There are some special types of heterogeneous graphs: a bipartite graph with |𝒴v|=2\left|\mathcal{Y}^{v}\right|=2 and |𝒴e|=1\left|\mathcal{Y}^{e}\right|=1, and a multiplex graph with |𝒴v|=1\left|\mathcal{Y}^{v}\right|=1 and |𝒴e|>1\left|\mathcal{Y}^{e}\right|>1.

Definition 5 (Spatial-Temporal Graph): A spatial-temporal graph is a special dynamic graph in which only the node attributes change over time, while the node set and graph structure remain unchanged. The spatial-temporal graph is defined as g=(𝒱,,𝐗(t))g=(\mathcal{V},\mathcal{E},\mathbf{X}^{(t)}) or g=(𝐀,𝐗(t))g=(\mathbf{A},\mathbf{X}^{(t)}), where 𝐗(t)N×d0\mathbf{X}^{(t)}\in\mathbb{R}^{N\times d_{0}} is the node feature matrix at the time step tt (1tT1\leq t\leq T).

2.2 Downstream Tasks

The downstream tasks for graph data are divided into three categories: node-level, link-level, and graph-level tasks. A node-level graph encoder fθ()f_{\theta}(\cdot) is often used to generate node embeddings for each node, and a graph-level graph encoder fγ()f_{\gamma}(\cdot) is used to generate graph-level embeddings. Finally, the learned embeddings are fed into an optional prediction head gω()g_{\omega}(\cdot) to perform specific downstream tasks.

Node-level tasks. Node-level tasks focus on the properties of nodes, and node classification is a typical node-level task where only a subset of nodes 𝒱L\mathcal{V}_{L} with corresponding labels 𝒴L\mathcal{Y}_{L} is known; we denote the labeled data as 𝒟L=(𝒱L,𝒴L)\mathcal{D}_{L}=(\mathcal{V}_{L},\mathcal{Y}_{L}) and the unlabeled data as 𝒟U=(𝒱U,𝒴U)\mathcal{D}_{U}=(\mathcal{V}_{U},\mathcal{Y}_{U}). Let fθ:𝒱𝒴f_{\theta}:\mathcal{V}\rightarrow\mathcal{Y} be a graph encoder trained on labeled data 𝒟L\mathcal{D}_{L} so that it can be used to infer the labels 𝒴U\mathcal{Y}_{U} of unlabeled data. Thus, the objective function for node classification can be defined as minimizing the loss node\mathcal{L}_{node}, as follows

minθ,ωnode\displaystyle\min_{\theta,\omega}\mathcal{L}_{node} (𝐀,𝐗,θ,ω)=(vi,yi)𝒟L(gω(𝐡i),yi)\displaystyle\left(\mathbf{A},\mathbf{X},\theta,\omega\right)=\sum_{(v_{i},y_{i})\in\mathcal{D}_{L}}\ell\big{(}g_{\omega}(\mathbf{h}_{i}),y_{i}\big{)} (1)

where 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐡i\mathbf{h}_{i} is the embedding of node viv_{i} in the embedding matrix 𝐇\mathbf{H}. (,)\ell(\cdot,\cdot) denotes the cross entropy.
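
To make Equ. 1 concrete, below is a minimal PyTorch sketch (not taken from any surveyed work) of the node classification objective: the embedding matrix stands in for the output of a node-level encoder, a linear layer plays the role of the prediction head, and all shapes, names, and data are illustrative.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of Equ. 1: cross-entropy over the labeled node subset D_L.
# H stands in for the output of a node-level encoder f_theta(A, X);
# g_omega is a linear prediction head mapping embeddings to class logits.
N, d_L, num_classes = 100, 64, 7
H = torch.randn(N, d_L, requires_grad=True)      # placeholder for f_theta(A, X)
g_omega = torch.nn.Linear(d_L, num_classes)

labeled_idx = torch.arange(20)                   # indices of the labeled nodes V_L
y_L = torch.randint(0, num_classes, (20,))       # their labels Y_L

logits = g_omega(H[labeled_idx])                 # g_omega(h_i) for v_i in V_L
loss_node = F.cross_entropy(logits, y_L)         # l(g_omega(h_i), y_i) averaged over D_L
loss_node.backward()                             # gradients flow to theta (via H) and omega
```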

Link-level tasks. Link-level tasks focus on the representation of node pairs or the properties of edges. Taking link prediction as an example, given two nodes, the goal is to discriminate whether there is an edge between them. Thus, the objective function for link prediction can be defined as,

minθ,ωlink(𝐀,𝐗,θ,ω)=vi,vj𝒱,ij(gω(𝐡i,𝐡j),𝐀i,j)\begin{split}\min_{\theta,\omega}\mathcal{L}_{link}\left(\mathbf{A},\mathbf{X},\theta,\omega\right)=\sum_{v_{i},v_{j}\in\mathcal{V},i\neq j}\ell\big{(}g_{\omega}(\mathbf{h}_{i},\mathbf{h}_{j}),\mathbf{A}_{i,j}\big{)}\end{split} (2)

where 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐡i\mathbf{h}_{i} is the embedding of node viv_{i} in the embedding matrix 𝐇\mathbf{H}. gω()g_{\omega}(\cdot) linearly maps the input to a one-dimensional value, and (,)\ell(\cdot,\cdot) is the cross-entropy loss.

TABLE I: Notations used in this paper.
Notations Descriptions
m\mathbb{R}^{m} mm-dimensional Euclidean space
a,𝐚,𝐀a,\mathbf{a},\mathbf{A} Scalar, vector, matrix
𝒢\mathcal{G} The set of graphs, 𝒢={g1,g2,,g|𝒢|}\mathcal{G}=\{g_{1},g_{2},\cdots,g_{|\mathcal{G}|}\}
gg A graph g=(𝒱,)g=(\mathcal{V},\mathcal{E})
𝒱\mathcal{V} The set of nodes in graph gg
viv_{i} A node vi𝒱v_{i}\in\mathcal{V}
\mathcal{E} The set of edges in graph gg
ei,je_{i,j} An edge ei,je_{i,j}\in\mathcal{E} between node viv_{i} and node vjv_{j}
NN Number of nodes, N=|𝒱|N=|\mathcal{V}|
MM Number of edges, M=||M=|\mathcal{E}|
𝐀N×N\mathbf{A}\in\mathbb{R}^{N\times N} A graph adjacency matrix
𝐃\mathbf{D} The degree matrix of 𝐀\mathbf{A}, 𝐃ii=j=1n𝐀ij\mathbf{D}_{ii}=\sum_{j=1}^{n}\mathbf{A}_{ij}
𝐈N\mathbf{I}_{N} Identity matrix of dimension NN
𝒩i(l)\mathcal{N}_{i}^{(l)} ll-hop Neighborhood set of node viv_{i}
𝒩i\mathcal{N}_{i} 11-hop Neighborhood set of node viv_{i}
LL The layer number
ll The layer index, 1lL1\leq l\leq L
TT The time step/iteration number
tt The time step/iteration index, 1tT1\leq t\leq T
d0d_{0} Dimension of node feature vectors
dld_{l} Dimension of node embeddings in the ll-th layer
b0b_{0} Dimension of edge feature vectors
𝐱id0\mathbf{x}_{i}\in\mathbb{R}^{d_{0}} Feature vector of node viv_{i}
𝐗N×d0\mathbf{X}\in\mathbb{R}^{N\times d_{0}} Node feature matrix, 𝐗=[𝐱1,𝐱2,,𝐱N]\mathbf{X}=\left[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\right]
𝐗(t)N×d0\mathbf{X}^{(t)}\in\mathbb{R}^{N\times d_{0}} Node feature matrix at the time step tt
𝐱i,jeb0\mathbf{x}^{e}_{i,j}\in\mathbb{R}^{b_{0}} Feature vector of edge ei,je_{i,j}
𝐗eM×b0\mathbf{X}^{e}\in\mathbb{R}^{M\times b_{0}} Edge feature matrix
𝐡i(l)dl\mathbf{h}_{i}^{(l)}\in\mathbb{R}^{d_{l}} Node embedding of node viv_{i} in the ll-th layer
𝐇(l)N×dl\mathbf{H}^{(l)}\in\mathbb{R}^{N\times d_{l}} Embedding matrix in the ll-th layer
𝐡idL\mathbf{h}_{i}\in\mathbb{R}^{d_{L}} Node embedding in the LL-th layer, 𝐡i=𝐡i(L)\mathbf{h}_{i}=\mathbf{h}_{i}^{(L)}
𝐇N×dL\mathbf{H}\in\mathbb{R}^{N\times d_{L}} Embedding matrix in the LL-th layer, 𝐇=𝐇(L)\mathbf{H}=\mathbf{H}^{(L)}
𝐡gdL\mathbf{h}_{g}\in\mathbb{R}^{d_{L}} Graph-level representation of graph gg
|||\cdot| The length of a set
\odot Element-wise multiplication operation
\| Vector concatenation
σ()\sigma(\cdot) The logistic sigmoid activation function
tanh()\tanh(\cdot) The hyperbolic tangent activation function
LeakyReLU()(\cdot) The LeakyReLU activation function
READOUT()\operatorname{READOUT}(\cdot) The readout function
fθ,fθ1,fθ2,f_{\theta},f_{\theta_{1}},f_{\theta_{2}},\cdots Node-level encoder to output 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})
fγ,fγ1,fγ2,f_{\gamma},f_{\gamma_{1}},f_{\gamma_{2}},\cdots Graph-level encoder to output 𝐡g=fγ(𝐀,𝐗)\mathbf{h}_{g}=f_{\gamma}(\mathbf{A},\mathbf{X})
gω,gω1,gω2,g_{\omega},g_{\omega_{1}},g_{\omega_{2}},\cdots The prediction head
𝒯,𝒯1,𝒯2,\mathcal{T},\mathcal{T}_{1},\mathcal{T}_{2},\cdots The data augmentation transformation
𝐖,Θ,θ,γ,ω\mathbf{W},\Theta,\theta,\gamma,\omega Learnable model parameters

Graph-level tasks. Graph-level tasks learn from multiple graphs in a dataset and predict the property of a single graph. For example, graph regression is a typical graph-level task where only a subset of graphs 𝒢L\mathcal{G}_{L} with corresponding properties 𝒫L\mathcal{P}_{L} are known, and we denote it as 𝒟L=(𝒢L,𝒫L)\mathcal{D}_{L}=(\mathcal{G}_{L},\mathcal{P}_{L}). Let fγ:𝒢𝒫f_{\gamma}:\mathcal{G}\rightarrow\mathcal{P} be a graph encoder trained on labeled data 𝒟L\mathcal{D}_{L} and then used to infer the properties 𝒫U\mathcal{P}_{U} of unlabeled graphs 𝒢U\mathcal{G}_{U}. Thus, the objective function for graph regression can be defined as minimizing loss graph\mathcal{L}_{graph},

minγ,ωgraph(𝐀i,𝐗i,γ,ω)=(gi,pi)𝒟L(gω(𝐡gi),pi)\min_{\gamma,\omega}\mathcal{L}_{graph}\left(\mathbf{A}_{i},\mathbf{X}_{i},\gamma,\omega\right)=\sum_{\left(g_{i},p_{i}\right)\in\mathcal{D}_{L}}\ell\big{(}g_{\omega}(\mathbf{h}_{g_{i}}),p_{i}\big{)} (3)

where 𝐡gi=fγ(𝐀i,𝐗i)\mathbf{h}_{g_{i}}=f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}) is the graph-level representation of graph gig_{i}. gω()g_{\omega}(\cdot) linearly maps the input to a one-dimensional value, and (,)\ell(\cdot,\cdot) is the mean absolute error.
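
As a concrete illustration of Equ. 3, the following hedged PyTorch sketch computes graph-level representations with a simple mean READOUT and applies the mean absolute error over a toy batch of labeled graphs; the per-graph node embeddings stand in for the output of a graph-level encoder, and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of Equ. 3: graph regression with a mean-pooling READOUT and an MAE loss.
# Each graph g_i is represented here only by a placeholder node embedding matrix;
# the encoder f_gamma and the prediction head g_omega are stand-ins.
g_omega = torch.nn.Linear(64, 1)                       # maps h_{g_i} to a scalar property

# toy batch: three graphs with different numbers of nodes, plus their properties p_i
H_per_graph = [torch.randn(n, 64) for n in (10, 25, 17)]
p = torch.tensor([0.3, -1.2, 0.8])

h_g = torch.stack([H.mean(dim=0) for H in H_per_graph])    # graph-level READOUT
loss_graph = F.l1_loss(g_omega(h_g).squeeze(-1), p)        # mean absolute error
```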

2.3 Graph Neural Networks

Graph neural networks (GNNs) [22, 23, 8] are a family of neural networks that have been widely used as the backbone encoder in most of the reviewed works. A general GNN framework involves two key computations for each node viv_{i} at every layer: (1) AGGREGATE\operatorname{AGGREGATE} operation: aggregating messages from the neighborhood 𝒩i\mathcal{N}_{i}; (2) UPDATE\operatorname{UPDATE} operation: updating the node representation from its representation in the previous layer and the aggregated messages. Considering an LL-layer GNN, the formulation of the ll-th layer is as follows

𝐚i(l)\displaystyle\mathbf{a}_{i}^{(l)} =AGGREGATE(l)({𝐡j(l1):vj𝒩i})\displaystyle=\operatorname{AGGREGATE}^{(l)}\left(\left\{\mathbf{h}_{j}^{(l-1)}:v_{j}\in\mathcal{N}_{i}\right\}\right) (4)
𝐡i(l)\displaystyle\mathbf{h}_{i}^{(l)} =UPDATE(l)(𝐡i(l1),𝐚i(l))\displaystyle=\operatorname{UPDATE}^{(l)}\left(\mathbf{h}_{i}^{(l-1)},\mathbf{a}_{i}^{(l)}\right)

where 1lL1\leq l\leq L and 𝐡i(l)\mathbf{h}_{i}^{(l)} is the embedding of node viv_{i} in the ll-th layer with 𝐡i(0)=𝐱i\mathbf{h}_{i}^{(0)}=\mathbf{x}_{i}. For node-level or edge-level tasks, the node representation 𝐡i(L)\mathbf{h}_{i}^{(L)} can sometimes be used for downstream tasks directly. However, for graph-level tasks, an extra READOUT\operatorname{READOUT} function is required to aggregate node features to obtain a graph-level representation 𝐡g\mathbf{h}_{g}, as follows

𝐡g=READOUT({𝐡i(L)vi𝒱})\displaystyle\mathbf{h}_{g}=\operatorname{READOUT}\left(\left\{\mathbf{h}_{i}^{(L)}\mid v_{i}\in\mathcal{V}\right\}\right) (5)

The design of these component functions is crucial, but it is beyond the scope of this paper. For a thorough review, we refer readers to the recent survey [24].
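
To illustrate Equ. 4 and Equ. 5, the hedged PyTorch sketch below uses one simple choice of component functions (mean aggregation, a linear update with concatenation, and a mean READOUT); it is not any specific published GNN, and the toy graph and dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One layer of Equ. 4: mean-AGGREGATE over neighbors, then a linear UPDATE."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.update = nn.Linear(2 * d_in, d_out)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        a = (A @ H) / deg                                  # AGGREGATE: mean of neighbor embeddings
        return torch.relu(self.update(torch.cat([H, a], dim=-1)))  # UPDATE

def readout(H):
    """Equ. 5: a permutation-invariant READOUT (here, a simple mean over nodes)."""
    return H.mean(dim=0)

# toy usage: a random symmetric adjacency matrix and random node features
N, d0 = 6, 8
A = (torch.rand(N, N) < 0.4).float()
A = ((A + A.t()) > 0).float()
A.fill_diagonal_(0)
X = torch.randn(N, d0)

layers = [SimpleGNNLayer(d0, 16), SimpleGNNLayer(16, 16)]
H = X
for layer in layers:
    H = layer(A, H)
h_g = readout(H)                                           # graph-level representation
```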

Refer to caption
(a) Pre-train&Fine-tune (P&F)
Refer to caption
(b) Joint Learning (JL)
Refer to caption
(c) Unsupervised Representation Learning (URL)
Figure 2: An overview of training strategies for graph SSL. The training strategies can be divided into three categories. (a): for the Pre-train&Fine-tune strategy, it first pre-trains the encoder fθ()f_{\theta}(\cdot) with unlabeled nodes by the self-supervised pretext tasks. The pre-trained encoder’s parameters θinit\theta_{init} are then used as the initialization of the encoder for supervised fine-tuning on downstream tasks. (b): for the Joint Learning strategy, an auxiliary pretext task is included to help learn the supervised downstream task. The encoder is trained through both the pretext task and the downstream task simultaneously. (c): for the Unsupervised Representation Learning strategy, it first pre-trains the encoder fθ()f_{\theta}(\cdot) with unlabeled nodes by the self-supervised pretext tasks. The pre-trained encoder’s parameters θinit\theta_{init} are then frozen and used in the supervised downstream task with additional labels.

2.4 Training Strategy

The training strategies can be divided into three categories: Pre-training and Fine-tuning (P&F), Joint Learning (JL), and Unsupervised Representation Learning (URL), with their detailed workflow shown in Fig. 2.

Pre-training and Fine-tuning (P&F). In this strategy, the model is trained in a two-stage paradigm [10]. The encoder fθ()f_{\theta}(\cdot) is pre-trained with the pretext tasks, then the pre-trained parameters θinit\theta_{init} are used as the initialization of the encoder fθinit()f_{\theta_{init}}(\cdot). At the fine-tuning stage, the pre-trained encoder fθinit()f_{\theta_{init}}(\cdot) is fine-tuned with a prediction head gω()g_{\omega}(\cdot) under the supervision of specific downstream tasks. The learning objective is formulated as

θ,ω=argmin(θ,ω)task(fθinit,gω)\theta^{*},\omega^{*}=\arg\min_{(\theta,\omega)}\mathcal{L}_{task}(f_{\theta_{init}},g_{\omega}) (6)

with initialization θinit=argminθssl(fθ)\theta_{init}=\arg\min_{\theta}\mathcal{L}_{ssl}(f_{\theta}), where task\mathcal{L}_{task} and ssl\mathcal{L}_{ssl} are the loss functions of the downstream tasks and the self-supervised pretext tasks, respectively.

Joint Learning. In this scheme, the encoder fθ()f_{\theta}(\cdot) is jointly trained with a prediction head gω()g_{\omega}(\cdot) under the supervision of pretext tasks and downstream tasks. The joint learning strategy can also be considered a kind of multi-task learning, where the pretext tasks serve as a regularization. The learning objective is formulated as

\theta^{*},\omega^{*}=\arg\min_{(\theta,\omega)}\left[\mathcal{L}_{task}(f_{\theta},g_{\omega})+\alpha\mathcal{L}_{ssl}(f_{\theta})\right] (7)

where α\alpha is a trade-off hyperparameter.

Unsupervised Representation Learning. This strategy can also be considered as a two-stage paradigm, with the first stage similar to Pre-training. However, at the second stage, the pre-trained parameters θinit\theta_{init} are frozen and the model is trained on the frozen representations with downstream tasks only. The learning objective is formulated as

ω=argminωtask(fθinit,gω)\omega^{*}=\arg\min_{\omega}\mathcal{L}_{task}(f_{\theta_{init}},g_{\omega}) (8)

with initialization θinit=argminθssl(fθ)\theta_{init}=\arg\min_{\theta}\mathcal{L}_{ssl}(f_{\theta}). Compared to other schemes, unsupervised representation learning is more challenging since the learning of the encoder fθ()f_{\theta}(\cdot) depends only on the pretext task and is frozen in the second stage. In contrast, in the P&F strategy, the encoder fθ()f_{\theta}(\cdot) can be further optimized under the supervision of the downstream task during the fine-tuning stage.
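
The three training strategies of Equ. 6, Equ. 7, and Equ. 8 can be summarized in a few lines of PyTorch. The sketch below uses toy linear modules, random data, and placeholder pretext/downstream losses purely for illustration; for brevity the same toy modules are reused across the three strategies, whereas in practice each strategy would start from a freshly initialized encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a linear "encoder" f_theta, a linear head g_omega, random data.
X = torch.randn(200, 16)
y = torch.randint(0, 3, (200,))
encoder, head = nn.Linear(16, 32), nn.Linear(32, 3)

def ssl_loss(enc):      # placeholder pretext loss (tied-weight feature reconstruction)
    Z = enc(X)
    return F.mse_loss(Z @ enc.weight, X)

def task_loss(enc, hd):  # supervised downstream loss (node classification style)
    return F.cross_entropy(hd(enc(X)), y)

def train(params, loss_fn, steps=50, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

# (a) Pre-train & Fine-tune (Equ. 6): pretext first, then fine-tune encoder + head
train(encoder.parameters(), lambda: ssl_loss(encoder))
train(list(encoder.parameters()) + list(head.parameters()),
      lambda: task_loss(encoder, head))

# (b) Joint Learning (Equ. 7): one loop over L_task + alpha * L_ssl
alpha = 0.5
train(list(encoder.parameters()) + list(head.parameters()),
      lambda: task_loss(encoder, head) + alpha * ssl_loss(encoder))

# (c) Unsupervised Representation Learning (Equ. 8): pretext, then freeze the encoder
train(encoder.parameters(), lambda: ssl_loss(encoder))
for p in encoder.parameters():
    p.requires_grad_(False)
train(head.parameters(), lambda: task_loss(encoder, head))
```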

3 Contrastive Learning

3.1 A Unified Perspective

Inspired by the recent advances of contrastive learning in the CV and NLP domains, some works have been proposed to apply contrastive learning to graph data. However, most works simply present motivations or implementations from different perspectives, but adopt very similar (or even the same) architectures and designs in practice, which leads to the emergence of duplicative efforts and hinders the healthy development of the community. In this survey, we therefore review existing work from a unified perspective, unify it into a general framework, and present various designs for the three main modules of contrastive learning, i.e., data augmentation, pretext tasks, and contrastive objectives. In turn, the contributions of existing work can be essentially summarized as innovations in these three modules.

In practice, we usually generate multiple views for each instance through various data augmentations. Two views generated from the same instance are usually considered as a positive pair, while two views generated from different instances are considered as a negative pair. The primary goal of contrastive learning is to maximize the agreement of two jointly sampled positive pairs against the agreement of two independently sampled negative pairs. The agreement between views is usually measured through Mutual Information (MI) estimation. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), KK different transformations 𝒯1,𝒯2,,𝒯K\mathcal{T}_{1},\mathcal{T}_{2},\cdots,\mathcal{T}_{K} can be applied to obtain multiple views {(𝐀k,𝐗k)}k=1K\{(\mathbf{A}_{k},\mathbf{X}_{k})\}_{k=1}^{K}, defined as

𝐀k,𝐗k=𝒯k(𝐀,𝐗);k=1,2,,K\mathbf{A}_{k},\mathbf{X}_{k}=\mathcal{T}_{k}(\mathbf{A},\mathbf{X});k=1,2,\cdots,K (9)

Secondly, a set of graph encoders {fθk}k=1K\{f_{\theta_{k}}\}_{k=1}^{K} (may be identical or share weights) can be used to generate different representations 𝐡1,𝐡2,,𝐡K\mathbf{h}_{1},\mathbf{h}_{2},\cdots,\mathbf{h}_{K} for each view, given by

\mathbf{h}_{k}=f_{\theta_{k}}(\mathbf{A}_{k},\mathbf{X}_{k});\quad k=1,2,\cdots,K (10)

The contrastive learning aims to maximize the mutual information of two views from the same instance as

maxθ1,θ2,,θKijiαi,j(𝐡i,𝐡j)\max_{\theta_{1},\theta_{2},\cdots,\theta_{K}}\sum_{i}\sum_{j\neq i}\alpha_{i,j}\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j}) (11)

where i,j{1,2,,K}i,j\in\{1,2,\cdots,K\}, {𝐡i}i=1K\{\mathbf{h}_{i}\}_{i=1}^{K} are representations generated from g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), which are taken as positive samples. (𝐡i,𝐡j)\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j}) denotes the mutual information between two representations 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j}. Note that depending on different pretext tasks, {𝐡k}k=1K\{\mathbf{h}_{k}\}_{k=1}^{K} may not be at the same scale, either being a node-level, subgraph-level, or graph-level representation. The negative samples to contrast with {𝐡i}i=1K\{\mathbf{h}_{i}\}_{i=1}^{K} can be taken as representations {𝐡~i}i=1K\{\widetilde{\mathbf{h}}_{i}\}_{i=1}^{K} that are generated from another graph g~=(𝐀~,𝐗~)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}). Besides, we have αi,j{0,1}\alpha_{i,j}\in\{0,1\}, whose concrete values vary across different schemes.

The design of the contrastive learning for graph data can be summarized as three main modules: (1) data augmentation strategy, (2) pretext task, and (3) contrastive objective. The design of graph encoder is not the focus of graph self-supervised learning and beyond the scope of this survey; for more details, please refer to the related survey [24].

3.2 Data Augmentation

The recent works in the CV domain show that the success of contrastive learning relies heavily on well-designed data augmentation strategies, and in particular, certain kinds of augmentations play a very important role in improving performance. However, due to the inherent non-Euclidean properties of graph data, it is difficult to directly apply data augmentations designed for images to graphs. Here, we divide the data augmentation strategy for graph data into four categories: feature-based, structure-based, sampling-based, and adaptive augmentation. An overview of four types of augmentations is presented in Fig. 3.

3.2.1 Feature-based Augmentation

Given an input graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a feature-based augmentation only performs transformation on the node feature matrix 𝐗\mathbf{X} or edge feature matrix 𝐗e\mathbf{X}^{e}. Without loss of generality, we take 𝐗\mathbf{X} as an example, given by

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝐀,𝒯𝐗(𝐗)\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathbf{A},\mathcal{T}_{\mathbf{X}}(\mathbf{X}) (12)

Attribute Masking. The attribute masking [25, 10, 11, 26, 27] randomly masks a small portion of attributes. We specify 𝒯𝐗(𝐗)\mathcal{T}_{\mathbf{X}}(\mathbf{X}) for the attribute masking as

𝒯𝐗(𝐗)=𝐗(1𝐋)+𝐌𝐋\mathcal{T}_{\mathbf{X}}(\mathbf{X})=\mathbf{X}\odot(1-\mathbf{L})+\mathbf{M}\odot\mathbf{L} (13)

where 𝐋\mathbf{L} is a masking location matrix where 𝐋i,j=1\mathbf{L}_{i,j}=1 if the jj-th element of node viv_{i} is masked, otherwise 𝐋i,j=0\mathbf{L}_{i,j}=0. 𝐌\mathbf{M} denotes a masking value matrix. The matrix 𝐋\mathbf{L} is usually sampled by Bernoulli distribution or assigned manually. Besides, different schemes of 𝐌\mathbf{M} result in different augmentations. For example, 𝐌=𝟎\mathbf{M}=\mathbf{0} denotes a constant masking, 𝐌N(𝟎,𝚺)\mathbf{M}\sim N(\mathbf{0},\mathbf{\Sigma}) replaces the original values by Gaussian noise and 𝐌N(𝐗,𝚺)\mathbf{M}\sim N(\mathbf{X},\mathbf{\Sigma}) adds Gaussian noise to the input.

Attribute Shuffling. The attribute shuffling [9, 28, 29, 30, 31] performs the row-wise shuffling on the attribute matrix 𝐗\mathbf{X}. That is, the augmented graph consists of the same nodes as the original graph, but they are located in different places in the graph, and receive different contextual information. We specify 𝒯𝐗(𝐗)\mathcal{T}_{\mathbf{X}}(\mathbf{X}) for the attribute shuffling as

𝒯𝐗(𝐗)=𝐗[idx,:]\mathcal{T}_{\mathbf{X}}(\mathbf{X})=\mathbf{X}[idx,:] (14)

where idxidx is a list containing numbers from 1 to N(N=|𝒱|)N(N=|\mathcal{V}|), but with a random arrangement.
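
Both feature-based augmentations reduce to a few lines of NumPy. The sketch below is a hedged illustration of Equ. 13 and Equ. 14, with the masking locations 𝐋 sampled from a Bernoulli distribution and the masking values 𝐌 set to either constants or Gaussian noise; the masking ratio and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def attribute_masking(X, p=0.1, noise_sigma=None):
    """Equ. 13: mask a random portion of entries (L ~ Bernoulli(p)).
    M = 0 gives constant masking; with noise_sigma set, masked entries get Gaussian noise."""
    L = (rng.random(X.shape) < p).astype(X.dtype)      # masking location matrix
    M = np.zeros_like(X) if noise_sigma is None else rng.normal(0.0, noise_sigma, X.shape)
    return X * (1 - L) + M * L

def attribute_shuffling(X):
    """Equ. 14: row-wise shuffling of the node feature matrix."""
    idx = rng.permutation(X.shape[0])
    return X[idx, :]

X = rng.normal(size=(5, 4))
X_masked, X_shuffled = attribute_masking(X), attribute_shuffling(X)
```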

(a) Feature-based
(b) Structure-based
(c) Sampling-based
(d) Adaptive
Figure 3: A comparison of the feature-based, structure-based, sampling-based, and adaptive augmentation. The feature-based augmentation generally randomly (or manually) masks a small portion of node or edge attributes with constants or random values. The structure-based augmentation randomly (or manually) adds or removes small portions of edges from the graph, which includes methods like edge perturbation, node insertion, and edge diffusion. The sampling-based augmentation samples nodes and their connected edges from the graph under specific rules, which include Uniform Sampling, Ego-net Sampling, Random Walk Sampling, Importance Sampling, Knowledge Sampling, etc. The adaptive augmentation adopts attention- or gradient-based schemes to perform adaptive sampling based on the learned attention scores or gradient magnitudes. The numbers in Fig. 3(d) are the importance scores of the nodes and edges, and we sample the 4 most important nodes and 3 edges as an example.

3.2.2 Structure-based Augmentation

Given a graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a structure-based augmentation only performs transformation on the adjacency matrix 𝐀\mathbf{A}, as follows

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝒯𝐀(𝐀),𝐗\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathcal{T}_{\mathbf{A}}(\mathbf{A}),\mathbf{X} (15)

Edge Perturbation. The edge perturbation [25, 14, 32, 33, 34] perturbs structural connectivity through randomly adding or removing a certain ratio of edges. We specify 𝒯𝐀(𝐀)\mathcal{T}_{\mathbf{A}}(\mathbf{A}) for the edge perturbation as

𝒯𝐀(𝐀)=𝐀(1𝐋)+(1𝐀)𝐋\mathcal{T}_{\mathbf{A}}(\mathbf{A})=\mathbf{A}\odot(1-\mathbf{L})+(1-\mathbf{A})\odot\mathbf{L} (16)

where 𝐋\mathbf{L} is a perturbation location matrix where 𝐋i,j=𝐋j,i=1\mathbf{L}_{i,j}=\mathbf{L}_{j,i}=1 if the edge between node viv_{i} and vjv_{j} will be perturbed, otherwise 𝐋i,j=𝐋j,i=0\mathbf{L}_{i,j}=\mathbf{L}_{j,i}=0. Different values in 𝐋\mathbf{L} result in different perturbation strategies, and the more values that are set to 1 in 𝐋\mathbf{L}, the more severe the perturbation is.
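
A minimal NumPy sketch of Equ. 16 is given below, where the perturbation location matrix 𝐋 is sampled symmetrically with a given probability; the perturbation ratio and the toy adjacency matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_perturbation(A, p=0.1):
    """Equ. 16: flip a random symmetric subset of entries of the adjacency matrix.
    L_{i,j} = L_{j,i} = 1 marks the node pairs whose connectivity is perturbed."""
    upper = np.triu(rng.random(A.shape) < p, k=1)       # sample the upper triangle
    L = (upper | upper.T).astype(A.dtype)               # symmetric perturbation locations
    return A * (1 - L) + (1 - A) * L                    # flip the selected entries

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_perturbed = edge_perturbation(A, p=0.3)
```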

Node Insertion. The node insertion [34] adds KK nodes 𝒱a={vN+k}k=1K\mathcal{V}_{a}=\{v_{N+k}\}_{k=1}^{K} to the node set 𝒱\mathcal{V} and adds some edges between 𝒱a\mathcal{V}_{a} and 𝒱\mathcal{V}. For a structure transformation 𝐀~=𝒯𝐀(𝐀)\widetilde{\mathbf{A}}=\mathcal{T}_{\mathbf{A}}(\mathbf{A}), we have 𝐀~:N,:N=𝐀\widetilde{\mathbf{A}}_{:N,:N}=\mathbf{A}. Given the connection ratio rr, we have

p(𝐀~i,j=𝐀~j,i=1)=r,p(𝐀~i,j=𝐀~j,i=0)=1rp(\widetilde{\mathbf{A}}_{i,j}=\widetilde{\mathbf{A}}_{j,i}=1)=r,p(\widetilde{\mathbf{A}}_{i,j}=\widetilde{\mathbf{A}}_{j,i}=0)=1-r (17)

for N+1i,jN+KN+1\leq i,j\leq N+K.

Edge Diffusion. The edge diffusion [35, 18] generates a different topological view of the original graph structure, with the general edge diffusion process defined as

𝒯𝐀(𝐀)=k=0Θk𝐒k\mathcal{T}_{\mathbf{A}}(\mathbf{A})=\sum_{k=0}^{\infty}\Theta_{k}\mathbf{S}^{k} (18)

where 𝐒N×N\mathbf{S}\in\mathbb{R}^{N\times N} is the generalized transition matrix and Θ\Theta is the weighting coefficient which satisfies k=0Θk=1,Θk[0,1]\sum_{k=0}^{\infty}\Theta_{k}=1,\Theta_{k}\in[0,1]. Two instantiations of Equ. 18 are: (1) Personalized PageRank (PPR) with 𝐒=𝐃1/2𝐀𝐃1/2\mathbf{S}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2} and Θk=α(1α)k\Theta_{k}=\alpha(1-\alpha)^{k}, and (2) Heat Kernel (HK) with 𝐒=𝐀𝐃1\mathbf{S}=\mathbf{A}\mathbf{D}^{-1} and Θk=ettk/k!\Theta_{k}=e^{-t}t^{k}/k!, where α\alpha denotes teleport probability in a random walk and tt is the diffusion time. The closed-form solutions of PPR and HK diffusion are formulated as

𝒯𝐀PPR(𝐀)=\displaystyle\mathcal{T}_{\mathbf{A}}^{PPR}(\mathbf{A})= α(𝐈n(1α)𝐃1/2𝐀𝐃1/2)1\displaystyle\alpha\left(\mathbf{I}_{n}-(1-\alpha)\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)^{-1} (19)
𝒯𝐀HK(𝐀)=\displaystyle\mathcal{T}_{\mathbf{A}}^{HK}(\mathbf{A})= exp(t𝐀𝐃1t)\displaystyle\exp\left(t\mathbf{A}\mathbf{D}^{-1}-t\right)
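
The two closed-form diffusions in Equ. 19 can be computed directly on dense matrices, as in the hedged NumPy/SciPy sketch below; real implementations typically sparsify the resulting dense matrix, and the toy graph and hyperparameters are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def ppr_diffusion(A, alpha=0.15):
    """Closed-form PPR diffusion (Equ. 19): alpha * (I - (1-alpha) D^{-1/2} A D^{-1/2})^{-1}."""
    d = np.clip(A.sum(axis=1), 1e-12, None)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    return alpha * np.linalg.inv(np.eye(A.shape[0]) - (1 - alpha) * S)

def heat_diffusion(A, t=5.0):
    """Closed-form heat-kernel diffusion (Equ. 19): exp(t A D^{-1} - t I)."""
    d = np.clip(A.sum(axis=1), 1e-12, None)
    D_inv = np.diag(1.0 / d)
    return expm(t * (A @ D_inv) - t * np.eye(A.shape[0]))

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
A_ppr, A_heat = ppr_diffusion(A), heat_diffusion(A)
```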

3.2.3 Sampling-based Augmentation

Given an input graph (𝐀,𝐗)(\mathbf{A},\mathbf{X}), a sampling-based augmentation performs transformation on both the adjacency matrix 𝐀\mathbf{A} and the feature matrix 𝐗\mathbf{X}, as follows

𝐀~,𝐗~=𝒯(𝐀,𝐗)=𝐀[𝒮,𝒮],𝐗[𝒮,:]\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}=\mathcal{T}(\mathbf{A},\mathbf{X})=\mathbf{A}[\mathcal{S},\mathcal{S}],\mathbf{X}[\mathcal{S},:] (20)

where 𝒮𝒱\mathcal{S}\subseteq\mathcal{V}, and existing methods usually apply one of five sampling strategies to obtain the node subset 𝒮\mathcal{S}: uniform sampling, ego-nets sampling, random walk sampling, importance sampling, and knowledge-based sampling.

Uniform Sampling. The uniform sampling [34] (a.k.a. Node Dropping) uniformly samples a given number of nodes 𝒮\mathcal{S} from 𝒱\mathcal{V} and removes the remaining nodes directly.

Ego-nets Sampling [11, 36, 37]. Given a typical graph encoder with LL layers, the computation of the node representation only depends on its LL-hop neighborhood. In particular, for each node viv_{i}, the transformation 𝒯()\mathcal{T}(\cdot) samples the LL-ego net surrounding node viv_{i}, with 𝒮\mathcal{S} defined as

𝒮={vjd(vi,vj)L}\mathcal{S}=\{v_{j}\mid d(v_{i},v_{j})\leq L\} (21)

where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and vjv_{j}. The Ego-nets Sampling is essentially a special version of Breadth-First Search (BFS) sampling.

Random Walk Sampling [38, 25, 18]. It starts a random walk on graph gg from the ego node viv_{i}. The walk iteratively travels to its neighborhood with the probability proportional to the edge weight. In addition, at each step, the walk returns back to the starting node viv_{i} with a positive probability α\alpha. Finally, the visited nodes are collected into a node subset 𝒮\mathcal{S}.

Importance Sampling [19]. Given a node viv_{i}, we can sample a subgraph based on the importance of its neighboring nodes, with an importance score matrix 𝐌\mathbf{M} defined as

\mathbf{M}=\alpha\cdot\left(\mathbf{I}_{n}-(1-\alpha)\cdot\mathbf{A}\mathbf{D}^{-1}\right)^{-1} (22)

where α[0,1]\alpha\in[0,1] is a hyperparameter. For a given node viv_{i}, the subgraph sampler chooses top-kk important neighbors anchored by viv_{i} to constitute a subgraph with the index of chosen nodes denoted as 𝒮=toprank(𝐌(i,:),k)\mathcal{S}=\operatorname{top}_{-}\operatorname{rank}(\mathbf{M}(i,:),k).
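
A hedged NumPy sketch of importance sampling is given below: the score matrix 𝐌 of Equ. 22 is computed in its personalized-PageRank form, and the top-kk nodes anchored by viv_{i} induce the sampled subgraph; the toy graph and hyperparameters are illustrative, not the reference implementation of [19].

```python
import numpy as np

def importance_subgraph(A, i, k, alpha=0.15):
    """Importance sampling sketch (Equ. 22): score the nodes around anchor v_i with a
    personalized-PageRank-style matrix M and keep the top-k most important nodes."""
    N = A.shape[0]
    d = np.clip(A.sum(axis=0), 1e-12, None)
    M = alpha * np.linalg.inv(np.eye(N) - (1 - alpha) * (A / d))   # A D^{-1} inside the inverse
    S = np.argsort(-M[i])[:k]                                      # indices of the top-k nodes
    return A[np.ix_(S, S)], S                                      # induced subgraph and node ids

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_sub, nodes = importance_subgraph(A, i=0, k=3)
```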

Knowledge Sampling [39]. The knowledge-based sampling incorporates domain knowledge into subgraph sampling. For example, the sampling process can be formalized as a library-based matching problem by counting the frequently occurring and bioinformatics substructures in the molecular graph and building libraries (or tables) for them.

3.2.4 Adaptive Augmentation

The adaptive augmentation usually employs attention scores or gradients to guide the selection of nodes or edges.

Attention-based. The attention-based methods typically define importance scores for nodes or edges and then augment data based on their importance. For example, GCA [40] proposes to keep important structures and attributes unchanged, while perturbing possibly unimportant edges and features. Specifically, the probability of edge removal and feature masking should be closely related to their importance. Given a node centrality measure φc():𝒱+\varphi_{c}(\cdot):\mathcal{V}\rightarrow\mathbb{R}^{+}, it defines edge centrality as the average of two adjacent nodes’ centrality scores, i.e., si,j=logφc(vi)+φc(vj)2s_{i,j}=\log\frac{\varphi_{c}(v_{i})+\varphi_{c}(v_{j})}{2}. Then, the importance of the edge ei,je_{i,j} is defined as

pi,j=min(smaxsi,jsmaxμspe,pτ)p_{i,j}=\min\left(\frac{s_{\max}-s_{i,j}}{s_{\max}-\mu_{s}}\cdot p_{e},p_{\tau}\right) (23)

where pep_{e} is a hyperparameter that controls the overall probability of removing edges, smaxs_{\max} and μs\mu_{s} are the maximum and average of {si,j}j=1N\{s_{i,j}\}_{j=1}^{N}, and pτ<1p_{\tau}<1 is a cut-off probability, used to truncate the probabilities since extremely high removal probabilities will overly corrupt graph structures. The node centrality can be defined as degree centrality, Eigenvector centrality [41], or PageRank centrality [42], which results in three variants. The attribute masking based on node importance is the same as above and will not be repeated.
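
The degree-centrality variant of Equ. 23 can be sketched in NumPy as follows; the resulting per-edge probabilities would then be used to sample the perturbation locations, and the constants here are illustrative rather than the settings used in [40].

```python
import numpy as np

def gca_edge_drop_prob(A, p_e=0.3, p_tau=0.7):
    """Sketch of Equ. 23 with degree centrality: edges between low-centrality nodes
    get higher removal probabilities, truncated at the cut-off p_tau."""
    deg = np.clip(A.sum(axis=1), 1, None)
    rows, cols = np.nonzero(np.triu(A, k=1))            # one entry per undirected edge
    s = np.log((deg[rows] + deg[cols]) / 2.0)           # edge centrality s_{i,j}
    p = np.minimum((s.max() - s) / (s.max() - s.mean() + 1e-12) * p_e, p_tau)
    return rows, cols, p                                 # removal probability per edge

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
edges_i, edges_j, drop_prob = gca_edge_drop_prob(A)
```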

Gradient-based. Unlike the simple uniform edge removal and insertion as in GRACE [26], GROC [43] adaptively performs gradient-based augmentation guided by edge gradient information. Specifically, it first applies two stochastic transformations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot) to graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}) to obtain two views, masking node attributes independently with probability r1r_{1} and r2r_{2} and then computing the contrastive loss ssl\mathcal{L}_{ssl} between these two views. For a given node viv_{i}, an edge removal candidate set is defined as

𝒮={(vi,vk)|vk𝒩i(l)}\displaystyle\mathcal{S}^{-}=\left\{(v_{i},v_{k})\Big{|}v_{k}\in\mathcal{N}_{i}^{(l)}\right\} (24)

and an edge insertion candidate set is defined as

𝒮+={(vi,vk)|vk(vm𝒩m(l)\𝒩i(l))}\displaystyle\mathcal{S}^{+}=\left\{(v_{i},v_{k})\Big{|}v_{k}\in\left(\cup_{v_{m}\in\mathcal{B}}\mathcal{N}_{m}^{(l)}\backslash\mathcal{N}_{i}^{(l)}\right)\right\} (25)

where 𝒱\mathcal{B}\subset\mathcal{V} is a node batch. 𝒮+\mathcal{S}^{+} is restricted to the set of edges (vi,vk)(v_{i},v_{k}) where viv_{i} is an anchor node, and vkv_{k} is within the ll-hop neighborhood of some other anchors vmviv_{m}\neq v_{i} but not within the ll-hop neighborhood of node viv_{i}. Finally, we backpropagate the loss ssl\mathcal{L}_{ssl} to obtain gradient intensity values for each edge in 𝒮\mathcal{S}^{-} and 𝒮+\mathcal{S}^{+}. Further gradient-based adaptive augmentation is then applied to the views by removing a subset of edges with minimal edge gradient magnitude values in 𝒮\mathcal{S}^{-} and inserting a subset of edges with maximal edge gradient magnitude values in 𝒮+\mathcal{S}^{+}.

Figure 4: A general framework for contrastive learning methods with three main modules: data augmentation strategies, pretext tasks, and contrastive objectives. Different views can be generated by a single or a combination of augmentations 𝒯1()\mathcal{T}_{1}(\cdot) and 𝒯2()\mathcal{T}_{2}(\cdot). For graph encoder fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot), the commonly used graph neural networks include GAE [44], VGAE [44], etc. However, the design of graph encoder is not the focus of graph SSL and thus beyond the scope of this survey. The two contrasting views may be local, contextual, or global, corresponding to node-level (marked in red), subgraph-level (marked in green), or graph-level (marked in yellow) information in the graph. The contrastive learning can thus contrast two views at the same or different scales, which leads to two categories of algorithm: (1) same-scale contrasting, including local-local, context-context, and global-global contrasting; and (2) cross-scale contrasting, including local-context, local-global, and context-global contrasting.

3.3 Pretext Task

Contrastive learning aims to maximize the agreement of two jointly sampled positive pairs. Depending on the definition of a graph view, the scale of the view may be local, contextual, or global, corresponding to the node-level, subgraph-level, or graph-level information in the graph. Therefore, contrastive learning may contrast two graph views at the same or different scales, which leads to two categories: (1) Contrasting with the same-scale and (2) Contrasting with the cross-scale. The two views in the same-scale contrasting, either positive or negative pairs, are at the same scale, such as node-node and graph-graph pairs, while the two views in the cross-scale contrasting have different scales, such as node-subgraph or node-graph contrasting. We categorize existing methods from these two perspectives and present them in a unified framework as shown in Fig. 4. In this section, due to space limitations, we present only some representative contrastive methods and place those relatively less important works in Appendix A.

3.3.1 Contrasting with the same-scale

The same-scale contrastive learning is further refined into three categories based on the different scales of the views: local-local, context-context, and global-global contrasting.

3.3.1.1 Global-Global Contrasting

GraphCL [25]. Four types of graph augmentations {𝒯k}k=14\{\mathcal{T}_{k}\}_{k=1}^{4} are applied to incorporate various priors: (1) Node Dropping 𝒯1()\mathcal{T}_{1}(\cdot); (2) Edge Perturbation 𝒯2()\mathcal{T}_{2}(\cdot); (3) Attribute Masking 𝒯3()\mathcal{T}_{3}(\cdot); (4) Subgraph Sampling 𝒯4()\mathcal{T}_{4}(\cdot). Given a graph gi=(𝐀i,𝐗i)𝒢g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}, it first applies a series of graph augmentations 𝒯()\mathcal{T}(\cdot) randomly selected from {𝒯k}k=14\{\mathcal{T}_{k}\}_{k=1}^{4} to generate an augmented graph g~i=(𝐀~i,𝐗~i)=𝒯(𝐀i,𝐗i)\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i}), and then learns to predict whether two graphs originate from the same graph or not. Specifically, a shared graph-level encoder fγ()f_{\gamma}(\cdot) is applied to obtain graph-level representations 𝐡gi=fγ(𝐀i,𝐗i)\mathbf{h}_{g_{i}}=f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}) and 𝐡~g~i=fγ(𝐀~i,𝐗~i)\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}=f_{\gamma}(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i}), respectively. Finally, the learning objective is defined as follows

\max_{\gamma}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right) (26)
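
In practice, the MI term in Equ. 26 is usually instantiated with a sample-based estimator such as InfoNCE/NT-Xent. The hedged PyTorch sketch below contrasts a batch of graph-level representations with their augmented counterparts and treats the other graphs in the batch as negatives; batch size, dimensions, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def graph_nt_xent(h, h_aug, tau=0.5):
    """A common InfoNCE-style estimator of Equ. 26: for each graph in a batch,
    its augmented view is the positive and all other graphs are negatives."""
    z1, z2 = F.normalize(h, dim=1), F.normalize(h_aug, dim=1)
    sim = z1 @ z2.t() / tau                              # cosine similarities / temperature
    targets = torch.arange(z1.size(0))                   # positives lie on the diagonal
    return F.cross_entropy(sim, targets)

# toy usage: graph-level representations h_{g_i} and their augmented counterparts
h_g     = torch.randn(32, 128)
h_g_aug = torch.randn(32, 128)
loss = graph_nt_xent(h_g, h_g_aug)
```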

Contrastive Self-supervised Learning (CSSL) [34] follows a very similar framework to GraphCL, differing only in the way the data is augmented. Along with node dropping, it also considers node insertion as an important augmentation strategy. Specifically, it randomly selects a strongly-connected subgraph SS, removes all edges in SS, adds a new node viv_{i}, and adds an edge between viv_{i} and each node in SS.

Label Contrastive Coding (LCC) [45] is proposed to encourage intra-class compactness and inter-class separability. To power contrastive learning, LCC introduces a dynamic label memory bank and a momentum-updated encoder. Specifically, the query graph (gq,yq)(g_{q},y_{q}) and key graph (gk,yk)(g_{k},y_{k}) are encoded by two graph-level encoders fγq()f_{\gamma_{q}}(\cdot) and fγk()f_{\gamma_{k}}(\cdot) to obtain graph-level representations 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}}, respectively. If 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}} have the same label, they are considered as the positive pair, otherwise, they are the negative pair. The label contrastive loss encourages the model to distinguish the positive pair from the negative pair. For the encoded query (gq,yq)(g_{q},y_{q}), its label contrastive loss is calculated by

maxγqlogi=1m𝕀yi=yqexp(𝐡gq𝐡gk(i)/τ)i=1mexp(𝐡gq𝐡gk(i)/τ)\max_{\gamma_{q}}\log\frac{\sum_{i=1}^{m}\mathbb{I}_{y_{i}=y_{q}}\cdot\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)}{\sum_{i=1}^{m}\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)} (27)

where mm is the size of the memory bank, τ\tau is the temperature hyperparameter, and 𝕀yi=yq\mathbb{I}_{y_{i}=y_{q}} is an indicator function to determine whether the label of the ii-th key graph gk(i)g_{k}^{(i)} in the memory bank is the same as yqy_{q}. The parameter γk\gamma_{k} of fγk()f_{\gamma_{k}}(\cdot) follows a momentum-based update mechanism as in MoCo [2], given by

γkαγk+(1α)γq\gamma_{k}\longleftarrow\alpha\gamma_{k}+(1-\alpha)\gamma_{q} (28)

where α[0,1)\alpha\in[0,1) is the momentum weight to control the speed of γk\gamma_{k} evolving.
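
The momentum update of Equ. 28 is a few lines of PyTorch, as in the hedged sketch below; it assumes the query and key encoders share the same architecture (here, placeholder linear layers).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(f_q: nn.Module, f_k: nn.Module, alpha=0.999):
    """Equ. 28: update the key encoder as an exponential moving average of the
    query encoder, gamma_k <- alpha * gamma_k + (1 - alpha) * gamma_q."""
    for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
        p_k.mul_(alpha).add_(p_q, alpha=1 - alpha)

# toy usage with two identically shaped placeholder encoders
f_q, f_k = nn.Linear(16, 8), nn.Linear(16, 8)
f_k.load_state_dict(f_q.state_dict())                   # usual initialization of the key branch
momentum_update(f_q, f_k)
```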

3.3.1.2 Context-Context Contrasting

Graph Contrastive Coding (GCC) [38] is a graph self-supervised pre-training framework that captures the universal graph topological properties across multiple graphs. Specifically, it first samples multiple subgraphs for each graph g𝒢g\in\mathcal{G} by random walk and collects them into a memory bank 𝒮\mathcal{S}. Then the query subgraph gq𝒮g_{q}\in\mathcal{S} and key subgraph gk𝒮g_{k}\in\mathcal{S} are encoded by two graph-level encoders fγq()f_{\gamma_{q}}(\cdot) and fγk()f_{\gamma_{k}}(\cdot) to obtain graph-level representations 𝐡gq\mathbf{h}_{g_{q}} and 𝐡gk\mathbf{h}_{g_{k}}, respectively. If gqg_{q} and gkg_{k} are sampled from the same graph, they are considered as the positive pair, otherwise they are the negative pair. For the encoded query (gq,yq)(g_{q},y_{q}), where yqy_{q} is the index of the graph it is sampled from, its graph contrastive loss is calculated by

maxγqlogi=1|𝒮|𝕀yi=yqexp(𝐡gq𝐡gk(i)/τ)i=1|𝒮|exp(𝐡gq𝐡gk(i)/τ)\max_{\gamma_{q}}\log\frac{\sum_{i=1}^{|\mathcal{S}|}\mathbb{I}_{y_{i}=y_{q}}\cdot\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)}{\sum_{i=1}^{|\mathcal{S}|}\exp\left(\mathbf{h}_{g_{q}}\cdot\mathbf{h}_{g_{k}}^{(i)}/\tau\right)} (29)

where 𝕀yi=yq\mathbb{I}_{y_{i}=y_{q}} is an indicator function to determine whether the ii-th key graph gk(i)g_{k}^{(i)} in the memory bank and query graph gqg_{q} are sampled from the same graph. The parameter γk\gamma_{k} of fγk()f_{\gamma_{k}}(\cdot) follows a momentum-based updating as in Equ. 28.

3.3.1.3 Local-Local Contrasting

GRACE [26]. Rather than contrasting global-global views as GraphCL [25] and CSSL [34] do, GRACE focuses on contrasting views at the node level. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first generates two augmented graphs g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}). Then it applies a shared encoder fθ()f_{\theta}(\cdot) to generate their node embedding matrices 𝐇(1)=fθ(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Finally, the pairwise objective for each positive pair (𝐡i(1),𝐡i(2))(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i}) is defined as follows

(𝐡i(1),𝐡i(2))=loge𝒟(𝐡i(1),𝐡i(2))/τe𝒟(𝐡i(1),𝐡i(2))/τ+Neg\mathcal{L}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})=\log\frac{e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})/\tau}}{e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})/\tau}+Neg} (30)

where NegNeg is defined as

Neg=k=1N𝟏ki[e𝒟(𝐡i(1),𝐡k(1))/τ+e𝒟(𝐡i(1),𝐡k(2))/τ]Neg=\sum_{k=1}^{N}\mathbf{1}_{k\neq i}\left[e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(1)}_{k})/\tau}+e^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{k})/\tau}\right] (31)

where e𝒟(𝐡i(1),𝐡k(1))/τe^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(1)}_{k})/\tau} is the intra-view negative pair and e𝒟(𝐡i(1),𝐡k(2))/τe^{\mathcal{D}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{k})/\tau} is the inter-view negative pair. The overall objective to be maximized is then defined as,

maxθ12Ni=1N[(𝐡i(1),𝐡i(2))+(𝐡i(2),𝐡i(1))]\max_{\theta}\frac{1}{2N}\sum_{i=1}^{N}\left[\mathcal{L}(\mathbf{h}^{(1)}_{i},\mathbf{h}^{(2)}_{i})+\mathcal{L}(\mathbf{h}^{(2)}_{i},\mathbf{h}^{(1)}_{i})\right] (32)
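
A hedged PyTorch sketch of Equs. 30-32 is given below, using cosine similarity as 𝒟(,)\mathcal{D}(\cdot,\cdot) and random matrices in place of the two views' encoder outputs; it illustrates the intra-view and inter-view negatives but omits the data augmentation and encoder.

```python
import torch
import torch.nn.functional as F

def grace_loss(h1, h2, tau=0.5):
    """Sketch of Equs. 30-32: symmetric node-level contrastive loss with both
    intra-view and inter-view negatives; h1, h2 are the two views' embeddings."""
    def half(a, b):
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        inter = torch.exp(a @ b.t() / tau)               # cross-view similarities
        intra = torch.exp(a @ a.t() / tau)               # same-view similarities
        pos = inter.diag()                               # positive pair (h_i^1, h_i^2)
        neg = inter.sum(1) - pos + intra.sum(1) - intra.diag()   # Equ. 31
        return -torch.log(pos / (pos + neg)).mean()
    return 0.5 * (half(h1, h2) + half(h2, h1))           # symmetric objective, Equ. 32

H1, H2 = torch.randn(100, 64), torch.randn(100, 64)      # stand-ins for f_theta on two views
loss = grace_loss(H1, H2)
```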

GCA [40] and GROC [43] adopt the same framework and objective as GRACE but with more flexible and adaptive data augmentation strategies. The framework proposed by SEPT [46] is similar to GRACE, but it is specifically designed for the specific downstream task (recommendation) by combining cross-view contrastive learning with semi-supervised tri-training. Technically, SEPT first augments the user data with the user social information, and then it builds three graph encoders upon the augmented views, with one for recommendation and the other two used to predict unlabeled users. Given a certain user, SEPT takes those nodes whose predicted labels are highly consistent with the target user as positive samples and then encourages the consistency between the target user and positive samples.

Cross-layer Contrasting (GMI) [16]. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), a graph encoder fθ()f_{\theta}(\cdot) is applied to obtain the node embedding matrix 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}). Then the Cross-layer Node Contrasting can be defined as

maxθ1Ni=1N(𝐡i,𝐱i)\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{x}_{i}\right) (33)

where the negative samples to contrast with 𝐡i\mathbf{h}_{i} are Neg(𝐡i)={𝐱jvj𝒩i}Neg(\mathbf{h}_{i})=\{\mathbf{x}_{j}\mid v_{j}\in\mathcal{N}_{i}\}. Similarly, the Cross-layer Edge Contrasting can be defined as

maxθ1Ni=1Nvj𝒩i(𝐰i,j,𝐀i,j)\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\sum_{v_{j}\in\mathcal{N}_{i}}\mathcal{MI}\left(\mathbf{w}_{i,j},\mathbf{A}_{i,j}\right) (34)

where 𝐰i,j=σ(𝐡i𝐡jT)\mathbf{w}_{i,j}=\sigma(\mathbf{h}_{i}\mathbf{h}_{j}^{T}), and the negative samples to contrast with 𝐰i,j\mathbf{w}_{i,j} are Neg(𝐰i,j)={𝐀i,kvk𝒩iandkj}Neg(\mathbf{w}_{i,j})=\{\mathbf{A}_{i,k}\mid v_{k}\in\mathcal{N}_{i}\quad\text{and}\quad k\neq j\}.

STDGI [28] extends the idea of mutual information maximization to spatial-temporal graphs. Specifically, given two graphs gt=(𝐀,𝐗(t))g_{t}=(\mathbf{A},\mathbf{X}^{(t)}) and gt+k=(𝐀,𝐗(t+k))g_{t+k}=(\mathbf{A},\mathbf{X}^{(t+k)}) at time steps tt and t+kt+k, a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain the node embedding matrix 𝐇(t)=fθ(𝐀,𝐗(t))\mathbf{H}^{(t)}=f_{\theta}(\mathbf{A},\mathbf{X}^{(t)}). Besides, it generates an augmented graph g~t+k=(𝐀,𝐗~(t+k))=𝒯(𝐀,𝐗(t+k))\widetilde{g}_{t+k}=(\mathbf{A},\widetilde{\mathbf{X}}^{(t+k)})=\mathcal{T}(\mathbf{A},\mathbf{X}^{(t+k)}) by randomly permuting the node features. Finally, the learning objective is defined as follows

maxθ1Ni=1N(𝐡i(t),𝐱i(t+k))\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\mathcal{MI}\left(\mathbf{h}^{(t)}_{i},\mathbf{x}^{(t+k)}_{i}\right) (35)

where the negative samples to contrast with 𝐡i(t)\mathbf{h}^{(t)}_{i} is Neg(𝐡i(t))=𝐱~i(t+k)Neg(\mathbf{h}^{(t)}_{i})=\widetilde{\mathbf{x}}^{(t+k)}_{i}.

BGRL [27]. Inspired by BYOL, BGRL proposes to perform self-supervised learning that does not require negative samples, thus getting rid of a potentially quadratic bottleneck. Specifically, given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first generates two augmented graph views g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}). Then it applies two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) to generate their node embedding matrices 𝐇(1)=fθ1(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ2(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Moreover, a node-level prediction head gω()g_{\omega}(\cdot) is used to output 𝐙(1)=gω(𝐇(1))\mathbf{Z}^{(1)}=g_{\omega}(\mathbf{H}^{(1)}). Finally, the learning objective is defined as follows

maxθ1,ω1Ni=1N𝐳i(1)(𝐡i(2))T𝐳i(1)𝐡i(2)\max_{\theta_{1},\omega}\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{z}^{(1)}_{i}(\mathbf{h}^{(2)}_{i})^{T}}{\|\mathbf{z}^{(1)}_{i}\|\|\mathbf{h}^{(2)}_{i}\|} (36)

where the parameter θ2\theta_{2} are updated as an exponential moving average (EMA) of parameters θ1\theta_{1}, as done in Equ. 28.
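
The following hedged PyTorch sketch shows one BGRL-style update step of Equ. 36 with linear placeholder encoders applied directly to raw features: the online branch (with the predictor gω()g_{\omega}(\cdot)) is trained by gradient descent, while the target branch is updated by the exponential moving average of Equ. 28; all modules, data, and constants are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of Equ. 36: the online encoder + predictor is trained to match the target
# encoder's embeddings via cosine similarity; no negative samples are needed.
online, target = nn.Linear(32, 16), nn.Linear(32, 16)    # placeholders for f_theta1, f_theta2
target.load_state_dict(online.state_dict())
predictor = nn.Linear(16, 16)                            # prediction head g_omega
opt = torch.optim.Adam(list(online.parameters()) + list(predictor.parameters()), lr=1e-3)

X1, X2 = torch.randn(50, 32), torch.randn(50, 32)        # node features of the two views

z1 = predictor(online(X1))                               # Z^{(1)} = g_omega(H^{(1)})
with torch.no_grad():
    h2 = target(X2)                                      # H^{(2)} from the target branch
loss = -F.cosine_similarity(z1, h2, dim=-1).mean()       # maximize Equ. 36
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                    # EMA update of theta_2 (Equ. 28)
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(0.99).add_(p_o, alpha=0.01)
```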

SelfGNN [35] differs from BGRL only in the definition of the objective function. Unlike Equ. 36, SelfGNN defines the implicit contrastive term directly in the form of MSE,

minθ1,ω1Ni=1N𝐳i(1)𝐡i(2)2\min_{\theta_{1},\omega}\frac{1}{N}\sum_{i=1}^{N}\|\mathbf{z}^{(1)}_{i}-\mathbf{h}^{(2)}_{i}\|^{2} (37)

HeCo [47]. Consider a meta-path Φk\Phi_{k} from the meta-path set {Φk}k=1K\{\Phi_{k}\}_{k=1}^{K}; if there exists a meta-path Φk\Phi_{k} between node viv_{i} and node vjv_{j}, then vjv_{j} can be considered to be in the meta-path-based neighborhood 𝒩iΦk\mathcal{N}_{i}^{\Phi_{k}} of node viv_{i}, which yields a meta-path-based adjacency matrix 𝐀Φk\mathbf{A}^{\Phi_{k}}. HeCo first applies two graph encoders fθ1sc()f_{\theta_{1}}^{sc}(\cdot) and fθ2mp()f_{\theta_{2}}^{mp}(\cdot) to obtain node embedding matrices 𝐇sc=fθ1sc(𝐀,𝐗)\mathbf{H}^{sc}=f_{\theta_{1}}^{sc}(\mathbf{A},\mathbf{X}) and 𝐇mp=fθ2mp({𝐀Φk}k=1K,𝐗)\mathbf{H}^{mp}=f_{\theta_{2}}^{mp}(\{\mathbf{A}^{\Phi_{k}}\}_{k=1}^{K},\mathbf{X}). To define positive and negative samples, HeCo first defines a function i(j)=k=1K𝕀(j𝒩iΦk)\mathbb{C}_{i}(j)=\sum_{k=1}^{K}\mathbb{I}\left(j\in\mathcal{N}_{i}^{\Phi_{k}}\right) to count the number of meta-paths connecting nodes viv_{i} and vjv_{j}. Then it constructs a set 𝒮i={jj𝒱 and i(j)0}\mathcal{S}_{i}=\left\{j\mid j\in\mathcal{V}\text{ and }\mathbb{C}_{i}(j)\neq 0\right\} and sorts it in descending order based on the value of i(j)\mathbb{C}_{i}(j). Next it selects the top TposT_{pos} nodes from 𝒮i\mathcal{S}_{i} as positive samples i\mathbb{P}_{i} and treats the rest as negative samples i\mathbb{N}_{i} directly. Finally, the learning objective can be defined as follows

maxθ1,θ21Ni=1Nlogvjie𝒟(𝐡isc,𝐡jmp)/τvk{ii}e𝒟(𝐡isc,𝐡kmp)/τ\max_{\theta_{1},\theta_{2}}\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_{v_{j}\in\mathbb{P}_{i}}e^{\mathcal{D}(\mathbf{h}^{sc}_{i},\mathbf{h}^{mp}_{j})/\tau}}{\sum_{v_{k}\in\{\mathbb{P}_{i}\cup\mathbb{N}_{i}\}}e^{\mathcal{D}(\mathbf{h}^{sc}_{i},\mathbf{h}^{mp}_{k})/\tau}} (38)

3.3.2 Contrasting with the cross-scale

Based on the different scales of the two views, we further refine cross-scale contrasting into three categories: local-global, local-context, and context-global contrasting.

3.3.2.1 Local-Global Contrasting

Deep Graph Infomax (DGI) [9] is proposed to contrast the patch representations with the corresponding high-level summary of the graph. First, it applies an augmentation transformation 𝒯()\mathcal{T}(\cdot) to obtain an augmented graph g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then it passes these two graphs through two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) to obtain node embedding matrices 𝐇~=fθ1(𝐀~,𝐗~)\widetilde{\mathbf{H}}=f_{\theta_{1}}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}) and 𝐇=fθ2(𝐀,𝐗)\mathbf{H}=f_{\theta_{2}}(\mathbf{A},\mathbf{X}), respectively. Besides, a READOUT\operatorname{READOUT} function is applied to obtain the graph-level representation 𝐡~g~=READOUT(𝐇~)\widetilde{\mathbf{h}}_{\widetilde{g}}=\operatorname{READOUT}(\widetilde{\mathbf{H}}). Finally, the learning objective is defined as follows

maxθ1,θ21Nvi𝒱(𝐡~g~,𝐡i)\max_{\theta_{1},\theta_{2}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\widetilde{\mathbf{h}}_{\widetilde{g}},\mathbf{h}_{i}\right) (39)

where 𝐡i\mathbf{h}_{i} is the node embedding of node viv_{i}, and the negative samples to contrast with 𝐡~g~\widetilde{\mathbf{h}}_{\widetilde{g}} are Neg(𝐡~g~)={𝐡j}vj𝒱,jiNeg(\widetilde{\mathbf{h}}_{\widetilde{g}})=\{\mathbf{h}_{j}\}_{v_{j}\in\mathcal{V},j\neq i}.

MVGRL [18] maximizes the mutual information between the cross-view representations of nodes and graphs. Given a g=(𝐀,𝐗)𝒢g=(\mathbf{A},\mathbf{X})\in\mathcal{G}, it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}) and then samples two subgraphs g(1)=(𝐀(1),𝐗(1))=𝒯1(𝐀,𝐗)g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)})=\mathcal{T}_{1}(\mathbf{A},\mathbf{X}) and g(2)=(𝐀(2),𝐗(2))=𝒯2(𝐀,𝐗)g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})=\mathcal{T}_{2}(\mathbf{A},\mathbf{X}) from it. Then two graph encoders fθ1()f_{\theta_{1}}(\cdot) and fθ2()f_{\theta_{2}}(\cdot) and a projection head gω1()g_{\omega_{1}}(\cdot) are applied to obtain node embedding matrices 𝐇(1)=gω1(fθ1(𝐀(1),𝐗(1)))\mathbf{H}^{(1)}=g_{\omega_{1}}(f_{\theta_{1}}(\mathbf{A}^{(1)},\mathbf{X}^{(1)})) and 𝐇(2)=gω1(fθ2(𝐀(2),𝐗(2)))\mathbf{H}^{(2)}=g_{\omega_{1}}(f_{\theta_{2}}(\mathbf{A}^{(2)},\mathbf{X}^{(2)})). In addition, a READOUT\operatorname{READOUT} function and another projection head gω2()g_{\omega_{2}}(\cdot) are used to obtain graph-level representations 𝐡g(1)=gω2(READOUT(𝐇(1)))\mathbf{h}_{g}^{(1)}=g_{\omega_{2}}(\operatorname{READOUT}(\mathbf{H}^{(1)})) and 𝐡g(2)=gω2(READOUT(𝐇(2)))\mathbf{h}_{g}^{(2)}=g_{\omega_{2}}(\operatorname{READOUT}(\mathbf{H}^{(2)})). The learning objective is defined as follows:

maxθ1,θ2,ω1,ω21Nvi𝒱[(𝐡g(1),𝐡i(2))+(𝐡g(2),𝐡i(1))]\displaystyle\max_{\theta_{1},\theta_{2},\omega_{1},\omega_{2}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left[\mathcal{MI}(\mathbf{h}_{g}^{(1)},\mathbf{h}_{i}^{(2)})+\mathcal{MI}(\mathbf{h}_{g}^{(2)},\mathbf{h}_{i}^{(1)})\right] (40)

where the negative samples to contrast with 𝐡g(1)\mathbf{h}_{g}^{(1)} are Neg(𝐡g(1))={𝐡j(2)}vj𝒱,jiNeg(\mathbf{h}_{g}^{(1)})=\{\mathbf{h}_{j}^{(2)}\}_{v_{j}\in\mathcal{V},j\neq i} and the negative samples to contrast with 𝐡g(2)\mathbf{h}_{g}^{(2)} are Neg(𝐡g(2))={𝐡j(1)}vj𝒱,jiNeg(\mathbf{h}_{g}^{(2)})=\{\mathbf{h}_{j}^{(1)}\}_{v_{j}\in\mathcal{V},j\neq i}.

3.3.2.2 Local-Context Contrasting

SUBG-CON [19] is proposed to utilize the strong correlation between central (anchor) nodes and their surrounding subgraphs to capture contextual structure information. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), SUBG-CON first picks an anchor node set 𝒮\mathcal{S} from 𝒱\mathcal{V} and then samples their context subgraphs {gi=(𝐀(i),𝐗(i))}i=1|𝒮|\{g_{i}=(\mathbf{A}^{(i)},\mathbf{X}^{(i)})\}_{i=1}^{|\mathcal{S}|} by an importance sampling strategy. Then a shared graph encoder fθ()f_{\theta}(\cdot) and a READOUT\operatorname{READOUT} function are applied to obtain node embedding matrices {𝐇(1),𝐇(2),⋯,𝐇(|𝒮|)}\{\mathbf{H}^{(1)},\mathbf{H}^{(2)},\cdots,\mathbf{H}^{(|\mathcal{S}|)}\} where 𝐇(i)=fθ(𝐀(i),𝐗(i))\mathbf{H}^{(i)}=f_{\theta}(\mathbf{A}^{(i)},\mathbf{X}^{(i)}) and graph-level representations {𝐡g1,𝐡g2,⋯,𝐡g|𝒮|}\{\mathbf{h}_{g_{1}},\mathbf{h}_{g_{2}},\cdots,\mathbf{h}_{g_{|\mathcal{S}|}}\} where 𝐡gi=READOUT(𝐇(i))\mathbf{h}_{g_{i}}=\operatorname{READOUT}(\mathbf{H}^{(i)}). Finally, the learning objective is defined as follows

maxθ1|𝒮|vi𝒮(𝐡i(i),𝐡gi)\max_{\theta}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\mathcal{MI}\left(\mathbf{h}^{(i)}_{i},\mathbf{h}_{g_{i}}\right) (41)

where 𝐡i(i)\mathbf{h}^{(i)}_{i} is the node representation of anchor node viv_{i} in the node embedding matrix 𝐇(i)\mathbf{H}^{(i)}. The negative samples to contrast with 𝐡i(i)\mathbf{h}^{(i)}_{i} are Neg(𝐡i(i))={𝐡gj}vj𝒮,jiNeg(\mathbf{h}^{(i)}_{i})=\{\mathbf{h}_{g_{j}}\}_{v_{j}\in\mathcal{S},j\neq i}.

Graph InfoClust (GIC) [48] relies on a framework similar to DGI [9]. However, in addition to contrasting local-global views, GIC also maximizes the MI between node representations and their corresponding cluster embeddings. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain node embedding matrices 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐇~=fθ(𝐀~,𝐗~)\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}}). Furthermore, an unsupervised clustering algorithm is used to group nodes into KK clusters 𝒞={C1,C2,⋯,CK}\mathcal{C}=\{C_{1},C_{2},\cdots,C_{K}\}, and the cluster centers are obtained by 𝝁k=1|Ck|∑viCk𝐡i\bm{\mu}_{k}=\frac{1}{|C_{k}|}\sum_{v_{i}\in C_{k}}\mathbf{h}_{i} where 1kK1\leq k\leq K. To compute the cluster embedding 𝐳i\mathbf{z}_{i} for each node viv_{i}, it applies a weighted average of the summaries of the cluster centers to which node viv_{i} belongs, 𝐳i=σ(∑k=1Krik𝝁k)\mathbf{z}_{i}=\sigma\left(\sum_{k=1}^{K}r_{ik}\bm{\mu}_{k}\right), where rikr_{ik} is the probability that node viv_{i} is assigned to cluster kk, i.e., a soft-assignment value with ∑krik=1,∀i\sum_{k}r_{ik}=1,\forall i. For example, rikr_{ik} can be defined as rik=exp(𝐡i𝝁kT)∑j=1Kexp(𝐡i𝝁jT)r_{ik}=\frac{\exp(\mathbf{h}_{i}\bm{\mu}_{k}^{T})}{\sum_{j=1}^{K}\exp(\mathbf{h}_{i}\bm{\mu}_{j}^{T})}. Finally, the learning objective is defined as follows

maxθ1Nvi𝒱(𝐡i,𝐳i)\max_{\theta}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{z}_{i}\right) (42)

where the negative samples to contrast with 𝐡i\mathbf{h}_{i} are Neg(𝐡i)={𝐳j}vj𝒱,jiNeg(\mathbf{h}_{i})=\{\mathbf{z}_{j}\}_{v_{j}\in\mathcal{V},j\neq i}.
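The soft assignments and cluster embeddings used in Equ. 42 can be computed as in the following sketch, assuming σ is a sigmoid nonlinearity and the cluster centers μ are provided by an external clustering step.

import torch

def cluster_embeddings(h: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # h: [N, d] node embeddings, mu: [K, d] cluster centers.
    # r_{ik} = softmax_k(h_i . mu_k^T); z_i = sigma(sum_k r_{ik} mu_k)
    r = torch.softmax(h @ mu.t(), dim=1)   # [N, K] soft assignments, each row sums to 1
    z = torch.sigmoid(r @ mu)              # [N, d] cluster embeddings
    return z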

3.3.2.3 Context-Global Contrasting

MICRO-Graph [39]. The key challenge in conducting subgraph-level contrasting is to sample semantically informative subgraphs. For molecular graphs, the graph motifs, which are frequently-occurring subgraph patterns (e.g., functional groups), can be exploited for better subgraph sampling. Specifically, motif learning is formulated as a differentiable clustering problem, and EM-clustering is adopted to group significant subgraphs into several motifs, thus obtaining a motif table. Given two graphs g(1)=(𝐀(1),𝐗(1)),g(2)=(𝐀(2),𝐗(2))𝒢g^{(1)}=(\mathbf{A}^{(1)},\mathbf{X}^{(1)}),g^{(2)}=(\mathbf{A}^{(2)},\mathbf{X}^{(2)})\in\mathcal{G}, it first applies a shared graph encoder fθ()f_{\theta}(\cdot) to learn their node embedding matrices 𝐇(1)=fθ(𝐀(1),𝐗(1))\mathbf{H}^{(1)}=f_{\theta}(\mathbf{A}^{(1)},\mathbf{X}^{(1)}) and 𝐇(2)=fθ(𝐀(2),𝐗(2))\mathbf{H}^{(2)}=f_{\theta}(\mathbf{A}^{(2)},\mathbf{X}^{(2)}). Then it leverages the learned motif table to sample KK motif-like subgraphs from g(1)g^{(1)} and g(2)g^{(2)} and obtains their corresponding embedding matrices {𝐇1(1),𝐇2(1),⋯,𝐇K(1)}\{\mathbf{H}^{(1)}_{1},\mathbf{H}^{(1)}_{2},\cdots,\mathbf{H}^{(1)}_{K}\} and {𝐇1(2),𝐇2(2),⋯,𝐇K(2)}\{\mathbf{H}^{(2)}_{1},\mathbf{H}^{(2)}_{2},\cdots,\mathbf{H}^{(2)}_{K}\}. Then a READOUT\operatorname{READOUT} function is applied to obtain graph-level and subgraph-level representations, denoted as 𝐡g(1)\mathbf{h}^{(1)}_{g}, {𝐡1(1),𝐡2(1),⋯,𝐡K(1)}\{\mathbf{h}^{(1)}_{1},\mathbf{h}^{(1)}_{2},\cdots,\mathbf{h}^{(1)}_{K}\} and 𝐡g(2)\mathbf{h}^{(2)}_{g}, {𝐡1(2),𝐡2(2),⋯,𝐡K(2)}\{\mathbf{h}^{(2)}_{1},\mathbf{h}^{(2)}_{2},\cdots,\mathbf{h}^{(2)}_{K}\}. Finally, the objective is defined as

maxθ1|𝒢|g𝒢k=1K[(𝐡g(1),𝐡k(1))+(𝐡g(2),𝐡k(2))]\displaystyle\max_{\theta}\frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\sum_{k=1}^{K}\left[\mathcal{MI}(\mathbf{h}_{g}^{(1)},\mathbf{h}_{k}^{(1)})+\mathcal{MI}(\mathbf{h}_{g}^{(2)},\mathbf{h}_{k}^{(2)})\right] (43)

where the negative samples to contrast with 𝐡g(1)\mathbf{h}_{g}^{(1)} are Neg(𝐡g(1))={𝐡j(2)}j=1KNeg(\mathbf{h}_{g}^{(1)})=\{\mathbf{h}_{j}^{(2)}\}_{j=1}^{K} and the negative samples to contrast with 𝐡g(2)\mathbf{h}_{g}^{(2)} are Neg(𝐡g(2))={𝐡j(1)}j=1KNeg(\mathbf{h}_{g}^{(2)})=\{\mathbf{h}_{j}^{(1)}\}_{j=1}^{K}.

InfoGraph [49] aims to obtain embeddings at the whole-graph level for self-supervised learning. Given a graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies an augmentation to obtain g~=(𝐀~,𝐗~)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared LL-layer graph encoder fθ()f_{\theta}(\cdot) is applied to learn the node embedding matrix sequences {𝐇(l)}l=1L\{\mathbf{H}^{(l)}\}_{l=1}^{L} and {𝐇~(l)}l=1L\{\widetilde{\mathbf{H}}^{(l)}\}_{l=1}^{L} obtained from each layer. Then it concatenates the representations learned from each layer, 𝐡i=CONCAT({𝐡i(l)}l=1L)\mathbf{h}_{i}=\text{CONCAT}(\{\mathbf{h}^{(l)}_{i}\}_{l=1}^{L}) and 𝐡~i=CONCAT({𝐡~i(l)}l=1L)\widetilde{\mathbf{h}}_{i}=\text{CONCAT}(\{\widetilde{\mathbf{h}}^{(l)}_{i}\}_{l=1}^{L}), where 𝐡i(l)\mathbf{h}^{(l)}_{i} is the embedding of node viv_{i} in the node embedding matrix 𝐇(l)\mathbf{H}^{(l)} obtained from the ll-th layer of the graph encoder. In addition, a READOUT\operatorname{READOUT} function is used to obtain the graph-level representation 𝐡g=READOUT({𝐡i}i=1N)\mathbf{h}_{g}=\operatorname{READOUT}(\{\mathbf{h}_{i}\}_{i=1}^{N}). Finally, the learning objective is defined as follows

maxθg𝒢1|g|vig(𝐡g,𝐡i)\max_{\theta}\sum_{g\in\mathcal{G}}\frac{1}{|g|}\sum_{v_{i}\in g}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i}\right) (44)

where the negative samples to contrast with 𝐡g\mathbf{h}_{g} are the node representations of the augmented graph, Neg(𝐡g)={𝐡~i}vi𝒱Neg(\mathbf{h}_{g})=\{\widetilde{\mathbf{h}}_{i}\}_{v_{i}\in\mathcal{V}}.

BiGi [37] is specifically designed for bipartite graphs, where the class label yi{0,1}y_{i}\in\{0,1\} of each node viv_{i} is already known. For a given g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), it first applies a structure-based augmentation to obtain g~=(𝐀~,𝐗)=𝒯(𝐀,𝐗)\widetilde{g}=(\widetilde{\mathbf{A}},\mathbf{X})=\mathcal{T}(\mathbf{A},\mathbf{X}). Then a shared graph encoder fθ()f_{\theta}(\cdot) is applied to obtain 𝐇=fθ(𝐀,𝐗)\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X}) and 𝐇~=fθ(𝐀~,𝐗)\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\mathbf{X}). Besides, it can obtain the graph-level representation from 𝐇\mathbf{H} directly as follows

\mathbf{h}_{g}=\Big[\sigma\Big(\frac{1}{|\mathcal{V}^{(1)}|}\sum_{v_{i}\in\mathcal{V}^{(1)}}\mathbf{h}_{i}\Big)\Big\|\sigma\Big(\frac{1}{|\mathcal{V}^{(2)}|}\sum_{v_{i}\in\mathcal{V}^{(2)}}\mathbf{h}_{i}\Big)\Big] (45)

where 𝒱(1)={vi|vi𝒱,yi=0}\mathcal{V}^{(1)}=\{v_{i}|v_{i}\in\mathcal{V},y_{i}=0\} and 𝒱(2)={vi|vi𝒱,yi=1}\mathcal{V}^{(2)}=\{v_{i}|v_{i}\in\mathcal{V},y_{i}=1\}. For a given edge (vi,vj)(v_{i},v_{j})\in\mathcal{E}, it first performs ego-net sampling to obtain two subgraphs (centered at nodes viv_{i} and vjv_{j}), and then gets their node feature matrices 𝐇(i)\mathbf{H}^{(i)} and 𝐇(j)\mathbf{H}^{(j)} from 𝐇\mathbf{H} directly. Then a subgraph-level attention module (similar to GAT) is applied to obtain two subgraph-level representations 𝐡i=Attγ(𝐇(i))\mathbf{h}_{i}=Att_{\gamma}(\mathbf{H}^{(i)}) and 𝐡j=Attγ(𝐇(j))\mathbf{h}_{j}=Att_{\gamma}(\mathbf{H}^{(j)}). Finally, 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} are fused to obtain 𝐡i,j\mathbf{h}_{i,j} = [𝐡i𝐡j][\mathbf{h}_{i}\|\mathbf{h}_{j}]. Similarly, it can obtain the fused representation 𝐡~i,j\widetilde{\mathbf{h}}_{i,j} from 𝐇~\widetilde{\mathbf{H}}. Finally, the learning objective is defined as follows:

maxθ,γ1||(vi,vj)(𝐡g,𝐡i,j)\max_{\theta,\gamma}\frac{1}{|\mathcal{E}|}\sum_{(v_{i},v_{j})\in\mathcal{E}}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i,j}\right) (46)

where the negative sample is defined as Neg(𝐡g)=𝐡~i,jNeg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}_{i,j}.

3.4 Contrastive Objectives

The main way to optimize contrastive learning is to treat two representations (views) 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} as random variables and maximize their mutual information, given by

(𝐡i,𝐡j)=𝔼p(𝐡i,𝐡j)[logp(𝐡i,𝐡j)p(𝐡i)p(𝐡j)]\mathcal{MI}(\mathbf{h}_{i},\mathbf{h}_{j})=\mathbb{E}_{p(\mathbf{h}_{i},\mathbf{h}_{j})}\left[\log\frac{p(\mathbf{h}_{i},\mathbf{h}_{j})}{p(\mathbf{h}_{i})p(\mathbf{h}_{j})}\right] (47)

To computationally estimate the mutual information in contrastive learning, three lower-bound forms of the mutual information are derived, and the mutual information is then maximized indirectly by maximizing these lower bounds.

Donsker-Varadhan Estimator [50] is one lower bound of the mutual information, defined as

DV(𝐡i,𝐡j)=\displaystyle\mathcal{MI}_{DV}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)= 𝔼p(𝐡i,𝐡j)[𝒟(𝐡i,𝐡j)]\displaystyle\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\left[\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right] (48)
log𝔼p(𝐡i)p(𝐡j)[e𝒟(𝐡i,𝐡j)]\displaystyle-\log\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\left[e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\right]

where p(𝐡i,𝐡j)p(\mathbf{h}_{i},\mathbf{h}_{j}) denotes the joint distribution of two representations 𝐡i\mathbf{h}_{i}, 𝐡j\mathbf{h}_{j}, and p(𝐡i)p(𝐡j)p(\mathbf{h}_{i})p(\mathbf{h}_{j}) denotes the product of marginals. 𝒟:q×q\mathcal{D}:\mathbb{R}^{q}\times\mathbb{R}^{q}\rightarrow\mathbb{R} is a discriminator that maps two views 𝐡i\mathbf{h}_{i}, 𝐡j\mathbf{h}_{j} to an agreement score. Generally, the discriminator 𝒟\mathcal{D} can optionally apply an additional prediction head gω()g_{\omega}(\cdot) to map 𝐡i\mathbf{h}_{i} to 𝐳i=gω(𝐡i)\mathbf{z}_{i}=g_{\omega}(\mathbf{h}_{i}) before computing agreement scores, where gω()g_{\omega}(\cdot) can be a linear mapping, a nonlinear mapping (e.g., MLP), or even a non-parametric identity mapping (𝐳i\mathbf{z}_{i} = 𝐡i\mathbf{h}_{i}). The discriminator 𝒟\mathcal{D} can take various forms, e.g., the standard inner product 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\mathbf{z}_{i}^{T}\mathbf{z}_{j}, the inner product 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j/τ\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\mathbf{z}_{i}^{T}\mathbf{z}_{j}/\tau with temperature parameter τ\tau, the cosine similarity 𝒟(𝐳i,𝐳j)=𝐳iT𝐳j𝐳i𝐳j\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\frac{\mathbf{z}_{i}^{T}\mathbf{z}_{j}}{||\mathbf{z}_{i}||||\mathbf{z}_{j}||}, or the Gaussian similarity 𝒟(𝐳i,𝐳j)=exp(𝐳i𝐳j222σ2)\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})=\exp\left(-\frac{||\mathbf{z}_{i}-\mathbf{z}_{j}||_{2}^{2}}{2\sigma^{2}}\right).
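For reference, the discriminator variants listed above can be written in a few lines of PyTorch; z_i and z_j are batched embedding tensors, and tau and sigma are hyperparameters.

import torch
import torch.nn.functional as F

def inner_product(z_i, z_j):
    return (z_i * z_j).sum(dim=-1)                 # z_i^T z_j

def scaled_inner_product(z_i, z_j, tau=0.5):
    return (z_i * z_j).sum(dim=-1) / tau           # z_i^T z_j / tau

def cosine(z_i, z_j):
    return F.cosine_similarity(z_i, z_j, dim=-1)   # z_i^T z_j / (||z_i|| ||z_j||)

def gaussian(z_i, z_j, sigma=1.0):
    return torch.exp(-((z_i - z_j) ** 2).sum(dim=-1) / (2 * sigma ** 2))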

Jensen-Shannon Estimator. Replacing the KL-divergence in Equ. 47 with the JS-divergence yields the Jensen-Shannon (JS) estimator [51], which can estimate and optimize the mutual information more efficiently. The JS estimator is defined as

\mathcal{MI}_{JS}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\bigg[\log\Big(\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg] (49)
+\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\bigg[\log\Big(1-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg]

Let 𝒟(𝐡i,𝐡j)=sigmoid(𝒟′(𝐡i,𝐡j))\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\text{sigmoid}\left(\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right); then Equ. 49 can be rewritten as a softplus (SP) version [18, 49], as follows

\mathcal{MI}_{SP}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\bigg[-sp\Big(-\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg] (50)
-\mathbb{E}_{p\left(\mathbf{h}_{i}\right)p\left(\mathbf{h}_{j}\right)}\bigg[sp\Big(\mathcal{D}^{\prime}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\Big)\bigg]

where sp(x)=log(1+ex)sp(x)=\log\left(1+e^{x}\right).

InfoNCE Estimator. InfoNCE [52] is one of the most popular lower bounds of the mutual information, defined as

NCE\displaystyle\mathcal{MI}_{NCE} (𝐡i,𝐡j)=𝔼p(𝐡i,𝐡j)[𝒟(𝐡i,𝐡j)\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{p\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}\Bigg{[}\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right) (51)
𝔼K𝒫N[log1N𝐡jKe𝒟(𝐡i,𝐡j)]]\displaystyle-\mathbb{E}_{K\sim\mathcal{P}^{N}}\Big{[}\log\frac{1}{N}\sum_{\mathbf{h}_{j}^{\prime}\in K}e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)}\Big{]}\Bigg{]}

where KK consists of NN random variables sampled identically and independently from the distribution 𝒫\mathcal{P}. The NT-Xent loss [53] is a special version of the InfoNCE loss, which defines the discriminator 𝒟\mathcal{D} as 𝒟(𝐡i,𝐡j)=𝐡iT𝐡j/τ\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})=\mathbf{h}_{i}^{T}\mathbf{h}_{j}/\tau with temperature parameter τ\tau. Taking graph classification as an example, fγ()f_{\gamma}(\cdot) is a graph encoder that maps a graph g=(𝐀,𝐗)𝒢g=(\mathbf{A},\mathbf{X})\in\mathcal{G} to a graph-level representation 𝐡g=fγ(𝐀,𝐗)\mathbf{h}_{g}=f_{\gamma}(\mathbf{A},\mathbf{X}). In practice, InfoNCE is computed on a mini-batch 𝑩\boldsymbol{B} of size N+1N+1; then Equ. 51 can be rewritten (with logN\log N discarded) as

NCE=1N+1(𝐀,𝐗)𝑩loge𝒟(𝐡i,𝐡j)(𝐀,𝐗)𝑩e𝒟(𝐡i,𝐡j)\displaystyle\mathcal{MI}_{NCE}=-\frac{1}{N+1}\sum_{(\mathbf{A},\mathbf{X})\in\boldsymbol{B}}\log\frac{e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}}{\sum_{(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\in\boldsymbol{B}}e^{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)}} (52)

where 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j} are the positive pair of views that come from the same graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}), and 𝐡i\mathbf{h}_{i} and 𝐡j\mathbf{h}_{j}^{\prime} are the negative pair of views that are computed from g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}) and g=(𝐀,𝐗)g^{\prime}=(\mathbf{A}^{\prime},\mathbf{X}^{\prime}) identically and independently.
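A minimal in-batch implementation of the NT-Xent form of Equ. 52 is sketched below, assuming h1[k] and h2[k] are the two views of the k-th sample in the batch and a cosine-similarity discriminator with temperature tau; all other rows of the second view serve as negatives.

import torch
import torch.nn.functional as F

def nt_xent(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # h1, h2: [B, d] embeddings of two views; positives lie on the diagonal of the score matrix.
    z1, z2 = F.normalize(h1, dim=1), F.normalize(h2, dim=1)
    logits = z1 @ z2.t() / tau                       # D(h_i, h_j) = cosine similarity / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    # cross_entropy with diagonal labels realizes -log e^{D(i,i)} / sum_j e^{D(i,j)} as in Equ. 52
    return F.cross_entropy(logits, labels)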

Triplet Margin Loss. While all three MI estimators above estimate lower bounds of the mutual information, mutual information maximization has been shown not to be a must for contrastive learning [54]. For example, the triplet margin loss [55] can also be used to optimize contrastive learning, but it is not an MI-based contrastive objective, and optimizing it does not guarantee the maximization of mutual information. The triplet margin loss is defined as

triplet\displaystyle\mathcal{L}_{triplet} (𝐡i,𝐡j)=𝔼[(𝐀,𝐗),(𝐀,𝐗)]𝒢×𝒢[\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big{[} (53)
max{𝒟(𝐡i,𝐡j)𝒟(𝐡i,𝐡j)+ϵ,0}]\displaystyle\max\{\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)+\epsilon,0\}\Big{]}

where the triplet margin loss does not directly minimize the agreement of the negative sample pair 𝒟(𝐡i,𝐡j)\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}), but only ensures that the agreement of the negative sample pair is smaller than that of the positive sample pair by a margin value ϵ\epsilon. The idea behind this is that when the negative samples are sufficiently far apart, i.e., the agreement between them is small enough, there is no need to further reduce their agreement, which helps to focus the training more on those samples that are hard to distinguish. The quadruplet loss [56] further considers imposing constraints on inter-class samples on top of the triplet margin loss, defined as:

\mathcal{L}_{Quadruplet}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathcal{L}_{triplet}+\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big[ (54)
\max\{\mathcal{D}\left(\mathbf{h}_{i}^{\prime},\mathbf{h}_{j}^{\prime}\right)-\mathcal{D}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)+\epsilon^{\prime},0\}\Big]

where ϵ\epsilon^{\prime} is a smaller margin value than ϵ\epsilon. The quadruplet loss differs from the triplet margin loss in that it not only uses an anchor-based sampling strategy but also samples negative samples in a more random way, which helps to learn more distinguishable inter-class boundaries.
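A minimal sketch of the triplet margin loss of Equ. 53 with a cosine-similarity discriminator is given below; h_anchor, h_pos, and h_neg are batched anchor, positive, and negative embeddings, and eps is the margin. The quadruplet variant would add an analogous hinge term over randomly sampled negative pairs with the smaller margin eps'.

import torch
import torch.nn.functional as F

def triplet_margin(h_anchor, h_pos, h_neg, eps: float = 0.5) -> torch.Tensor:
    # max{ D(anchor, neg) - D(anchor, pos) + eps, 0 }, with D taken as cosine similarity
    d_pos = F.cosine_similarity(h_anchor, h_pos, dim=-1)
    d_neg = F.cosine_similarity(h_anchor, h_neg, dim=-1)
    return torch.clamp(d_neg - d_pos + eps, min=0).mean()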

RankMI Loss. While both triplet margin loss and quadruplet loss ignore the lower bound of the mutual information, the RankMI Loss [57] seamlessly incorporates information-theoretic approaches into the representation learning and maximizes the mutual information among samples belonging to the same category, defined as:

RankMI\displaystyle\mathcal{MI}_{RankMI} (𝐡i,𝐡j)=𝔼[(𝐀,𝐗),(𝐀,𝐗)]𝒢×𝒢[\displaystyle\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)=\mathbb{E}_{\left[(\mathbf{A},\mathbf{X}),(\mathbf{A}^{\prime},\mathbf{X}^{\prime})\right]\sim\mathcal{G}\times\mathcal{G}}\Big{[} (55)
𝒟(𝐡i,𝐡j)+log(2e𝒟(𝐡i,𝐡j))]\displaystyle\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})+\log(2-e^{\mathcal{D}(\mathbf{h}_{i}^{\prime},\mathbf{h}_{j}^{\prime})})\Big{]}

As RankMI can incorporate margins based on random positive and negative pairs, the quadruplet loss can be considered as a special case of RankMI with a fixed margin.

4 Generative Learning

Compared with contrastive methods, the generative methods shown in Fig. 1(b) are based on generative models and treat the rich information embedded in the data as natural self-supervision. In generative methods, the prediction head gω()g_{\omega}(\cdot) is usually called the graph decoder, which is used to perform graph reconstruction. Categorized by how the reconstruction is performed, we summarize generative methods into two categories: (1) graph autoencoding, which performs reconstruction in a once-for-all manner; (2) graph autoregressive, which performs reconstruction iteratively. In this section, due to space limitations, we present only some representative generative methods and place those relatively less important works in Appendix B.

4.1 Graph Autoencoding

Since the autoencoder [58] was proposed, it has been widely used as a basic architecture for a variety of image and text data. Given restricted access to the graph, the graph autoencoder is trained to reconstruct certain parts of the input data. Depending on which parts of the input graph are given or restricted, various pretext tasks have been proposed, which will be reviewed one by one next.

Graph Completion [12]. Motivated by the success of image inpainting, graph completion is proposed as a pretext task for graph data. It first masks one node by removing part of its features, and then aims to reconstruct the masked features from the unmasked node features in the neighborhood. For a given node viv_{i}, it randomly masks its features 𝐱i\mathbf{x}_{i} with 𝐱^i=𝐱i𝐦i\widehat{\mathbf{x}}_{i}=\mathbf{x}_{i}\odot\mathbf{m}_{i} to obtain a new node feature matrix 𝐗^\widehat{\mathbf{X}}, and then reconstructs the masked features. More formally,

ssl(θ,𝐀,𝐗^)=fθ(𝐀,𝐗^)vi𝐱i2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} (56)

Here, it just takes one node as an example, and the reconstruction of multiple nodes can be considered in practice. Note that only those unmasked neighborhood nodes can be used to reconstruct the target node for graph completion.

Node Attribute Masking [10] is similar to Graph Completion, but it reconstructs the features of multiple nodes simultaneously, and it no longer requires that the neighboring node features used for reconstruction must be unmasked.
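The following sketch illustrates the masking-and-reconstruction recipe shared by Graph Completion and Node Attribute Masking (Equ. 56); `encoder` is assumed to be a GNN whose output lives in the node feature space, and mask_rate is a hypothetical hyperparameter.

import torch

def masked_feature_loss(encoder, A: torch.Tensor, X: torch.Tensor, mask_rate: float = 0.15):
    # Randomly zero out the features of a subset of nodes and reconstruct them.
    n = X.size(0)
    masked = torch.rand(n) < mask_rate          # nodes whose features are masked
    X_hat = X.clone()
    X_hat[masked] = 0.0
    X_rec = encoder(A, X_hat)                   # f_theta(A, X_hat), same dimensionality as X
    return ((X_rec[masked] - X[masked]) ** 2).mean()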

Edge Attribute Masking [11]. This pretext task is specifically designed for graph data with known edge features, and it enables the GNN to learn more edge relation information. Similarly, it first randomly masks the features of an edge set e\mathcal{M}_{e}. Specifically, it obtains a masked edge feature matrix 𝐗^e\widehat{\mathbf{X}}^{e} where 𝐱^i,je=𝐱i,je𝐦i,j\widehat{\mathbf{x}}^{e}_{i,j}=\mathbf{x}^{e}_{i,j}\odot\mathbf{m}_{i,j} for (vi,vj)e(v_{i},v_{j})\in\mathcal{M}_{e}. More formally,

ssl(θ,𝐀,𝐗,𝐗^e)=1|e|(vi,vj)e𝐱¯i,je𝐱i,je2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X},\widehat{\mathbf{X}}^{e}\right)=\frac{1}{|\mathcal{M}_{e}|}\sum_{(v_{i},v_{j})\in\mathcal{M}_{e}}\left\|\overline{\mathbf{x}}^{e}_{i,j}-\mathbf{x}^{e}_{i,j}\right\|^{2} (57)

where 𝐱¯i,je=(𝐗¯e)i,j\overline{\mathbf{x}}^{e}_{i,j}=(\overline{\mathbf{X}}^{e})_{i,j} and 𝐗¯e=fθ(𝐀,𝐗,𝐗^e)\overline{\mathbf{X}}^{e}=f_{\theta}(\mathbf{A},\mathbf{X},\widehat{\mathbf{X}}^{e}).

Node Attribute Denoising [13]. Different from Node Attribute Masking, this pretext task adds noise to the node features to obtain a noisy node feature matrix 𝐗^=𝐗+N(𝟎,𝚺)\widehat{\mathbf{X}}=\mathbf{X}+N(\mathbf{0},\mathbf{\Sigma}), and then asks the model to reconstruct the clean node features 𝐗\mathbf{X}. More formally,

ssl(θ,𝐀,𝐗^)=1Nvi𝒱fθ(𝐀,𝐗^)vi𝐱i2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} (58)

Adding noise is only one means of corruption; in the image domain, others include blurring, graying, etc. Inspired by this, arbitrary corruption operations 𝒞()\mathcal{C}(\cdot) can be applied to obtain the corrupted features, and the model is then forced to reconstruct them. Different from Node Attribute Denoising, which reconstructs raw features from noisy inputs, Node Embedding Denoising aims to reconstruct clean node features 𝐗\mathbf{X} from noisy embeddings 𝐇^=𝐇+N(𝟎,𝚺)\widehat{\mathbf{H}}=\mathbf{H}+N(\mathbf{0},\mathbf{\Sigma}).

Adjacency Matrix Reconstruction [14]. The graph adjacency matrix is one of the most important sources of information in graph data, as it stores the graph structure and the relations between nodes. This pretext task randomly perturbs part of the edges in the adjacency matrix 𝐀\mathbf{A} to obtain 𝐀^\widehat{\mathbf{A}}, and then requires the model to reconstruct the adjacency matrix of the input graph. More formally,

ssl(θ,𝐀^,𝐗)=1N2i,j(𝐀¯i,j𝐀i,j)2\mathcal{L}_{ssl}\left(\theta,\widehat{\mathbf{A}},\mathbf{X}\right)=\frac{1}{N^{2}}\sum_{i,j}\left(\overline{\mathbf{A}}_{i,j}-\mathbf{A}_{i,j}\right)^{2} (59)

where 𝐀¯=fθ(𝐀^,𝐗)\overline{\mathbf{A}}=f_{\theta}(\widehat{\mathbf{A}},\mathbf{X}). During training, since the adjacency matrix 𝐀\mathbf{A} is usually sparse, cross-entropy can also be used instead of MSE as the loss in practice.
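A minimal sketch of adjacency reconstruction with a cross-entropy loss is given below; it uses a GAE-style inner-product decoder on the node embeddings, which is one common choice rather than necessarily the exact decoder of [14].

import torch
import torch.nn.functional as F

def adj_reconstruction_loss(encoder, A_hat: torch.Tensor, X: torch.Tensor, A: torch.Tensor):
    # Encode the perturbed graph, decode adjacency scores by an inner product, and
    # compare against the original adjacency matrix with binary cross-entropy.
    H = encoder(A_hat, X)                               # f_theta(A_hat, X): [N, d]
    logits = H @ H.t()                                  # reconstructed adjacency scores [N, N]
    return F.binary_cross_entropy_with_logits(logits, A.float())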

Figure 5: A comparison of predictive learning methods: (a) Node Property Prediction, (b) Context-based Prediction, (c) Self-training, and (d) Domain Knowledge-based Prediction. Categorized by how the labels are obtained, we summarize predictive methods for graph data into these four categories. Fig. (a): node property prediction pre-calculates node properties, such as node degree, and uses them as self-supervised labels. Fig. (b): for context-based prediction, the local or global contextual information in the graph, such as the shortest path length between nodes, can be extracted as labels to help with self-supervised learning. Fig. (c): the self-training method applies algorithms such as unsupervised clustering to obtain pseudo-labels and then updates the pseudo-label set of the previous stage based on the prediction results or losses. Fig. (d): for domain knowledge-based prediction, domain knowledge, such as expert knowledge or specialized tools, can be used in advance to obtain informative labels.

4.2 Graph Autoregressive

The classical autoregressive model is a regression model that represents a random variable at a later moment as a combination of random variables from previous moments.

GPT-GNN [32]. In recent years, the idea of GPT [6] has also been introduced into the GNN domain. For example, GPT-GNN proposes an autoregressive framework to perform node and edge reconstruction on a given graph iteratively. Given a graph gt=(𝐀t,𝐗t)g_{t}=(\mathbf{A}_{t},\mathbf{X}_{t}) with its nodes and edges randomly masked in iteration tt, GPT-GNN generates one masked node XiX_{i} and its connected edges EiE_{i} to obtain an updated graph gt+1=(𝐀t+1,𝐗t+1)g_{t+1}=(\mathbf{A}_{t+1},\mathbf{X}_{t+1}) and optimizes the likelihood of the node and edge generation in the next iteration t+1t+1, with the learning objective defined as

pθ(𝐗t+1,𝐀t+1𝐗t,𝐀t)\displaystyle p_{\theta}\left(\mathbf{X}_{t+1},\mathbf{A}_{t+1}\mid\mathbf{X}_{t},\mathbf{A}_{t}\right) (60)
=\displaystyle= opθ(Xi,Ei¬oEio,𝐗t,𝐀t)pθ(Eio𝐗t,𝐀t)\displaystyle\sum_{o}p_{\theta}\left(X_{i},E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)\cdot p_{\theta}\left(E_{i}^{o}\mid\mathbf{X}_{t},\mathbf{A}_{t}\right)
=\displaystyle= 𝔼o[pθ(Xi,Ei¬oEio,𝐗t,𝐀t)]\displaystyle\mathbb{E}_{o}\big{[}p_{\theta}\left(X_{i},E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)\big{]}
=\displaystyle= 𝔼o[pθ(𝐗t+1Eio,𝐗t,𝐀t)pθ(Ei¬oEio,𝐗t+1,𝐀t)]\displaystyle\mathbb{E}_{o}\left[p_{\theta}\big{(}\mathbf{X}_{t+1}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right)p_{\theta}(E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t+1},\mathbf{A}_{t})\big{]}

where oo is a variable denoting the index vector of all edges within EiE_{i} in iteration tt. Thus, EioE_{i}^{o} denotes the observed edges in iteration tt, and Ei¬oE_{i}^{\neg o} denotes the masked edges (to be generated) in iteration t+1t+1. Finally, the graph generation process is factorized into a node attribute generation step pθ(𝐗t+1Eio,𝐗t,𝐀t)p_{\theta}\left(\mathbf{X}_{t+1}\mid E_{i}^{o},\mathbf{X}_{t},\mathbf{A}_{t}\right) and an edge generation step pθ(Ei¬oEio,𝐗t+1,𝐀t)p_{\theta}(E_{i}^{\neg o}\mid E_{i}^{o},\mathbf{X}_{t+1},\mathbf{A}_{t}). In practice, GPT-GNN performs node and edge generation iteratively.

5 Predictive Learning

The contrastive methods deal with the inter-data information (data-data pairs), the generative methods focus on the intra-data information, while the predictive methods aim to self-generate informative labels from the data as supervision and handle the data-label relationships. Categorized by how labels are obtained, we summarize predictive methods into four categories: (1) Node Property Prediction. The properties of nodes, such as node degree, are pre-calculated and used as self-supervised labels to perform prediction tasks. (2) Context-based Prediction. Local or global contextual information in the graph can be extracted as labels to aid self-supervised learning, e.g., by predicting the shortest path length between nodes, the model can capture long-distance dependencies, which is beneficial for downstream tasks such as link prediction. (3) Self-Training. Learning with the pseudo-labels obtained from the prediction or clustering in a previous stage or even randomly assigned. (4) Domain Knowledge-based Prediction. Expert knowledge or specialized tools are used in advance to analyze graph data (e.g., biological or chemical data) to obtain informative labels. A comparison of four predictive methods is shown in Fig. 5. In this section, due to space limitations, we present only some representative predictive methods and place those relatively less important works in Appendix C.

5.1 Node-Property Prediction (NP)

An effective way to perform predictive learning is to take advantage of the extensive implicit numerical properties within the graph, e.g., node properties such as node degree and local clustering coefficient. Specifically, it first defines a mapping Ω:𝒱𝒴\Omega:\mathcal{V}\rightarrow\mathcal{Y} to denote the extraction of statistical labels yi=Ω(𝐀,𝐗)viy_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}} for each node viv_{i} from graph g=(𝐀,𝐗)g=(\mathbf{A},\mathbf{X}). The learning objective is then formulated as

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)viyi)2\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-y_{i}\Big{)}^{2} (61)

where fθ(𝐀,𝐗)vif_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}} is the predicted label of node viv_{i}. With different node properties, the mapping function Ω()\Omega(\cdot) can have different designs. If node degree is used as a local node property for self-supervision, we have yi=Ω(𝐀,𝐗)vi=∑j=1N𝐀i,jy_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}}=\sum_{j=1}^{N}\mathbf{A}_{i,j}. For the local clustering coefficient, we have

y_{i}=\Omega(\mathbf{A},\mathbf{X})_{v_{i}}=\frac{2\big|\{(v_{m},v_{n})\in\mathcal{E}\mid v_{m}\in\mathcal{N}_{i},v_{n}\in\mathcal{N}_{i}\}\big|}{|\mathcal{N}_{i}|\left(|\mathcal{N}_{i}|-1\right)} (62)

where the local clustering coefficient describes the level of node aggregation around node viv_{i} in a graph. Beyond the above two properties, any other node property (or even a combination of them) can be used as statistical labels to perform the pretext task of Node Property Prediction.
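The statistical labels themselves require no learning; a minimal NumPy sketch for extracting node degrees and local clustering coefficients (Equ. 62) from a dense adjacency matrix is given below.

import numpy as np

def node_property_labels(A: np.ndarray):
    # A: dense {0,1} adjacency matrix of an undirected graph without self-loops.
    degree = A.sum(axis=1)                                  # y_i = sum_j A_ij
    triangles = np.diag(A @ A @ A) / 2.0                    # edges among the neighbors of each node
    denom = np.maximum(degree * (degree - 1) / 2.0, 1.0)    # guard against division by zero
    clustering = triangles / denom                          # 2 * edges / (|N_i| (|N_i| - 1))
    return degree, clustering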

5.2 Context-based Prediction (CP)

Apart from Node Property Prediction, the underlying graph structure information can be further explored to construct a variety of regression-based or classification-based pretext tasks and thus provide self-supervised signals. We refer to this branch of methods as context-based predictive methods because it generally explores contextual information.

S2GRL\text{S}^{2}\text{GRL} [59]. Motivated by the observation that two arbitrary nodes in a graph can interact with each other through paths of different lengths, S2GRL\text{S}^{2}\text{GRL} treats the contextual position of one node relative to the other as a source of free and effective supervisory signals. Specifically, it defines the kk-hop context of node viv_{i} as 𝒞ik={vj|d(vi,vj)=k}(k=1,2,,K)\mathcal{C}_{i}^{k}=\{v_{j}|d(v_{i},v_{j})=k\}(k=1,2,\cdots,K), where d(vi,vj)d(v_{i},v_{j}) is the shortest path length between node viv_{i} and node vjv_{j}. In this way, for each target node viv_{i}, if a node vj𝒞ikv_{j}\in\mathcal{C}_{i}^{k}, then the hop count kk (relative contextual position) will be assigned to node vjv_{j} as pseudo-label yi,j=ky_{i,j}=k. The learning objective is defined as predicting the hop count between pairs of nodes, as follows

ssl(θ,ω,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right) =1NKvi𝒱k=1Kvj𝒞ik(fw(\displaystyle=\frac{1}{NK}\sum_{v_{i}\in\mathcal{V}}\sum_{k=1}^{K}\sum_{v_{j}\in\mathcal{C}_{i}^{k}}\ell\Big{(}f_{w}\big{(} (63)
fθ(𝐀,𝐗)vi,fθ(𝐀,𝐗)vj),k)\displaystyle f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}\big{)},k\Big{)}

where ()\ell(\cdot) denotes the cross entropy loss and fω()f_{\omega}(\cdot) linearly maps the input to a 1-dimension value. Compared with S2GRL\text{S}^{2}\text{GRL}, PairwiseDistance [10] truncates shortest paths longer than 4, mainly to avoid the excessive computational burden and to prevent very noisy ultra-long pairwise distances from dominating the optimization.
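Hop-count pseudo-labels can be precomputed by a breadth-first search truncated at k_max hops, as in the following sketch; adj_list is assumed to map each node to its list of neighbors.

import collections

def hop_count_labels(adj_list, source: int, k_max: int):
    # BFS from `source`; returns {node: hop count} for nodes within k_max hops,
    # usable as pseudo-labels y_{i,j} = d(v_i, v_j).
    dist = {source: 0}
    queue = collections.deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k_max:
            continue
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    dist.pop(source)
    return dist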

PairwiseAttrSim [10]. Due to the neighborhood-based message passing mechanism, the learned representations of two similar nodes in the graph are not necessarily similar, as opposed to two identical images that will yield the same representations in the image domain. Though we would like to utilize local neighborhoods in GNNs to enhance node feature transformation, we still wish to preserve the node pairwise similarity to some extent, rather than allowing a node's neighborhood to drastically change it. Thus, the pretext task of PairwiseAttrSim can be established to achieve node similarity preservation. Specifically, it first samples the node pairs with the KK highest and lowest similarities, 𝒮i,h\mathcal{S}_{i,h} and 𝒮i,l\mathcal{S}_{i,l}, for node viv_{i}, given by

𝒮i,h\displaystyle\mathcal{S}_{i,h} ={(vi,vj)sij in top-K of {sik}k=1,kiN}\displaystyle=\left\{\left(v_{i},v_{j}\right)\mid s_{ij}\text{ in top-K of }\left\{s_{ik}\right\}_{k=1,k\neq i}^{N}\right\} (64)
𝒮i,l\displaystyle\mathcal{S}_{i,l} ={(vi,vj)sij in bottom-K of {sik}k=1,kiN}\displaystyle=\left\{\left(v_{i},v_{j}\right)\mid s_{ij}\text{ in bottom-K of }\left\{s_{ik}\right\}_{k=1,k\neq i}^{N}\right\}

where sijs_{ij} measures the node feature similarity between node viv_{i} and node vjv_{j} (according to cosine similarity). Let 𝒮i=𝒮i,h𝒮i,l\mathcal{S}_{i}=\mathcal{S}_{i,h}\cup\mathcal{S}_{i,l}; the learning objective can then be formulated as a regression problem, as follows

ssl(θ,ω,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right) =12NKvi𝒱(vm,vn)𝒮i(fw(\displaystyle=\frac{1}{2NK}\sum_{v_{i}\in\mathcal{V}}\sum_{(v_{m},v_{n})\in\mathcal{S}_{i}}\Big{(}f_{w}\big{(} (65)
fθ(𝐀,𝐗)vm,fθ(𝐀,𝐗)vn)sm,n)2\displaystyle f_{\theta}(\mathbf{A},\mathbf{X})_{v_{m}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{n}}\big{)}-s_{m,n}\Big{)}^{2}

where fω()f_{\omega}(\cdot) linearly maps the input to a 1-dimension value.

Distance2Clusters [10]. PairwiseAttrSim applies a sampling strategy to reduce the time complexity, but still involves sorting the node similarities, which is a very time-consuming operation. Inspired by various unsupervised clustering algorithms [60, 61, 62, 63, 64, 65, 66, 67, 68], if a set of clusters can be pre-obtained, PairwiseAttrSim can be further simplified to predicting the shortest path from each node to the anchor nodes associated with cluster centers, resulting in a novel pretext task - Distance2Clusters. Specifically, it first partitions the graph into KK clusters {C1,C2,⋯,CK}\{C_{1},C_{2},\cdots,C_{K}\} by applying some classical unsupervised clustering algorithm. Inside each cluster CkC_{k}, the node with the highest degree is taken as the corresponding cluster center, denoted as ckc_{k} (1kK1\leq k\leq K). Then the distance 𝐝iK\mathbf{d}_{i}\in\mathbb{R}^{K} from node viv_{i} to the cluster centers {ck}k=1K\{c_{k}\}_{k=1}^{K} can be calculated. The learning objective of Distance2Clusters is defined as

ssl(θ,𝐀,𝐗)=1Nvi𝒱fθ(𝐀,𝐗)vi𝐝i2\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{\|}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{d}_{i}\Big{\|}^{2} (66)

Meta-path Prediction [69]. A meta-path of length ll is a sequence of nodes connected with heterogeneous edges, i.e., v1t1v2t2tlvlv_{1}\stackrel{{\scriptstyle t_{1}}}{{\longrightarrow}}v_{2}\stackrel{{\scriptstyle t_{2}}}{{\longrightarrow}}\ldots\stackrel{{\scriptstyle t_{l}}}{{\longrightarrow}}v_{l}, where tl𝒯et_{l}\in\mathcal{T}^{e} denotes the type of the ll-th edge in the meta-path. Given a set of node pairs 𝒮\mathcal{S} sampled from the heterogeneous graph and KK pre-defined meta-path types \mathcal{M}, this pretext task aims to predict whether two nodes (vi,vj)𝒮(v_{i},v_{j})\in\mathcal{S} are connected by one of the meta-path types mm\in\mathcal{M}. Finally, the predictions of the KK meta-paths are formulated as KK binary classification tasks, as follows

ssl(θ,𝐀,𝐗)\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right) =1K|𝒮|m(vi,vj)𝒮\displaystyle=\frac{1}{K|\mathcal{S}|}\sum_{m\in\mathcal{M}}\sum_{(v_{i},v_{j})\in\mathcal{S}} (67)
(fw(fθ(𝐀,𝐗)vi,fθ(𝐀,𝐗)vj),𝐘i,jm)\displaystyle\ell\Big{(}f_{w}\big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}\big{)},\mathbf{Y}_{i,j}^{m}\Big{)}

where ()\ell(\cdot) denotes the cross entropy loss, and 𝐘i,jm\mathbf{Y}_{i,j}^{m} is the ground-truth label, where 𝐘i,jm=1\mathbf{Y}_{i,j}^{m}=1 if there exists a meta-path mm between node viv_{i} and node vjv_{j}, and 𝐘i,jm=0\mathbf{Y}_{i,j}^{m}=0 otherwise.

SLiCE [70]. Different from the pretext task of Meta-path Prediction [69], which requires pre-defined meta-paths, SLiCE automatically learns the composition of different meta-paths for a specific task. Specifically, it first samples a set of nodes 𝒮\mathcal{S} from the node set 𝒱\mathcal{V}. Given a node vi𝒮v_{i}\in\mathcal{S}, it generates a context subgraph gi=(𝐀i,𝐗i)g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i}) around viv_{i} and encodes the context as a low-dimensional embedding matrix 𝐇i\mathbf{H}_{i}. Then it randomly masks a node vimv_{i}^{m} in graph gig_{i} for prediction. Therefore, the pretext task aims to maximize the probability of observing this masked node vimv_{i}^{m} based on the context gig_{i},

ssl(θ,𝐀,𝐗)=vi𝒮vimgip(vim𝐇i,θ)\begin{split}\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\prod_{v_{i}\in\mathcal{S}}\prod_{v_{i}^{m}\in g_{i}}p\left(v_{i}^{m}\mid\mathbf{H}_{i},\theta\right)\end{split} (68)

where p(θ)p(\cdot\mid\theta) can in practice be approximated by a graph neural network model fθ()f_{\theta}(\cdot) parameterized by θ\theta.

Distance2Labeled [10]. Recent work provides deep insight into existing self-supervised pretext tasks that utilize only attribute and structure information and finds that they are not always beneficial in improving the performance of downstream tasks, possibly because the information mined by the pretext tasks may have been fully exploited during message passing by the GNN model. Thus, given partial information about downstream tasks, such as a small set of labeled nodes, we can explore label-specific self-supervised tasks. For example, we can directly modify the pretext task of Distance2Clusters by combining label information to create a new pretext task - Distance2Labeled. Specifically, it first calculates the average, minimum, and maximum shortest path length from node viv_{i} to all labeled nodes in the classes {Ck}k=1K\{C_{k}\}_{k=1}^{K}, resulting in a distance vector 𝐝i3K\mathbf{d}_{i}\in\mathbb{R}^{3K}. Finally, the learning objective of Distance2Labeled can be formulated as a distance regression problem, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱fθ(𝐀,𝐗)vi𝐝i2\displaystyle\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big{\|}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{d}_{i}\Big{\|}^{2} (69)

Compared with Distance2Clusters, Distance2Labeled utilizes task-specific label information rather than additional unsupervised clustering algorithms to find cluster centers, showing advantages in both efficiency and performance.

Figure 6: A summary of graph self-supervised learning (SSL) methods. We categorize them into three branches: contrastive, generative, and predictive. Contrastive methods contrast different views and deal with data-data pair (inter-data) information, and we further categorize them from three aspects: augmentation strategy, pretext task, and objective. In terms of augmentation strategy, they can be divided into four major categories: feature-based augmentation, structure-based augmentation, sampling-based augmentation, and adaptive augmentation. From the perspective of pretext tasks, they can be divided into same-scale contrasting and cross-scale contrasting. The same-scale contrasting includes Local-Local (L-L), Context-Context (C-C), and Global-Global (G-G) methods, while the cross-scale contrasting includes Local-Global (L-G), Local-Context (L-C), and Context-Global (C-G) methods. Generative methods focus on the intra-data information and can be divided into Autoencoding and Autoregressive methods. Predictive methods handle the data-label relationship and can be further divided into four major categories: Node Property Prediction (NP), Context-based Prediction (CP), Self-Training (ST), and Domain Knowledge-based Prediction (DK).

5.3 Self-Training (ST)

For self-training methods, the prediction results from the previous stage can be used as labels to guide the training of the next stage, thus achieving self-training in an iterative way.

Multi-stage Self-training [71]. This pretext task is proposed to leverage the abundant unlabeled nodes to help training. Given both the labeled set 𝒟Lt\mathcal{D}_{L}^{t} and unlabeled set 𝒟Ut\mathcal{D}_{U}^{t} in the iteration step tt, the graph encoder fθ()f_{\theta}(\cdot) is first trained on the labeled set 𝒟Lt\mathcal{D}_{L}^{t}, as follows

node(θ,𝐀,𝐗,𝒟Lt)=(vi,yit)𝒟Lt(fθ(𝐀,𝐗)vi,yit)\mathcal{L}_{node}\left(\theta,\mathbf{A},\mathbf{X},\mathcal{D}_{L}^{t}\right)=\sum_{\left(v_{i},y_{i}^{t}\right)\in\mathcal{D}_{L}^{t}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}^{t}\Big{)} (70)

and then applied to make predictions 𝒴^t={y^itvi𝒟Ut}\widehat{\mathcal{Y}}^{t}=\{\widehat{y}_{i}^{t}\mid v_{i}\in\mathcal{D}_{U}^{t}\} on the unlabeled set 𝒟Ut\mathcal{D}_{U}^{t}. Then the predicted labels (as well as the corresponding nodes) with top-KK confidence

\mathcal{D}_{N}^{t}=\left\{(v_{i},\widehat{y}_{i}^{t})\mid\widehat{y}_{i}^{t}\text{ in top-K confidence of }\widehat{\mathcal{Y}}^{t}\right\} (71)

are considered as the pseudo-labels and moved to the labeled node set 𝒟Lt\mathcal{D}_{L}^{t} to obtain an updated labeled set 𝒟Lt+1=𝒟Lt𝒟Nt\mathcal{D}_{L}^{t+1}=\mathcal{D}_{L}^{t}\bigcup\mathcal{D}_{N}^{t} and an updated unlabeled set 𝒟Ut+1=𝒟Ut∖𝒟Nt\mathcal{D}_{U}^{t+1}=\mathcal{D}_{U}^{t}\setminus\mathcal{D}_{N}^{t}. Finally, a fresh graph encoder is trained on the updated labeled set 𝒟Lt+1\mathcal{D}_{L}^{t+1}, and the above operations are performed multiple times in an iterative manner.
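The multi-stage loop can be summarized by the following sketch; make_model and train_fn are hypothetical helpers that build a fresh encoder and train it on the current labeled set, and k is the number of pseudo-labels added per stage.

import torch

def multi_stage_self_training(make_model, train_fn, A, X, labeled, unlabeled, k, stages):
    # labeled: dict {node: label}; unlabeled: set of node indices without labels.
    model = None
    for _ in range(stages):
        model = make_model()                       # a fresh encoder each stage
        train_fn(model, A, X, labeled)             # supervised training on the current labeled set
        with torch.no_grad():
            probs = torch.softmax(model(A, X), dim=1)
        conf, pred = probs.max(dim=1)
        # move the top-k most confident unlabeled nodes into the labeled set as pseudo-labels
        cand = sorted(unlabeled, key=lambda v: conf[v].item(), reverse=True)[:k]
        for v in cand:
            labeled[v] = int(pred[v])
            unlabeled.discard(v)
    return model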

Node Clustering or Partitioning [12]. Compared to Multi-stage Self-training, the pretext task of Node Clustering pre-assigns a pseudo-label yiy_{i}, e.g., the cluster index, to each node viv_{i} by some unsupervised clustering algorithms. The learning objective of this pretext task is then formulated as a classification problem, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)vi,yi)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}\Big{)} (72)

When node attributes are not available, another choice to obtain pseudo-labels is based on the topology of the given graph structure or adjacency matrix. Specifically, graph partitioning [72, 73] partitions the nodes of a graph into roughly equal subsets, such that the number of edges connecting nodes across subsets is minimized. To absorb the advantages of both attributive- and structural-based clustering, CAGNN [74] combines node clustering and node partitioning to propose a new pretext task. Concretely, it first assigns cluster indices as pseudo-labels but follows a topology refining process that refines the clusters by minimizing the inter-cluster edges.

M3S [75]. Combining Multi-stage Self-training with Node Clustering, M3S applies DeepCluster [68] and an alignment mechanism as a self-checking mechanism, thus providing stronger self-supervision. Specifically, a KK-means clustering algorithm is performed on the node representations 𝐇(t)\mathbf{H}^{(t)} learned in iteration step tt (rather than on 𝐗\mathbf{X}), and the clustered pseudo-labels 𝒟Nt\mathcal{D}_{N}^{t} that match the prediction of the classifier in the last iteration step t1t-1 will be added to the labeled set to obtain an updated labeled set 𝒟Lt+1=𝒟Lt𝒟Nt\mathcal{D}_{L}^{t+1}=\mathcal{D}_{L}^{t}\bigcup\mathcal{D}_{N}^{t}. Finally, a fresh model will be trained on the labeled set 𝒟Lt+1\mathcal{D}_{L}^{t+1} with the objective defined in Equ. 70.

Cluster Preserving [17]. An important characteristic of real-world graphs is the cluster structure, so we can consider cluster preservation as a self-supervised pretext task. Unsupervised clustering algorithms are first applied to group the nodes in a graph into KK non-overlapping clusters {Ck}k=1K\{C_{k}\}_{k=1}^{K}; then the cluster prototypes can be obtained by ck=AGGREGATE({fθ(𝐀,𝐗)viviCk})c_{k}=\text{AGGREGATE}(\{f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}\mid v_{i}\in C_{k}\}). The mapping function gω()g_{\omega}(\cdot) is used to estimate the similarity of node viv_{i} with the cluster prototype ckc_{k}, e.g., the probability y^i,k\widehat{y}_{i,k} that node viv_{i} belongs to cluster CkC_{k} is defined as follows,

y^i,k=exp(gω(fθ(𝐀,𝐗)vi,ck))k=1Kexp(gω(fθ(𝐀,𝐗)vi,ck))\widehat{y}_{i,k}=\frac{\exp\Big{(}g_{\omega}(f_{\theta}\big{(}\mathbf{A},\mathbf{X})_{v_{i}},c_{k}\big{)}\Big{)}}{\sum_{k=1}^{K}\exp\Big{(}g_{\omega}(f_{\theta}\big{(}\mathbf{A},\mathbf{X})_{v_{i}},c_{k}\big{)}\Big{)}} (73)

Finally, the objective of Cluster Preserving is defined as

ssl(θ,𝐀,𝐗)=1Nvi𝒱k=1Kyi,klog(y^i,k)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=-\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\sum_{k=1}^{K}y_{i,k}\log(\widehat{y}_{i,k}) (74)

where the ground-truth label yi,k=1y_{i,k}=1 if node viv_{i} is grouped into cluster CkC_{k}, otherwise yi,k=0y_{i,k}=0.

5.4 Domain Knowledge-based Prediction (DK)

The formation of real-world graphs usually obeys specific rules, e.g., the links between atoms in molecular graphs are constrained by valence bond theory, while cross-cited papers in citation networks generally share the same topic or authors. Therefore, extensive expert knowledge can be incorporated as a prior into the design of pretext tasks.

Contextual Molecular Property Prediction [76] incorporates domain knowledge about biological macromolecules to design molecule-specific pretext tasks. Given a node viv_{i}, it samples its kk-hop neighborhood nodes and edges as a local subgraph and then extracts statistical properties of this subgraph. Specifically, it counts the number of occurrences of (node, edge) pairs around the center node viv_{i} and then lists all the node-edge count terms in alphabetical order, which constitutes the final property, e.g., C_\_N-DOUBLE1_\_O-SINGLE1 in Fig. 5 (d). With plenty of context-aware properties 𝒫={pk}k=1K\mathcal{P}=\{p_{k}\}_{k=1}^{K} pre-defined, contextual property prediction can be defined as a multi-class prediction problem with one class corresponding to one contextual property, as follows

ssl(θ,𝐀,𝐗)=1Nvi𝒱(fθ(𝐀,𝐗)vi,yi)\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\ell\Big{(}f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}},y_{i}\Big{)} (75)

where ()\ell(\cdot) denotes the cross entropy loss, and yi=ky_{i}=k if the molecular property of node viv_{i} is pkp_{k}.

Graph-level Motif Prediction [76]. Motifs are recurrent sub-graphs in the input graphs, which are prevalent in molecular graphs. One important class of motifs in molecules is functional groups, which encode the rich domain knowledge of molecules and can be easily detected by professional software, such as RDKit. Suppose we consider the presence of KK motifs {mk}k=1K\{m_{k}\}_{k=1}^{K} in the molecular data; then for one specific molecule graph gi=(𝐀i,𝐗i)𝒢g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}, it detects whether each motif shows up in gig_{i} and uses this as the label 𝐲iK\mathbf{y}_{i}\in\mathbb{R}^{K}. Specifically, if mkm_{k} shows up in gig_{i}, the kk-th element 𝐲i,k\mathbf{y}_{i,k} is set to 1, otherwise 0. Formally, the learning objective of the motif prediction task is formulated as a multi-label classification problem, as follows

ssl(γ,𝒢)=1|𝒢|gi𝒢(fγ(𝐀i,𝐗i),𝐲i)\mathcal{L}_{ssl}\left(\gamma,\mathcal{G}\right)=\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\ell\Big{(}f_{\gamma}(\mathbf{A}_{i},\mathbf{X}_{i}),\mathbf{y}_{i}\Big{)} (76)

where ()\ell(\cdot) denotes the binary cross entropy loss.
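As an illustration (not the exact tooling of [76]), functional-group motif labels for a molecule can be obtained with RDKit substructure matching, assuming the motifs are specified as SMARTS patterns.

# Requires RDKit (e.g., `pip install rdkit`).
from rdkit import Chem

def motif_labels(smiles: str, smarts_motifs):
    # smarts_motifs: list of K SMARTS patterns, one per functional-group motif m_k.
    mol = Chem.MolFromSmiles(smiles)
    return [int(mol.HasSubstructMatch(Chem.MolFromSmarts(p))) for p in smarts_motifs]

# e.g., motif_labels("CCO", ["[OX2H]", "C=O"]) -> [1, 0]  (hydroxyl present, no carbonyl)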

6 Summary of the Implementation

A summary of the surveyed works is presented in Fig. 6, and Appendix D lists their properties, including graph property, pretext task, augmentation, objective function, training strategy, and publication year. Furthermore, we show in Appendix E the implementation details of surveyed works, such as downstream tasks, evaluation metrics, and datasets.

6.1 Downstream Tasks

The graph SSL methods are generally evaluated on three levels of graph tasks: node-level, link-level, and graph-level. Among them, the graph-level tasks are usually performed on multiple graphs in the form of inductive learning. Commonly used graph-level tasks include graph classification and graph regression. The link-level tasks mainly focus on link prediction, that is, given two nodes, the objective is to predict whether a link (edge) exists between them. On the other hand, the node-level tasks are generally performed on a large graph in the form of transductive learning. Depending on whether labels are provided, they can be divided into three categories: node regression, node classification, and node clustering. Node classification and node regression are usually performed with partially known labels. Instead, node clustering is performed in a more challenging unsupervised manner and is adopted when the performance of node classification is not sufficiently distinguishable.

6.2 Evaluation Metrics

For graph classification tasks, the commonly used evaluation metrics include ROC-AUC and Accuracy (Acc); while for graph regression tasks, Mean Absolute Error (MAE) is used. In terms of link prediction tasks, ROC-AP, ROC-PR, and ROC-AUC are usually used as evaluation metrics. Besides, node regression tasks are usually evaluated by metrics including MAE, Mean Square Error (MSE), and Mean Absolute Percentage Error (MAPE). In addition to Accuracy, node classification tasks also adopt F1-score for single-label classification and Micro-F1 (or Macro-F1) for multi-label classification. Moreover, node clustering tasks often adopt the same metrics used to evaluate the unsupervised clustering, such as Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Accuracy, etc.

6.3 Datasets

The statistics of a total of 41 datasets are available in Appendix F. Commonly used datasets for graph self-supervised learning tasks can be divided into five categories: citation networks, social networks, protein networks, molecule graphs, and others. (1) Citation Networks. In citation networks, nodes usually denote papers, node attributes are some keywords in papers, edges denote cross-citation, and categories are topics of papers. Note that nodes in the citation networks may also sometimes indicate authors, institutions, etc. (2) Social Networks. The social network datasets consider entities (e.g., users or authors) as nodes, their interests and hobbies as attributes, and their social interactions as edges. The widely used social network datasets for self-supervised learning are mainly some classical graph datasets, such as Reddit [8], COLLAB [77]. (3) Molecule Graphs. In molecular graphs, nodes represent atoms in the molecule, the atom index is indicated by the node attributes, and edges represent bonds. Molecular graph datasets typically contain multiple graphs and are commonly used for tasks such as graph classification and graph regression, e.g., predicting molecular properties. (4) Protein Networks. The protein networks can be divided into two main categories - Protein Molecule Graph and Protein Interaction Network - based on the way they are modeled. The Protein Molecule Graph is a particular type of molecule graph, where nodes represent amino acids, and an edge indicates the two connected nodes are less than 6 angstroms apart. The commonly used datasets include PROTEINS [78] and D&D [78], used for chemical molecular property prediction. The other branch is Protein Interaction Networks, where nodes denote protein molecules, and edges indicate their interactions. The commonly used dataset is PPI [79], used for graph biological function prediction. (5) Other Graphs. In addition to the four types of datasets mentioned above, there are some datasets that are less common or difficult to categorize, such as image, traffic, and co-purchase datasets.

6.4 Codes in Github

The open-source codes are beneficial to the development of the deep learning community. A summary of the open-source codes of 71 surveyed works is presented in Appendix G, where we provide hyperlinks to their open-source codes. Most of these methods are implemented on GPUs based on Pytorch or Tensorflow libraries. Moreover, we have created a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning to summarize the latest advances in graph SSL, which will be updated in real-time as more papers and their codes become available.

6.5 Experimental Study

To make a fair comparison, we select two important downstream tasks, node classification and graph classification, and provide the classification performance of various classical algorithms on 15 commonly used graph datasets in Appendix H. Due to space limitations, please refer to the appendix for more experimental results and analysis.

7 Discussion

We begin by discussing and summarizing the connections and development among various methods. To present a clearer picture of the development lineage of graph SSL methods, we provide a complete timeline in Appendix I, listing the publication dates of some key milestones. Besides, we provide the inheritance connections between methods to show how they were developed. Furthermore, we provide short descriptions of the contributions of several seminal works to highlight their importance.

DGI [9] is a pioneering work for graph contrastive learning, originally designed specifically for node classification on attribute graphs and later extended to other types of graphs, resulting in new variants such as HDGI [31] for heterogeneous graphs, STDGI [28] for spatial-temporal graphs, DMGI [80] for multiplex graphs, and BiGI [37] for bipartite graphs. Besides, InfoGraph [49] extends DGI to global-context contrasting, achieving state-of-the-art performance on multiple graph classification datasets. With the focus shifting from local nodes and global graphs to subgraphs, GCC [38] proposes the first subgraph-level context-context contrasting framework, where subgraphs sampled from the same graph are considered a positive pair.

Different from DGI, GRACE [26] focuses on contrasting views at the node level by generating multiple augmented graphs through handcrafted augmentations and then encouraging consistency between the same nodes in different views. GCA [40] adopts a similar framework to GRACE but focuses on designing an adaptive augmentation strategy. Similarly, GROC [43] argues that gradient information can be used to guide data augmentation and proposes a gradient-based graph topology augmentation that further improves the performance of GRACE and GCA. BGRL [27] adopts the same local-local contrasting as GRACE but, inspired by BYOL [81], explores for the first time whether negative samples are indispensable for graph contrastive learning.

The focus on three different levels (nodes, edges, and structures) has led to different generative methods such as Graph Completion [12], Edge Feature Masking [11], and Adjacency Matrix Reconstruction [14], respectively. Moreover, due to their simplicity and effectiveness, these methods have been widely used in algorithms such as GPT-GNN [32] and in recommendation applications such as Pretrain-Recsys [82]. In terms of predictive methods, the basic difference between methods lies in how the pseudo-labels are obtained, and there are three main approaches: (1) numerical statistics, such as Node Property Prediction and PairwiseAttrSim [10]; (2) prediction results from the previous training stage, such as CAGNN [74] and M3S [75]; (3) domain knowledge, such as Molecular Property Prediction and Global Motif Prediction [76].

Discussion on Pros and Cons. Next, we discuss the advantages and disadvantages of some classical algorithms from four aspects: innovation, accessibility, effectiveness (performance), and efficiency, based on which we divide existing algorithms into four categories. (1) Pioneering work. Representative works include DGI [9], InfoGraph [49], HDGI [31], STDGI [28], etc. These methods are the first to apply contrastive learning to a novel downstream task or graph type, showing promising innovations and inspiring much follow-up research. However, as early attempts, they often perform relatively poorly on downstream tasks and incur high computational complexity compared to subsequent works. (2) Knowledge-based work. Some works combine self-supervised learning techniques with prior knowledge to obtain excellent performance on downstream tasks. For example, Molecular Property Prediction [76] incorporates domain knowledge, i.e., molecular properties, while LCC [45] introduces label information into the computation of supervision signals. These knowledge-based methods usually achieve fairly good performance thanks to the additional information, but this also limits their applicability to other tasks and graph types. (3) Performance-oriented work. Some works build on existing work and pursue state-of-the-art performance on a variety of datasets; representative works include KS2L [83], BGRL [27], and SUGAR [84]. While these works achieve state-of-the-art performance on a variety of downstream tasks, they are largely incremental contributions to previous seminal work (e.g., DGI) and are relatively weak in innovation and accessibility. (4) Generative and predictive work. Most generative and predictive methods are less effective than contrastive methods, but they are generally very simple to implement, easy to combine with existing frameworks, have lower computational complexity, and exhibit better applicability and efficiency.

Pretext Tasks for Complex Types of Graphs. Most existing work focuses on the design of pretext tasks for attribute graphs, with little effort devoted to more complex graph types, such as spatial-temporal and heterogeneous graphs. Moreover, these pretext tasks usually utilize only node-level or graph-level features, limiting their ability to exploit richer information, such as temporal information in spatial-temporal graphs and relation information in heterogeneous graphs. As a result, the design of more suitable pretext tasks can be considered from three aspects: (1) designing graph type-specific pretext tasks that adaptively pick the most suitable task depending on the type of graph; (2) incorporating temporal or heterogeneous information (in the form of prior knowledge) into the pretext task design; (3) taking the automated design of pretext tasks as a new research topic from the perspective of automated machine learning.

Lack of Theoretical Foundation. Despite the great success of graph SSL on various tasks, most methods simply draw on the successful experience of SSL in the CV and NLP domains. In other words, most existing graph SSL methods are designed by intuition, and their performance gains are evaluated only by empirical experiments. The lack of a sufficient theoretical foundation behind these designs has led to both performance bottlenecks and poor explainability. Therefore, we believe that building a solid theoretical foundation for graph SSL from a graph-theory perspective and minimizing the gap between theoretical foundations and empirical designs is a promising future direction. For example, an important question for graph SSL is whether mutual information maximization is the only means to achieve graph contrastive learning. Such questions have been explored in [54] for image data, but a corresponding analysis for the graph domain is not yet available. In Sec. 3.4, in addition to MI estimators such as InfoNCE, we introduced some contrastive objectives that are not based on mutual information, such as the triplet margin and quadruplet losses; however, the theoretical connection between these losses and mutual information remains to be explored.

Insufficient Augmentation Strategies. Recent advances in visual representation learning [2, 3] are mainly attributed to a variety of data augmentation strategies, such as resizing, rotation, coloring, etc. However, due to the inherent non-Euclidean nature of graph data, it is difficult to directly apply existing image-based augmentations to graphs. Moreover, most augmentation strategies on graphs are limited to adding/removing nodes and edges, or combinations thereof, to achieve the claimed state-of-the-art performance. To further improve the performance of SSL on graphs, designing more effective augmentation strategies is a promising direction. More importantly, the design of an augmentation strategy should follow well-founded guidelines rather than relying entirely on subjective intuition. In summary, we argue that the design of graph augmentations should be based on the following four guidelines: (1) Applicability: graph augmentation should ideally be a plug-and-play module that can be easily combined with existing self-supervised learning frameworks. (2) Adaptability: some works [10, 27] have pointed out that different datasets and task types may require different augmentations, so designing data-specific and task-specific augmentation strategies is a potential research topic. (3) Efficiency: data augmentation should be a lightweight module that does not impose a heavy computational burden on the original implementation. (4) Dynamics: the augmentation strategy should be able to be updated dynamically as training proceeds.
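As a minimal illustration of the two simplest strategies mentioned above, the sketch below applies random edge dropping and attribute masking to a toy dense adjacency matrix; the function names and drop rates are illustrative choices rather than the recipe of any particular paper.

```python
# Illustrative sketch of two common graph augmentations: edge dropping and attribute masking.
import torch

def drop_edges(adj: torch.Tensor, drop_rate: float = 0.2) -> torch.Tensor:
    """Randomly remove a fraction of existing (undirected) edges from a dense adjacency matrix."""
    mask = (torch.rand_like(adj) > drop_rate).float()
    mask = torch.triu(mask, diagonal=1)              # decide each undirected edge once
    mask = mask + mask.t() + torch.eye(adj.size(0))  # symmetrize and keep the diagonal
    return adj * mask

def mask_attributes(x: torch.Tensor, mask_rate: float = 0.2) -> torch.Tensor:
    """Randomly zero out whole feature dimensions, as in attribute-masking augmentations."""
    keep = (torch.rand(x.size(1)) > mask_rate).float()
    return x * keep.unsqueeze(0)

# Hypothetical toy graph: 4 nodes, 5-dimensional attributes.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 1],
                    [1, 1, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
x = torch.randn(4, 5)
adj_aug, x_aug = drop_edges(adj), mask_attributes(x)
```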

Inefficient Negative Sampling Strategies. The selection of high-quality negative samples is a crucial issue. The most common strategy is uniform sampling, which has been shown to yield largely uninformative negatives [85, 86, 87]. The problem of how to better obtain negative samples has been well explored in computer vision. For example, [85] presents debiased contrastive learning, which directly corrects the sampling bias of negative samples. Besides, [86] takes advantage of mixup techniques to directly synthesize hard negative samples in the embedding space. Moreover, [87] develops a family of unsupervised sampling strategies for user-controllable negative sample selection. Despite their success, these methods are specifically designed for image data and may be difficult to apply directly to non-Euclidean graph data. More importantly, accurately identifying hard negative samples becomes more difficult when label information is unavailable. Therefore, how to reduce the gap between ideal and practical contrastive learning through a decent negative sampling strategy requires more exploration.
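The sketch below contrasts plain uniform negative sampling with a mixup-style synthesis of harder negatives in the embedding space, loosely inspired by the idea in [86]; it is a simplified illustration on random embeddings, not a faithful reimplementation of any of the cited methods.

```python
# Simplified sketch: uniform negative sampling vs. mixup-style hard-negative synthesis
# in the embedding space (an illustration, not the authors' implementation).
import torch

def uniform_negatives(embeddings: torch.Tensor, anchor_idx: int, k: int) -> torch.Tensor:
    """Sample k negatives uniformly from all nodes other than the anchor."""
    candidates = torch.tensor([i for i in range(embeddings.size(0)) if i != anchor_idx])
    idx = candidates[torch.randperm(candidates.size(0))[:k]]
    return embeddings[idx]

def mixup_hard_negatives(negatives: torch.Tensor, num_synthetic: int = 4) -> torch.Tensor:
    """Synthesize extra negatives by convexly mixing random pairs of existing negatives."""
    i = torch.randint(0, negatives.size(0), (num_synthetic,))
    j = torch.randint(0, negatives.size(0), (num_synthetic,))
    lam = torch.rand(num_synthetic, 1)
    return lam * negatives[i] + (1.0 - lam) * negatives[j]

emb = torch.nn.functional.normalize(torch.randn(32, 16), dim=1)  # hypothetical node embeddings
neg = uniform_negatives(emb, anchor_idx=0, k=8)
neg_all = torch.cat([neg, mixup_hard_negatives(neg)], dim=0)     # enlarged negative set
```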

Lack of Explainability. Though existing graph SSL methods have achieved excellent results on various downstream tasks, we still do not know exactly what has been learned from self-supervised pretext tasks. Which of the feature patterns, significant structures, or feature-structure relationships has been learned by self-supervision? Is this learning explicit or implicit? Is it possible to find interpretable correspondences on the input data? These are important issues for understanding and interpreting model behavior but are missing in current graph SSL works. Therefore, we need to explore the interpretability of graph SSL and perform a deep analysis of model behavior to improve the generalization and robustness of existing methods for security- or privacy-related downstream tasks.

Gap between Pre-training and Downstream Tasks. Pre-training with self-supervised tasks and then using the pre-trained model for specific downstream tasks, either by fine-tuning or by freezing the weights, is a common training strategy in graph SSL [82, 88, 89]. However, how should we transfer the pre-trained knowledge to downstream tasks? Although numerous strategies have been proposed to address this problem in the CV and NLP domains [90], they are difficult to apply directly to graphs due to their inherent non-Euclidean structure. Therefore, designing graph-specific techniques to minimize the gap between pre-training and downstream tasks is an important open issue.

8 Conclusion

This paper presents a comprehensive survey of the literature on graph self-supervised learning techniques. We develop a unified mathematical framework for graph SSL. Moreover, we summarize the implementation details of each work and show their similarities and differences. More importantly, this is the first survey to provide a detailed experimental study of graph self-supervised learning, setting the stage for the future development of graph SSL. Finally, we point out the technical limitations of current research and provide promising directions for future work on graph SSL. We hope this survey will inspire follow-up researchers to pay attention to other important but easily overlooked aspects, such as theoretical foundations and explainability, in addition to model performance on downstream tasks.

A. Contrastive Methods

In this section, we continue with additional contrastive methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works. These include (1) methods that are essentially the same as frameworks already presented in the main paper, with only minor differences; (2) methods that have not been peer-reviewed and are only publicly available on platforms such as arXiv or OpenReview; (3) application-oriented methods, where graph SSL is only one of the adopted techniques and is not the focus of the work; (4) methods with only minor performance improvements on benchmark datasets.

A.1 Global-Global Contrasting

Iterative Graph Self-Distillation (IGSD) [33] is a graph self-supervised learning paradigm that iteratively performs teacher-student distillation with graph augmentations. Specifically, given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it performs a series of augmentation transformations $\mathcal{T}(\cdot)$ to generate an augmented graph $\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i})$. Then two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$, together with a READOUT function, are applied to obtain the graph-level representations $\mathbf{h}_{g_{i}}=\text{READOUT}\left(f_{\theta_{1}}(\mathbf{A}_{i},\mathbf{X}_{i})\right)$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}=\text{READOUT}(f_{\theta_{2}}(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i}))$, respectively. Moreover, a prediction head $g_{\omega}(\cdot)$ is used in the student network to obtain $g_{\omega}(\mathbf{h}_{g_{i}})$ and $g_{\omega}(\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})$. To contrast the representations $\mathbf{h}_{g_{i}}$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}$, the symmetric consistency loss is defined as

$$\mathcal{D}\left(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right)=\left\|g_{\omega}(\mathbf{h}_{g_{i}})-\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\right\|^{2}+\left\|g_{\omega}(\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})-\mathbf{h}_{g_{i}}\right\|^{2} \quad (77)$$

Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\omega}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\log\frac{\exp\left(\mathcal{D}(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{i}})\right)}{\sum_{j=1}^{N}\exp\left(\mathcal{D}(\mathbf{h}_{g_{i}},\widetilde{\mathbf{h}}_{\widetilde{g}_{j}})\right)} \quad (78)$$

where the negative samples to contrast with $\mathbf{h}_{g_{i}}$ are $Neg(\mathbf{h}_{g_{i}})=\{\widetilde{\mathbf{h}}_{\widetilde{g}_{j}}\}_{g_{j}\in\mathcal{G},j\neq i}$. The parameters $\theta_{2}$ of the teacher network $f_{\theta_{2}}(\cdot)$ are updated as an exponential moving average (EMA) of the student network parameters $\theta_{1}$, as follows

$$\theta_{2}\longleftarrow\alpha\theta_{2}+(1-\alpha)\theta_{1} \quad (79)$$

where $\alpha\in[0,1)$ is the momentum weight to control the evolving speed of $\theta_{2}$.
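A minimal sketch of the symmetric consistency loss in Eq. (77) and the EMA update in Eq. (79) is given below; the linear layers stand in for the actual graph encoders plus READOUT, so the code only illustrates the training mechanics, not the IGSD architecture itself.

```python
# Sketch of the symmetric consistency loss (Eq. 77) and the EMA teacher update (Eq. 79);
# the encoders and predictor here are toy stand-ins for f_{theta_1}, f_{theta_2}, g_omega.
import torch
import torch.nn as nn

student = nn.Linear(16, 16)                      # stands in for f_{theta_1} + READOUT
teacher = nn.Linear(16, 16)                      # stands in for f_{theta_2} + READOUT
predictor = nn.Linear(16, 16)                    # prediction head g_omega
teacher.load_state_dict(student.state_dict())

def consistency_loss(h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
    # D(h, h~) = ||g(h) - h~||^2 + ||g(h~) - h||^2   (Eq. 77), averaged over the batch
    return ((predictor(h_student) - h_teacher.detach()) ** 2).sum(dim=-1).mean() + \
           ((predictor(h_teacher.detach()) - h_student) ** 2).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(alpha: float = 0.99) -> None:
    # theta_2 <- alpha * theta_2 + (1 - alpha) * theta_1   (Eq. 79)
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_((1.0 - alpha) * p_s)

g = torch.randn(8, 16)                           # hypothetical batch of graph-level inputs
loss = consistency_loss(student(g), teacher(g))
loss.backward()
ema_update()
```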

Domain-Agnostic Contrastive Learning (DACL) [91] uses Mixup noise to create positive and negative examples by mixing data samples either in the input or the embedding space. Specifically, given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it applies a graph encoder $f_{\theta}(\cdot)$ and a READOUT function to obtain the graph-level representation $\mathbf{h}_{g_{i}}=\text{READOUT}(f_{\theta}(\mathbf{A}_{i},\mathbf{X}_{i}))$. It then performs mixup transformations to generate two mixed views that form a positive pair, e.g., $\mathbf{h}^{(1)}_{g_{i}}=\lambda_{1}\mathbf{h}_{g_{i}}+(1-\lambda_{1})\mathbf{h}_{g_{m}}$ and $\mathbf{h}^{(2)}_{g_{i}}=\lambda_{2}\mathbf{h}_{g_{i}}+(1-\lambda_{2})\mathbf{h}_{g_{n}}$, where $\lambda_{1}$ and $\lambda_{2}$ are sampled from a Gaussian distribution and $\mathbf{h}_{g_{m}}$ and $\mathbf{h}_{g_{n}}$ are randomly sampled from $\{\mathbf{h}_{g_{k}}\}_{k=1,k\neq i}^{N}$. Finally, the learning objective is defined as

$$\max_{\theta,\omega}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(g_{\omega}(\mathbf{h}^{(1)}_{g_{i}}),g_{\omega}(\mathbf{h}^{(2)}_{g_{i}})\right) \quad (80)$$

where $g_{\omega}(\cdot)$ is a nonlinear prediction head, and the negative samples to contrast with $g_{\omega}(\mathbf{h}^{(1)}_{g_{i}})$ are $Neg\left(g_{\omega}(\mathbf{h}^{(1)}_{g_{i}})\right)=\{g_{\omega}(\mathbf{h}^{(c)}_{g_{j}})\}_{j\neq i;\,c=1,2}$.
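The embedding-space mixup can be sketched as follows; for simplicity the mixing coefficients are drawn uniformly rather than from the Gaussian used in the paper, and random permutations stand in for the sampled graphs $\mathbf{h}_{g_{m}}$ and $\mathbf{h}_{g_{n}}$.

```python
# Sketch of DACL-style mixup in the embedding space: two mixed views of the same graph
# embedding serve as a positive pair (simplified illustration).
import torch

def mixup_views(h: torch.Tensor):
    """Create two mixed views h^(1), h^(2) for every graph embedding in the batch."""
    n = h.size(0)
    lam1, lam2 = torch.rand(n, 1), torch.rand(n, 1)
    perm1, perm2 = torch.randperm(n), torch.randperm(n)
    view1 = lam1 * h + (1 - lam1) * h[perm1]     # h^(1)_gi = lam1*h_gi + (1-lam1)*h_gm
    view2 = lam2 * h + (1 - lam2) * h[perm2]     # h^(2)_gi = lam2*h_gi + (1-lam2)*h_gn
    return view1, view2

h_graphs = torch.randn(16, 32)                   # hypothetical batch of graph-level embeddings
v1, v2 = mixup_views(h_graphs)                   # positive pair fed to the contrastive objective
```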

A.2 Local-Local Contrasting

KS2L [83] is a novel self-supervised knowledge distillation framework with two complementary intra- and cross-model knowledge distillation modules. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies two linear mapping functions to obtain $\mathbf{Z}^{(1)}=g_{\omega_{1}}(\mathbf{X})$ and $\mathbf{Z}^{(2)}=g_{\omega_{2}}(\mathbf{X})$ and then uses two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$ to obtain the node embedding matrices $\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A},\mathbf{Z}^{(1)})$ and $\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A},\mathbf{Z}^{(2)})$. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2},\omega_{1},\omega_{2}}\frac{1}{2|\mathcal{E}|}\sum_{e_{i,j}\in\mathcal{E}}\left[\mathcal{MI}\left(\mathbf{h}_{i}^{(1)},\mathbf{z}_{j}^{(1)}\right)+\mathcal{MI}\left(\mathbf{h}_{i}^{(1)},\mathbf{z}_{j}^{(2)}\right)\right] \quad (81)$$

where the first term is used for intra-model knowledge distillation, and the negative samples to contrast with $\mathbf{h}_{i}^{(1)}$ are $Neg(\mathbf{h}_{i}^{(1)})=\{\mathbf{z}_{k}^{(1)}\}_{e_{i,k}\notin\mathcal{E}}$. The second term is used for cross-model knowledge distillation, and the negative samples to contrast with $\mathbf{h}_{i}^{(1)}$ are $Neg(\mathbf{h}_{i}^{(1)})=\{\mathbf{z}_{k}^{(2)}\}_{e_{i,k}\notin\mathcal{E}}$.

$\text{CG}^{3}$ [92] adopts a similar framework to GRACE [26] but performs the same label-level contrasting as LCC [45]. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies a localized graph convolution encoder $f_{\theta_{1}}(\cdot)$ and a hierarchical graph convolution encoder $f_{\theta_{2}}(\cdot)$ to generate the node embedding matrices $\mathbf{H}^{(1)}=f_{\theta_{1}}(\mathbf{A},\mathbf{X})$ and $\mathbf{H}^{(2)}=f_{\theta_{2}}(\mathbf{A},\mathbf{X})$. To incorporate the scarce yet valuable label information for training, it uses a supervised label contrastive loss, as follows:

$$\max_{\theta}\frac{1}{2N}\sum_{i=1}^{N}\left[\mathcal{L}^{(1)}_{lc}+\mathcal{L}^{(2)}_{lc}\right] \quad (82)$$

where $\mathcal{L}^{(1)}_{lc}$ and $\mathcal{L}^{(2)}_{lc}$ are defined as

$$\mathcal{L}^{(1)}_{lc}=\log\frac{\sum_{k=1}^{m}\mathbb{I}_{y_{k}=y_{i}}\cdot\exp\left(\mathbf{h}_{i}^{(1)}\cdot\mathbf{h}_{k}^{(2)}\right)}{\sum_{j=1}^{m}\exp\left(\mathbf{h}_{i}^{(1)}\cdot\mathbf{h}_{j}^{(2)}\right)} \quad (83)$$

$$\mathcal{L}^{(2)}_{lc}=\log\frac{\sum_{k=1}^{m}\mathbb{I}_{y_{k}=y_{i}}\cdot\exp\left(\mathbf{h}_{i}^{(2)}\cdot\mathbf{h}_{k}^{(1)}\right)}{\sum_{j=1}^{m}\exp\left(\mathbf{h}_{i}^{(2)}\cdot\mathbf{h}_{j}^{(1)}\right)} \quad (84)$$

where $m$ is the number of labeled nodes, and $\mathbb{I}_{y_{k}=y_{i}}$ is an indicator function that determines whether the label of node $v_{k}$ is the same as that of node $v_{i}$.
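A minimal sketch of the label-contrastive terms in Eqs. (83)-(84) is given below, assuming the embeddings of the $m$ labeled nodes from the two encoders are already available; the maximization objective is written as a loss by negation.

```python
# Minimal sketch of the supervised label-contrastive terms in Eqs. (83)-(84): for a labeled
# node i, embeddings of nodes sharing its label (from the other view) act as positives.
import torch

def label_contrastive_loss(h1: torch.Tensor, h2: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """h1, h2: [m, d] embeddings of the m labeled nodes from the two encoders; y: [m] labels."""
    sim12 = torch.exp(h1 @ h2.t())               # pairwise exp(h_i^(1) . h_j^(2))
    sim21 = torch.exp(h2 @ h1.t())
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    l1 = torch.log((sim12 * same).sum(dim=1) / sim12.sum(dim=1))   # Eq. (83)
    l2 = torch.log((sim21 * same).sum(dim=1) / sim21.sum(dim=1))   # Eq. (84)
    return -(l1 + l2).mean() / 2                 # maximization in Eq. (82), written as a loss

h1, h2 = torch.randn(10, 8), torch.randn(10, 8)  # hypothetical embeddings of 10 labeled nodes
y = torch.randint(0, 3, (10,))
loss = label_contrastive_loss(h1, h2, y)
```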

PT-DGNN [89] is proposed to perform pre-training on a dynamic graph $g=(\mathbf{A}^{(t)},\mathbf{X})$, where $\mathbf{A}^{(t)}_{i,j}=1$ denotes an interaction between nodes $v_{i}$ and $v_{j}$ at time $t$, that is, $t$ is the timestamp of edge $e_{i,j}$. It first applies dynamic subgraph sampling (DySS) to obtain a subgraph $g_{S}=(\mathbf{A}_{s},\mathbf{X}_{s})$ for pre-training. DySS mainly includes three steps: 1) randomly select $s$ nodes as the initial sampling points; 2) put the first-order neighbors of these nodes into the candidate pool and save their timestamps as weights to calculate the sampling probability; 3) select the final $s$ nodes according to the sampling probability. Furthermore, it generates an augmented graph $g_{a}=(\mathbf{A}_{a},\mathbf{X}_{a})=\mathcal{T}(\mathbf{A}_{s},\mathbf{X}_{s})$ by performing attribute masking and edge perturbation, where the masked node and edge sets are denoted as $\mathcal{S}$ and $\mathcal{M}$, respectively. Then a shared graph encoder $f_{\theta}(\cdot)$ is applied to obtain the node embedding matrix $\mathbf{H}=f_{\theta}(\mathbf{A}_{a},\mathbf{X}_{a})$. Finally, the objective of dynamic edge generation is defined as

$$\max_{\theta}\frac{1}{|\mathcal{V}_{a}|}\sum_{v_{i}\in\mathcal{V}_{a}}\sum_{e_{i,j}\in\mathcal{M}}\log\frac{\exp\left(\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{j})\right)}{\sum_{v_{k}\in\mathcal{V}_{a}}\exp\left(\mathcal{D}(\mathbf{h}_{i},\mathbf{h}_{k})\right)} \quad (85)$$

where $\mathcal{V}_{a}$ is the node set of graph $g_{a}$. The learning objective of node attribute generation is defined as follows

$$\min_{\theta,\omega}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\big\|g_{\omega}(\mathbf{h}_{i})-\mathbf{x}_{i}\big\|^{2} \quad (86)$$

where $g_{\omega}(\cdot)$ is a node-level nonlinear prediction head.

COAD [93] applies graph SSL to expert linking, which aims at linking any external information of persons to experts in AMiner. Specifically, it first pre-trains the model by local-local contrastive learning with the triplet margin loss and then fine-tunes the model by adversarial learning to improve the model transferability.

Contrast-Reg [29] is a lightweight local-local contrastive regularization term that adopts the InfoNCE loss to contrast the representation similarities of semantically similar (positive) node pairs against those of negative pairs. Extensive theoretical analysis demonstrates that Contrast-Reg avoids high node representation norms and high variance among them, thereby improving generalization performance.

C-SWM [94] models structured environments, such as multi-object systems, as graphs and then utilizes a local-local contrastive approach to perform representation learning from environment interactions without supervision.

A.3 Local-Global Contrasting

High-order Deep Multiplex Infomax (HDMI) [30] is proposed to achieve higher-order mutual information maximization. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first applies an augmentation transformation $\mathcal{T}(\cdot)$ to obtain $\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X})$. Then a shared graph encoder $f_{\theta}(\cdot)$ is applied to obtain the node embedding matrices $\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})$ and $\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})$. Besides, a $\operatorname{READOUT}(\cdot)$ function is used to obtain the graph-level representation $\mathbf{h}_{g}=\operatorname{READOUT}(\mathbf{H})$. Finally, the learning objective is defined as follows

$$\max_{\theta,\lambda}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\left[\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{i}\right)+\mathcal{MI}\left(\mathbf{x}_{i},\mathbf{h}_{i}\right)+\mathcal{MI}\left(\mathbf{h}_{i},\mathbf{m}_{i}\right)\right] \quad (87)$$

where $\mathbf{m}_{i}=\sigma(\mathbf{W}_{m}[\sigma(\mathbf{W}_{h}\mathbf{x}_{i});\sigma(\mathbf{W}_{g}\mathbf{h}_{g})])$, and $\lambda=(\mathbf{W}_{m},\mathbf{W}_{h},\mathbf{W}_{g})$ are the weight parameters. The negative sample to contrast with $\mathbf{h}_{g}$ in the first term is $Neg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}_{i}$; the negative sample to contrast with $\mathbf{x}_{i}$ in the second term is $Neg(\mathbf{x}_{i})=\widetilde{\mathbf{h}}_{i}$; and the negative sample to contrast with $\mathbf{h}_{i}$ in the third term is $Neg(\mathbf{h}_{i})=\widetilde{\mathbf{m}}_{i}=\sigma(\mathbf{W}_{m}[\sigma(\mathbf{W}_{h}\widetilde{\mathbf{x}}_{i});\sigma(\mathbf{W}_{g}\mathbf{h}_{g})])$.

Deep Multiplex Graph Infomax (DMGI) [80] extends the idea of DGI to multiplex graphs, where nodes are connected by multiple types of relations. For each relation type $r\in\mathcal{R}$ (corresponding to a relation graph $g^{(r)}=(\mathbf{A}^{(r)},\mathbf{X})$), a relation-specific graph encoder $f_{\theta_{r}}(\cdot)$ is applied to obtain the relation-specific node embedding matrix $\mathbf{H}^{(r)}=f_{\theta_{r}}(\mathbf{A}^{(r)},\mathbf{X})$. Besides, it employs a $\operatorname{READOUT}$ function to obtain the graph-level representation $\mathbf{h}_{g^{(r)}}=\operatorname{READOUT}(\mathbf{H}^{(r)})$. Finally, it independently maximizes the mutual information between the node embeddings $\mathbf{H}^{(r)}=\left\{\mathbf{h}_{1}^{(r)},\mathbf{h}_{2}^{(r)},\ldots,\mathbf{h}_{N}^{(r)}\right\}$ and the graph-level summary $\mathbf{h}_{g^{(r)}}$ pertaining to each graph $g^{(r)}$, defined as

$$\max_{\theta_{r}}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{g^{(r)}},\mathbf{h}^{(r)}_{i}\right),\quad r\in\mathcal{R} \quad (88)$$

where the negative samples to contrast with $\mathbf{h}_{g^{(r)}}$ are $Neg(\mathbf{h}_{g^{(r)}})=\{\mathbf{h}_{j}^{(r)}\}_{v_{j}\in\mathcal{V},j\neq i}$. Similarly, it can learn another embedding matrix from the augmented graph $\widetilde{g}^{(r)}=(\widetilde{\mathbf{A}}^{(r)},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A}^{(r)},\mathbf{X})$ and also maximize the MI between the node embeddings $\widetilde{\mathbf{H}}^{(r)}=\left\{\widetilde{\mathbf{h}}_{1}^{(r)},\widetilde{\mathbf{h}}_{2}^{(r)},\ldots,\widetilde{\mathbf{h}}^{(r)}_{N}\right\}$ and the graph-level summary $\widetilde{\mathbf{h}}_{\widetilde{g}^{(r)}}$. However, as each $\mathbf{H}^{(r)}$ is trained independently for each $r\in\mathcal{R}$, these embedding matrices may fail to take advantage of the multiplexity of the network. Therefore, DMGI learns an additional consensus embedding matrix $\mathbf{Z}$ on which the relation-specific node embedding matrices $\mathbf{H}^{(r)}$ and $\widetilde{\mathbf{H}}^{(r)}$ can agree with each other by optimizing the following learning objective

$$\ell_{\mathrm{cs}}=\Big[\mathbf{Z}-\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\mathbf{H}^{(r)}\Big]^{2}-\Big[\mathbf{Z}-\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\widetilde{\mathbf{H}}^{(r)}\Big]^{2} \quad (89)$$

where $\mathbf{Z}$ is defined as a set of trainable parameters.

HDGI [31]. A meta-path of length $l$ is a sequence of nodes connected by heterogeneous edges, i.e., $\Phi:v_{1}\stackrel{t_{1}}{\longrightarrow}v_{2}\stackrel{t_{2}}{\longrightarrow}\ldots\stackrel{t_{l}}{\longrightarrow}v_{l}$, where $t_{l}\in\mathcal{T}^{e}$ denotes the type of the $l$-th edge in the meta-path. Given a meta-path $\Phi$, if there exists a meta-path $\Phi$ between node $v_{i}$ and node $v_{j}$, then $v_{i}$ and $v_{j}$ are defined as connected neighbors based on the meta-path $\Phi$, thus yielding a meta-path based adjacency matrix $\mathbf{A}^{\Phi}$. Given a meta-path set $\{\Phi_{k}\}_{k=1}^{K}$, we can obtain $K$ meta-path based adjacency matrices $\left\{\mathbf{A}^{\Phi_{k}}\right\}_{k=1}^{K}$. HDGI first applies an augmentation to obtain $\widetilde{\mathbf{X}},\{\widetilde{\mathbf{A}}^{\Phi_{k}}\}_{k=1}^{K}=\mathcal{T}(\mathbf{X},\{\mathbf{A}^{\Phi_{k}}\}_{k=1}^{K})$. Then $K$ graph encoders are applied to obtain the node embedding matrices $\mathbf{H}^{(k)}=f_{\theta_{k}}(\mathbf{A}^{\Phi_{k}},\mathbf{X})$ and $\widetilde{\mathbf{H}}^{(k)}=f_{\theta_{k}}(\widetilde{\mathbf{A}}^{\Phi_{k}},\widetilde{\mathbf{X}})$ $(1\leq k\leq K)$. To obtain more general node representations, an attention mechanism is applied to fuse these representations as $\mathbf{H}^{fuse}=\sum_{k=1}^{K}s_{k}\mathbf{H}^{(k)}$, where the attention scores $\{s_{k}\}_{k=1}^{K}$ are defined as follows

$$s_{k}=\frac{\exp(e^{(k)})}{\sum_{k^{\prime}=1}^{K}\exp(e^{(k^{\prime})})},\quad e^{(k)}=\frac{1}{N}\sum_{i=1}^{N}\tanh(\mathbf{q}^{T}[\mathbf{W}\mathbf{h}^{(k)}_{i}+\mathbf{b}]) \quad (90)$$

where $1\leq k\leq K$, $\mathbf{W}$ and $\mathbf{b}$ are the shared weight matrix and shared bias vector, and $\mathbf{q}$ is a shared attention vector. Similarly, it can obtain the fused representation $\widetilde{\mathbf{H}}^{fuse}=\sum_{k=1}^{K}\widetilde{s}_{k}\widetilde{\mathbf{H}}^{(k)}$. Besides, a $\operatorname{READOUT}(\cdot)$ function is applied to obtain the global-level representation $\mathbf{h}_{g}=\operatorname{READOUT}(\mathbf{H}^{fuse})$. The objective is defined as:

$$\max_{\theta_{1},\theta_{2},\cdots,\theta_{K},\gamma}\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}^{fuse}_{i}\right) \quad (91)$$

where $\gamma=(\mathbf{W},\mathbf{b},\mathbf{q})$ are the weight parameters, and the negative sample to contrast with $\mathbf{h}_{g}$ is $Neg(\mathbf{h}_{g})=\widetilde{\mathbf{h}}^{fuse}_{i}$.
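The attention-based fusion of Eq. (90) can be sketched as follows, with toy dimensions and random tensors standing in for the learned parameters $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{q}$.

```python
# Sketch of the meta-path attention fusion in Eq. (90): per-meta-path embeddings are scored
# with a shared attention vector and combined into H^{fuse}.
import torch

def fuse_metapath_embeddings(H_list, W, b, q):
    """H_list: list of K [N, d] node-embedding matrices; W: [d, d]; b: [d]; q: [d]."""
    scores = []
    for H in H_list:
        e_k = torch.tanh(H @ W + b) @ q              # q^T [W h_i^(k) + b] for every node i
        scores.append(e_k.mean())                    # e^(k): average over the N nodes
    s = torch.softmax(torch.stack(scores), dim=0)    # s_k: softmax over the K meta-paths
    return sum(s[k] * H_list[k] for k in range(len(H_list)))

N, d, K = 6, 8, 3
H_list = [torch.randn(N, d) for _ in range(K)]       # hypothetical per-meta-path embeddings
W, b, q = torch.randn(d, d), torch.randn(d), torch.randn(d)
H_fuse = fuse_metapath_embeddings(H_list, W, b, q)   # fused [N, d] representation
```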

DITNet [95] proposes an end-to-end model to predict drug-target interactions on heterogeneous graphs. Specifically, it learns high-quality representations for downstream tasks by performing local-global and context-global contrasting.

A.4 Local-Context Contrasting

Context Prediction [11] is proposed to map nodes appearing in similar structural contexts to similar embeddings. It first picks an anchor node set $\mathcal{S}$ from $\mathcal{V}$. Given an anchor node $v_{i}\in\mathcal{S}$, it defines its $K$-hop neighborhood graph $g_{i}^{(1)}=(\mathbf{A}_{i}^{(1)},\mathbf{X}_{i}^{(1)})$ as all nodes and edges that are at most $K$ hops away from $v_{i}$. The context graph is a subgraph $g_{i}^{(2)}=(\mathbf{A}_{i}^{(2)},\mathbf{X}_{i}^{(2)})$ that lies between $r_{1}$ hops (with $r_{1}<K$) and $r_{2}$ hops away from $v_{i}$ (i.e., it is a ring of width $r_{2}-r_{1}$). Two graph encoders $f_{\theta_{1}}(\cdot)$ and $f_{\theta_{2}}(\cdot)$ are then applied to encode the two graphs into node embedding matrices $\mathbf{H}_{i}^{(1)}=f_{\theta_{1}}(\mathbf{A}_{i}^{(1)},\mathbf{X}_{i}^{(1)})$ and $\mathbf{H}_{i}^{(2)}=f_{\theta_{2}}(\mathbf{A}_{i}^{(2)},\mathbf{X}_{i}^{(2)})$. Besides, a $\operatorname{READOUT}(\cdot)$ function is further applied to obtain the subgraph-level representation, as follows

$$\mathbf{h}^{(g)}_{i}=\frac{1}{|\mathcal{M}_{i}|}\sum_{v_{j}\in\mathcal{M}_{i}}(\mathbf{H}_{i}^{(2)})_{v_{j}} \quad (92)$$

where $\mathcal{M}_{i}$ is the set of nodes shared between the neighborhood graph $g_{i}^{(1)}$ and the context graph $g_{i}^{(2)}$, referred to as context anchor nodes. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2}}\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\mathcal{MI}\left((\mathbf{H}^{(1)}_{i})_{v_{i}},\mathbf{h}_{i}^{(g)}\right) \quad (93)$$

where $(\mathbf{H}^{(1)}_{i})_{v_{i}}$ is the representation of the anchor node $v_{i}$ in the node embedding matrix $\mathbf{H}^{(1)}_{i}$. The negative samples to contrast with $(\mathbf{H}^{(1)}_{i})_{v_{i}}$ are $Neg((\mathbf{H}^{(1)}_{i})_{v_{i}})=\{\mathbf{h}_{j}^{(g)}\}_{v_{j}\in\mathcal{S},j\neq i}$.

GraphLoG [96] is a unified graph SSL framework. In addition to local-local contrasting (similar to GRACE [26]) and global-global contrasting (similar to GraphCL [25]), GraphLoG leverages RPCL [97]-based clustering, e.g., $K$-means clustering, to learn hierarchical prototypes of graph data and then performs local-context contrasting (similar to GIC [48]) for each hierarchical layer.

MHCN [98] adopts a DGI-like auxiliary task to enhance social recommendation by leveraging high-order user relations. However, considering that the DGI-like local-global contrasting [9] stays at a coarse level and there is no guarantee that the encoder can distill sufficient information from the input data, MHCN extends the DGI to a fine-grained local-context contrasting by exploiting the hierarchical structure in the input hypergraphs.

EGI [36] considers transfer learning on graphs, i.e., the pre-training of GNNs. Specifically, unlike DGI, which models local-global mutual information, EGI samples a set of ego-subgraphs and then directly maximizes the local-context MI between the structural input and output of the GNN, with a particular focus on structural information.

A.5 Context-Global Contrasting

SUGAR [84] is a novel hierarchical subgraph-level framework that encourages subgraph embeddings to be mindful of global structural properties by maximizing their mutual information. For a given $g=(\mathbf{A},\mathbf{X})\in\mathcal{G}$, it first applies an augmentation transformation to obtain $\widetilde{g}=(\widetilde{\mathbf{A}},\widetilde{\mathbf{X}})=\mathcal{T}(\mathbf{A},\mathbf{X})$. Then it samples subgraphs from each graph and collects them into a subgraph candidate pool. Besides, a shared graph encoder $f_{\theta}(\cdot)$ and a READOUT function are applied to encode these subgraphs into subgraph-level representations $\{\mathbf{h}_{1},\mathbf{h}_{2},\cdots\}$ and $\{\widetilde{\mathbf{h}}_{1},\widetilde{\mathbf{h}}_{2},\cdots\}$. Next, $K$ striking subgraphs are selected from the candidate pool by a reinforcement learning module and pooled into a sketched graph, where each node $v_{k}$ $(1\leq k\leq K)$ corresponds to a subgraph with embedding $\mathbf{h}_{k}$ for graph $g$ ($\widetilde{\mathbf{h}}_{k}$ for graph $\widetilde{g}$). Finally, the learning objective is defined as follows

$$\max_{\theta}\frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\sum_{k=1}^{K}\mathcal{MI}\left(\mathbf{h}_{g},\mathbf{h}_{k}\right) \quad (94)$$

where $\mathbf{h}_{g}=\text{READOUT}(\{\mathbf{h}_{k}\}_{k=1}^{K})$ and the negative samples to contrast with $\mathbf{h}_{g}$ are $Neg(\mathbf{h}_{g})=\{\widetilde{\mathbf{h}}_{j}\}_{j=1}^{K}$.

Head-Tail Contrastive (HTC) [99] is proposed to enhance the graph-level representations learned by GNNs. Given a graph $g_{i}=(\mathbf{A}_{i},\mathbf{X}_{i})\in\mathcal{G}$, it first applies an augmentation transformation $\mathcal{T}(\cdot)$ to obtain $\widetilde{g}_{i}=(\widetilde{\mathbf{A}}_{i},\widetilde{\mathbf{X}}_{i})=\mathcal{T}(\mathbf{A}_{i},\mathbf{X}_{i})$. Then a shared graph encoder $f_{\theta_{1}}(\cdot)$ is applied to $g_{i}$ and $\widetilde{g}_{i}$ to obtain the node embedding matrices $\mathbf{H}_{i}$ and $\widetilde{\mathbf{H}}_{i}$, respectively. Besides, a $\text{READOUT}(\cdot)$ function is used to obtain the graph-level representations $\mathbf{h}_{g_{i}}$ and $\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}$. Moreover, HTC samples $S$ subsets $\{\mathbf{H}_{i}^{(1)},\mathbf{H}_{i}^{(2)},\cdots,\mathbf{H}_{i}^{(S)}\}$ from $\mathbf{H}_{i}$, which correspond to the subgraphs $\{g_{i}^{(1)},g_{i}^{(2)},\cdots,g_{i}^{(S)}\}$. Then HTC calculates a subgraph-level representation by

$$\overline{\mathbf{h}}_{g_{i}}=f_{\theta_{2}}\Big(\big[\mathbf{h}_{g_{i}^{(1)}};\mathbf{h}_{g_{i}^{(2)}};\cdots;\mathbf{h}_{g_{i}^{(S)}}\big]\Big) \quad (95)$$

where $\mathbf{h}_{g_{i}^{(s)}}=\operatorname{READOUT}(\mathbf{H}_{i}^{(s)})$ $(1\leq s\leq S)$ and $f_{\theta_{2}}(\cdot)$ is a convolution kernel of size $(S,1)$. Finally, the learning objective is defined as follows

$$\max_{\theta_{1},\theta_{2}}\frac{1}{|\mathcal{G}|}\sum_{g_{i}\in\mathcal{G}}\mathcal{MI}\left(\mathbf{h}_{g_{i}},\overline{\mathbf{h}}_{g_{i}}\right) \quad (96)$$

where the negative samples to contrast with $\mathbf{h}_{g_{i}}$ are $Neg(\mathbf{h}_{g_{i}})=\{\mathbf{h}_{g_{j}}\}_{g_{j}\in\mathcal{G},j\neq i}\cup\{\widetilde{\mathbf{h}}_{\widetilde{g}_{i}}\}$.

B. Generative Methods

In this section, we continue with additional generative methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works.

B.1 Graph Autoencoding

Node Attribute Masking [10]. This task is similar to Graph Completion [12], but it masks and reconstructs the features of multiple nodes simultaneously, and it no longer requires that the neighboring node features used for message passing be unmasked. It first randomly masks (i.e., sets equal to zero) the features of a node set $\mathcal{S}$. Specifically, it obtains a masked node feature matrix $\widehat{\mathbf{X}}$ where $\widehat{\mathbf{x}}_{i}=\mathbf{x}_{i}\odot\mathbf{m}_{i}$ for $v_{i}\in\mathcal{S}$, and then asks the model to reconstruct these masked features. More formally,

$$\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\widehat{\mathbf{X}}\right)=\frac{1}{|\mathcal{S}|}\sum_{v_{i}\in\mathcal{S}}\left\|f_{\theta}(\mathbf{A},\widehat{\mathbf{X}})_{v_{i}}-\mathbf{x}_{i}\right\|^{2} \quad (97)$$
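A minimal sketch of this pretext task is given below; the two-layer mean-aggregation encoder is a toy stand-in for $f_{\theta}$, and only the masked rows enter the reconstruction loss of Eq. (97).

```python
# Sketch of the node-attribute-masking pretext task (Eq. 97): mask the features of a random
# node subset S, encode the masked graph, and regress the original features.
import torch
import torch.nn as nn

def gcn_layer(adj, x, weight):
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return (adj @ x) / deg @ weight              # simple mean-aggregation message passing

N, d = 6, 8
adj = (torch.rand(N, N) > 0.5).float(); adj = ((adj + adj.t()) > 0).float()  # toy symmetric graph
x = torch.randn(N, d)

mask_idx = torch.randperm(N)[: N // 3]           # the masked node set S
x_masked = x.clone(); x_masked[mask_idx] = 0.0   # x_hat_i = x_i masked to zero

W1, W2 = nn.Parameter(torch.randn(d, 16)), nn.Parameter(torch.randn(16, d))
h = torch.relu(gcn_layer(adj, x_masked, W1))     # toy stand-in for f_theta
x_rec = gcn_layer(adj, h, W2)

loss = ((x_rec[mask_idx] - x[mask_idx]) ** 2).sum(dim=1).mean()   # Eq. (97)
loss.backward()
```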

G-BERT [88] combines the power of GNNs and BERT to perform representation learning for medication recommendation. Specifically, G-BERT uses two BERT-like self-supervised pretext tasks: self-prediction and dual-prediction. Self-prediction takes the learned embedding as input to reconstruct the masked medical codes of the same type, while dual-prediction takes one type of embedding as input and tries to reconstruct the other type.

SLAPS [100] is a graph learning framework that takes node attribute reconstruction as the self-supervised pretext task to infer a task-specific latent structure and then applies a graph neural network to the inferred graph structure.

To tackle the cold-start problem for recommendation tasks, Pretrain-Recsys [82] pre-trains the model by simulating the cold-start scenarios. Specifically, it applies an attention-based meta aggregator to reduce the impact from the cold-start neighbors and takes the target embedding reconstruction as the self-supervised pretext task.

Graph-Bert [15] is a novel graph neural network that pre-trains the model with the self-supervised node attribute reconstruction and structure recovery tasks, and then transfers the model to other downstream tasks directly or with necessary fine-tuning if any supervised label information is available.

C. Predictive Methods

In this section, we continue with additional predictive methods for graph SSL; to avoid redundancy, we do not provide detailed mathematical formulations for some relatively less central works.

C.1 Context-based Prediction (CP)

PairwiseDistance [10]. The PairwiseDistance pretext task aims to guide the model to preserve global topology information by predicting the shortest path length between node pairs. More specifically, it first randomly samples a certain number of node pairs $\mathcal{S}$ from all node pairs $\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in\mathcal{V}\}$ and calculates the pairwise shortest path length $d_{i,j}=d(v_{i},v_{j})$ for each node pair $(v_{i},v_{j})\in\mathcal{S}$. Furthermore, it groups the shortest path lengths into four categories: $C_{i,j}=0$, $C_{i,j}=1$, $C_{i,j}=2$, and $C_{i,j}=3$, corresponding to $d_{i,j}=1$, $d_{i,j}=2$, $d_{i,j}=3$, and $d_{i,j}\geq 4$, respectively. The learning objective can then be formulated as a multi-class classification problem, as follows

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=\frac{1}{|\mathcal{S}|}\sum_{(v_{i},v_{j})\in\mathcal{S}}\ell\Big(f_{\omega}\big(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|\big),C_{i,j}\Big) \quad (98)$$

where $\ell(\cdot)$ denotes the cross-entropy loss and $f_{\omega}(\cdot)$ linearly maps the input to the prediction over the four distance classes. Compared with $\text{S}^{2}\text{GRL}$ [59], PairwiseDistance truncates shortest paths longer than 4, mainly to avoid excessive computational burden and to prevent very noisy ultra-long pairwise distances from dominating the optimization.
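The sketch below illustrates the PairwiseDistance pipeline on a small synthetic graph (using networkx for shortest paths); the random embeddings and the linear classifier are toy stand-ins for $f_{\theta}$ and $f_{\omega}$.

```python
# Sketch of PairwiseDistance: bucket shortest-path lengths between sampled node pairs into
# four classes and train a classifier on |h_i - h_j| (toy embeddings and classifier).
import networkx as nx
import torch
import torch.nn as nn

g = nx.connected_watts_strogatz_graph(20, 4, 0.3, seed=0)  # hypothetical connected graph
pairs = [(0, 5), (1, 7), (2, 9), (3, 11)]                  # sampled node pairs S

def distance_class(u, v):
    d = nx.shortest_path_length(g, u, v)
    return min(d, 4) - 1                                   # d=1,2,3 -> class 0,1,2; d>=4 -> class 3

labels = torch.tensor([distance_class(u, v) for (u, v) in pairs])
emb = torch.randn(20, 16, requires_grad=True)              # stands in for f_theta(A, X)
clf = nn.Linear(16, 4)                                     # f_omega: |h_i - h_j| -> 4 class logits

feats = torch.stack([(emb[u] - emb[v]).abs() for (u, v) in pairs])
loss = nn.functional.cross_entropy(clf(feats), labels)     # Eq. (98)
loss.backward()
```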

EdgeMask [10]. Apart from masking edge features, we can also mask edge connections; instead of reconstructing the entire adjacency matrix directly, EdgeMask takes link prediction as the pretext task. More specifically, it first masks $m$ edges $\mathcal{M}\subset\mathcal{E}$ and samples $m$ node pairs $\overline{\mathcal{M}}\subset\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in\mathcal{V}\ \text{and}\ (v_{i},v_{j})\notin\mathcal{E}\}$. Then, the learning objective of EdgeMask is to predict whether there exists a link between a given node pair. More formally,

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=\frac{1}{2m}\Big(\sum_{(v_{i},v_{j})\in\mathcal{M}}\ell\big(f_{\omega}(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|),1\big)+\sum_{(v_{i},v_{j})\in\overline{\mathcal{M}}}\ell\big(f_{\omega}(|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-f_{\theta}(\mathbf{A},\mathbf{X})_{v_{j}}|),0\big)\Big) \quad (99)$$

where $\ell(\cdot)$ denotes the cross-entropy loss and $f_{\omega}(\cdot)$ linearly maps the input to a 1-dimensional value. The EdgeMask task helps the GNN learn more local structural information.
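A corresponding sketch of EdgeMask, with toy embeddings standing in for $f_{\theta}(\mathbf{A},\mathbf{X})$ on the edge-masked graph, is given below.

```python
# Sketch of the EdgeMask pretext task (Eq. 99): hide m existing edges, sample m non-edges,
# and classify whether a link exists from |h_i - h_j|.
import torch
import torch.nn as nn

masked_edges = [(0, 1), (2, 3)]                  # M: edges removed from the graph
non_edges = [(0, 3), (1, 4)]                     # M_bar: sampled disconnected pairs

emb = torch.randn(5, 16, requires_grad=True)     # stands in for f_theta(A, X) on the masked graph
clf = nn.Linear(16, 1)                           # f_omega: |h_i - h_j| -> link logit

def pair_logits(pairs):
    return clf(torch.stack([(emb[i] - emb[j]).abs() for i, j in pairs])).squeeze(-1)

logits = torch.cat([pair_logits(masked_edges), pair_logits(non_edges)])
targets = torch.cat([torch.ones(len(masked_edges)), torch.zeros(len(non_edges))])
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
```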

TopoTER [101] aims to maximize the mutual information between topology transformations and node representations before and after the transformations, which can be relaxed to minimizing the cross entropy between the applied topology transformation types and their estimations from node representations. Given a graph $g=(\mathbf{A},\mathbf{X})$, it first randomly samples a subset $\mathcal{S}_{1}$ of $M$ connected edges and a subset $\mathcal{S}_{2}$ of $M$ disconnected node pairs. Then it randomly removes $r\cdot M$ edges from $\mathcal{S}_{1}$ and adds $r\cdot M$ edges to $\mathcal{S}_{2}$, where $r$ is the edge perturbation rate, to obtain a topology-perturbed graph $\widetilde{g}=(\widetilde{\mathbf{A}},\mathbf{X})$. Next, it applies a shared graph encoder $f_{\theta}(\cdot)$ to obtain the node embedding matrices $\mathbf{H}=f_{\theta}(\mathbf{A},\mathbf{X})$ and $\widetilde{\mathbf{H}}=f_{\theta}(\widetilde{\mathbf{A}},\mathbf{X})$. Finally, a mapping function $g_{\omega}(\cdot)$ is applied to the difference $\Delta\mathbf{H}=\mathbf{H}-\widetilde{\mathbf{H}}$ to predict the transformation type $\widehat{\mathbf{Y}}_{i,j}=g_{\omega}(\Delta\mathbf{H}_{i},\Delta\mathbf{H}_{j})\in\mathbb{R}^{4}$ for any $e_{i,j}\in\mathcal{S}_{1}\bigcup\mathcal{S}_{2}$. The learning objective of TopoTER is then defined as

$$\mathcal{L}_{ssl}\left(\theta,\omega,\mathbf{A},\mathbf{X}\right)=-\sum_{e_{i,j}\in\mathcal{S}_{1}\bigcup\mathcal{S}_{2}}\sum_{c=0}^{3}\mathbf{Y}^{(c)}_{i,j}\log(\widehat{\mathbf{Y}}^{(c)}_{i,j}) \quad (100)$$

where $c$ indexes the four transformation types: $c=0$, add an edge to a disconnected node pair in $\mathcal{S}_{2}$; $c=1$, remove an edge from a connected node pair in $\mathcal{S}_{1}$; $c=2$, keep the connection between node pairs in $\mathcal{S}_{1}$; $c=3$, keep the disconnection between node pairs in $\mathcal{S}_{2}$. $\mathbf{Y}^{(c)}_{i,j}$ is the ground-truth binary indicator (0 or 1) for each transformation type.

Centrality Score Ranking [17]. Node centrality is an important metric for graphs, which measures the importance of nodes based on their structural roles in the whole graph. However, unlike node properties, centrality scores are not comparable across graphs of different scales. Therefore, this task instead ranks the relative orders between nodes and uses them as pseudo-labels. Specifically, four different centrality scores $\mathcal{S}=\{$eigencentrality, betweenness, closeness, subgraph centrality$\}$ are used. For a given centrality score $s\in\mathcal{S}$ and a node pair $(v_{i},v_{j})$ with relative order $\mathbf{Y}_{i,j}^{s}=\mathbb{I}(s_{i}>s_{j})$, a mapping function $f_{\theta^{(s)}}(\cdot)$ is applied to estimate the rank scores $\mathbf{r}^{s}=f_{\theta^{(s)}}(\mathbf{A},\mathbf{X})$, where $\mathbf{r}^{s}_{i}$ denotes the rank score of node $v_{i}$. The probability of the estimated rank order is defined as

$$\widehat{\mathbf{Y}}_{i,j}^{s}=\frac{\exp(\mathbf{r}^{s}_{i}-\mathbf{r}^{s}_{j})}{1+\exp(\mathbf{r}^{s}_{i}-\mathbf{r}^{s}_{j})} \quad (101)$$

The learning objective of this pretext task is defined as

$$\mathcal{L}_{ssl}\left(\{\theta^{(s)}\}_{s=1}^{4},\mathbf{A},\mathbf{X}\right)=-\frac{1}{4N^{2}}\sum_{s\in\mathcal{S}}\sum_{v_{i},v_{j}\in\mathcal{V}}\Big[\mathbf{Y}^{s}_{i,j}\log(\widehat{\mathbf{Y}}^{s}_{i,j})+(1-\mathbf{Y}^{s}_{i,j})\log(1-\widehat{\mathbf{Y}}^{s}_{i,j})\Big] \quad (102)$$
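The sketch below illustrates the ranking objective of Eqs. (101)-(102) for a single centrality score on the Karate Club graph; the random embeddings and the linear rank head are toy stand-ins for $f_{\theta^{(s)}}$.

```python
# Sketch of the centrality-ranking pretext task (Eqs. 101-102) for one centrality score:
# predict the relative order of node pairs from estimated rank scores.
import networkx as nx
import torch
import torch.nn as nn

g = nx.karate_club_graph()
scores = nx.betweenness_centrality(g)            # one of the four centrality measures
N = g.number_of_nodes()

emb = torch.randn(N, 16, requires_grad=True)     # stands in for intermediate node representations
rank_head = nn.Linear(16, 1)                     # produces the rank score r^s_i
r = rank_head(emb).squeeze(-1)

pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
y = torch.tensor([1.0 if scores[i] > scores[j] else 0.0 for i, j in pairs])
p = torch.sigmoid(torch.stack([r[i] - r[j] for i, j in pairs]))   # Eq. (101)
loss = nn.functional.binary_cross_entropy(p, y)  # Eq. (102), restricted to one centrality score
loss.backward()
```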

ContextLabel [10]. Different from Distance2Labeled, which uses relative position as the self-supervised signal, the ContextLabel pretext task constructs a local label distribution for each node and then asks the model to regress these distributions. It first assigns labels to unlabeled nodes with the Label Propagation (LP) algorithm [102] and then defines the label distribution $\mathbf{y}_{i}$ for node $v_{i}$ within its $k$-hop neighborhood, where the $c$-th element of the label distribution vector, $\mathbf{y}_{i,c}$, is defined as

$$\mathbf{y}_{i,c}=\frac{\left|\mathcal{N}_{i}^{\mathcal{V}_{L}}(c)\right|+\left|\mathcal{N}_{i}^{\mathcal{V}_{U}}(c)\right|}{\left|\mathcal{N}_{i}^{\mathcal{V}_{L}}\right|+\left|\mathcal{N}_{i}^{\mathcal{V}_{U}}\right|},\quad c=1,2,\cdots,C \quad (103)$$

where $\mathcal{N}_{i}^{\mathcal{V}_{L}}$ and $\mathcal{N}_{i}^{\mathcal{V}_{U}}$ are the labeled and unlabeled nodes within the $k$-hop neighborhood of node $v_{i}$, respectively. $\mathcal{N}_{i}^{\mathcal{V}_{L}}(c)$ denotes the subset of the labeled neighborhood with ground-truth label $c$, and $\mathcal{N}_{i}^{\mathcal{V}_{U}}(c)$ denotes the subset of the unlabeled neighborhood assigned label $c$ by the Label Propagation algorithm. The learning objective is defined as

$$\mathcal{L}_{ssl}\left(\theta,\mathbf{A},\mathbf{X}\right)=\frac{1}{N}\sum_{v_{i}\in\mathcal{V}}\Big\|f_{\theta}(\mathbf{A},\mathbf{X})_{v_{i}}-\mathbf{y}_{i}\Big\|^{2} \quad (104)$$

In addition to Label Propagation, there are numerous algorithms that can be used to compute pseudo-labels, such as the Iterative Classification Algorithm (ICA) [103], and even a combination of LP and ICA for better performance.
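The construction of the context label distribution in Eq. (103) can be sketched as follows; the random labels stand in for the ground-truth labels and the pseudo-labels that Label Propagation would produce.

```python
# Sketch of the ContextLabel distribution (Eq. 103): for each node, build the label
# distribution over its k-hop neighborhood (pseudo-labels are random placeholders, not LP).
import networkx as nx
import numpy as np

g = nx.karate_club_graph()
N, C, k = g.number_of_nodes(), 3, 2
rng = np.random.default_rng(0)
labels = rng.integers(0, C, size=N)              # stands in for ground-truth + propagated labels

def context_label_distribution(node):
    hop_dict = nx.single_source_shortest_path_length(g, node, cutoff=k)
    neigh = [v for v, d in hop_dict.items() if d > 0]        # k-hop neighborhood of the node
    counts = np.bincount(labels[neigh], minlength=C).astype(float)
    return counts / max(len(neigh), 1)                       # y_{i,c} in Eq. (103)

targets = np.stack([context_label_distribution(v) for v in g.nodes()])  # regression targets
```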

Motivated by the observation that anomalous nodes differ from normal nodes in both structure and attributes, the Hop-Count based Model (HCM) [106] proposes to use global context prediction as a self-supervised task to model both local and global contextual information, thereby achieving better anomaly detection on attributed networks.

C.2 Self-Training (ST)

SEF [104] presents a framework that first trains a graph attention network model with pseudo labels obtained from unsupervised Louvain clustering [72] and then uses the learned edge attention coefficients as self-supervised edge features. Besides, it encodes the learned edge features via a Set Transformer and combines them with node features for node classification in an end-to-end training manner.

C.3 Domain Knowledge-based Prediction (DK)

DrRepair [105] considers the problem of learning to repair programs from diagnostic feedback by building a program feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback. Besides, it leverages unlabeled programs available online to create a large amount of extra program repair examples, which are used to pre-train a graph neural network for program repair.

D. Summary of Surveyed Works

A summary of all the surveyed works (a total of 71 methods) is presented in Table. A1 and Table. A2, including graph property, pretext task type, augmentation strategy, objective function, training strategy, and the year of publication.

E. Summary of Implementation Details

The implementation details of all the surveyed works are presented in Table. A3 and Table. A4, such as the (node/link/graph) level of downstream tasks, evaluation metrics with specific tasks, and the commonly used datasets.

F. Summary of Common Datasets

A summary and statistics of common graph datasets is shown in Table. A5, including category, graph number, node number per graph, edge number per graph, dimensionality of node attributes, class number and citation papers.

G. Summary of Open-source Codes

A summary of the open-source codes of all the surveyed works is presented in Table. A6, where we provide hyperlinks to their open-source codes, and those for which no open-source code is found are indicated by “N.A.”. Moreover, we have created a GitHub repository https://github.com/LirongWu/awesome-graph-self-supervised-learning to summarize the latest advances in graph SSL, which will be updated in real-time as more papers and their codes become available.

TABLE A1: A summary of the surveyed papers. The summary is organized based on the underlying motivation behind pretext task design.
Methods Graph Property Pretext Task Data Augmentation Objective Function Training Strategy Year
Graph Completion [12] Attributed Generative/AE Attribute Masking MAE P&F/JL 2020
Node Attribute
Masking [10]
Attributed Generative/AE Attribute Masking MAE P&F/JL 2020
Edge Attribute
Masking [11]
Attributed Generative/AE Attribute Masking MAE P&F 2019
Node Attribute
Denoising [13]
Attributed Generative/AE Attribute Masking MAE JL 2020
Adjacency Matrix
Reconstruction [14]
Attributed Generative/AE
Attribute Masking
Edge Perturbation
MAE JL 2020
Graph Bert [15] Attributed Generative/AE
Attribute Masking
Edge Perturbation
MAE P&F 2020
Pretrain-Recsys [82] Attributed Generative/AE Edge Perturbation MAE P&F 2021
GPT-GNN [32] Heterogeneous Generative/AR
Attribute Masking
Edge Perturbation
MAE/InfoNCE P&F 2020
GraphCL [25] Attributed Contrastive/G-G
Attribute Masking
Edge Perturbation
Random Walk Sampling
InfoNCE URL 2020
IGSD [33] Attributed Contrastive/G-G
Edge Perturbation
Edge Diffusion
InfoNCE JL/URL 2020
DACL [91] Attributed Contrastive/G-G Mixup InfoNCE URL 2020
LCC [45] Attributed Contrastive/G-G None InfoNCE JL 2021
CSSL [34] Attributed Contrastive/G-G
Node Insertion
Edge Perturbation
Uniform Sampling
InfoNCE P&F/JL/URL 2020
GCC [38] Unattributed Contrastive/C-C Random Walk Sampling InfoNCE P&F/URL 2020
GRACE [26] Attributed Contrastive/L-L
Attribute Masking
Edge Perturbation
InfoNCE URL 2020
GCA [40] Attributed Contrastive/L-L Attention-based InfoNCE URL 2020
GROC [43] Attributed Contrastive/L-L Gradient-based InfoNCE URL 2021
SEPT [46] Attributed Contrastive/L-L Edge Perturbation InfoNCE JL 2021
STDGI [28] Spatial-Temporal Contrastive/L-L Attribute Shuffling JS Estimator URL 2019
GMI [16] Attributed Contrastive/L-L None SP Estimator URL 2020
KS2L [83] Attributed Contrastive/L-L None InfoNCE URL 2020
$\text{CG}^{3}$ [92] Attributed Contrastive/L-L None InfoNCE JL 2020
BGRL [27] Attributed Contrastive/L-L
Attribute Masking
Edge Perturbation
Inner Product URL 2021
TABLE A2: A summary of the surveyed papers. The summary is organized based on the underlying motivation behind pretext task design.
Methods Graph Property Pretext Task Data Augmentation Objective Function Training Strategy Year
SelfGNN [35] Attributed Contrastive/L-L
Attribute Masking
Edge Diffusion
MSE URL 2021
HeCo [47] Heterogeneous Contrastive/L-L None InfoNCE URL 2021
PT-DGNN [89] Dynamic Contrastive/L-L
Attribute Masking
Edge Perturbation
InfoNCE P&F 2021
COAD [93] Attributed Contrastive/L-L None Triplet Margin Loss P&F 2020
Contrast-Reg [29] Attributed Contrastive/L-L Attribute Shuffling InfoNCE JL 2021
DGI [9] Attributed Contrastive/L-G Arbitrary JS Estimator URL 2019
HDMI [30] Attributed Contrastive/L-G Attribute Shuffling JS Estimator URL 2021
DMGI [80] Heterogeneous Contrastive/L-G Attribute Shuffling
JS Estimator
MAE
URL 2020
MVGRL [18] Attributed Contrastive/L-G
Attribute Masking
Edge Perturbation
Edge Diffusion
Random Walk Sampling
DV Estimator
JS Estimator
NT-Xent
InfoNCE
URL 2020
HDGI [31] Heterogeneous Contrastive/L-G Attribute Shuffling JS Estimator URL 2019
Subg-Con [19] Attributed Contrastive/L-C Importance Sampling Triplet Margin Loss URL 2020
Context Prediction [11] Attributed Contrastive/L-C Ego-nets Sampling Cross Entropy P&F 2019
GIC [48] Attributed Contrastive/L-C Arbitrary JS Estimator URL 2020
GraphLoG [96] Attributed
Contrastive/L-L
Contrastive/G-G
Contrastive/L-C
Attribute Masking InfoNCE URL 2021
MHCN [98] Heterogeneous Contrastive/L-C Attribute Shuffling InfoNCE JL 2021
EGI [36] Attributed Contrastive/L-C Ego-nets Sampling SP Estimator P&F 2020
MICRO-Graph [39] Attributed Contrastive/C-G Knowledge Sampling InfoNCE URL 2020
InfoGraph [49] Attributed Contrastive/C-G None SP Estimator URL 2019
SUGAR [84] Attributed Contrastive/C-G BFS Sampling JS Estimator JL 2021
BiGI [37] Heterogeneous Contrastive/C-G
Edge Perturbation
Ego-nets Sampling
JS Estimator JL 2021
HTC [99] Attributed Contrastive/C-G Attribute Shuffling
SP Estimator
DV Estimator
URL 2021
Node Property
Prediction [10]
Attributed Predictive/NP None MAE P&F/JL 2020
$\text{S}^{2}\text{GRL}$ [59]
Attributed Predictive/CP None Cross Entropy URL 2020
PairwiseDistance [10] Attributed Predictive/CP None Cross Entropy P&F/JL 2020
PairwiseAttrSim [10] Attributed Predictive/CP None MAE P&F/JL 2020
Distance2Cluster [10] Attributed Predictive/CP None MAE P&F/JL 2020
EdgeMask [10] Attributed Predictive/CP None Cross Entropy P&F/JL 2020
TopoTER [101] Attributed Predictive/CP Edge Perturbation Cross Entropy URL 2021
Centrality Score
Ranking [17]
Attributed Predictive/CP None Cross Entropy P&F 2019
Meta-path Prediction [69] Heterogeneous Predictive/CP None Cross Entropy JL 2020
SLiCE [70] Heterogeneous Predictive/CP None Cross Entropy P&F 2020
Distance2Labeled [10] Attributed Predictive/CP None MAE P&F/JL 2020
ContextLabel [10] Attributed Predictive/CP None MAE P&F/JL 2020
HCM [106] Attributed Predictive/CP Edge Perturbation Bayesian Inference URL 2021
Contextual Molecular
Property Prediction [76]
Attributed Predictive/DK None Cross Entropy P&F 2020
Graph-level
Motif Prediction [76]
Attributed Predictive/DK None Cross Entropy P&F 2020
Multi-stage
Self-training [71]
Attributed Predictive/ST None None JL 2018
Node Clustering [12] Attributed Predictive/ST None Clustering P&F/JL 2020
Graph Partitioning [12] Attributed Predictive/ST None Graph Partitioning P&F/JL 2020
CAGNN [74] Attributed Predictive/ST None Clustering URL 2020
M3S [75] Attributed Predictive/ST None Clustering JL 2020
Cluster Preserving [17] Attributed Predictive/ST None Cross Entropy P&F 2019

H. Experimental Study Results

Two downstream tasks, namely node classification and graph classification, and one evaluation metric, namely accuracy, are selected to conduct the experimental study. For node classification, we provide the classification performance of 23 classical methods on 8 commonly used datasets; for graph classification, we select 7 datasets and report the comparison results of 11 representative methods. The results of node classification and graph classification are shown in Table. A7 and Table. A8, respectively, and the best result on each dataset is marked in bold. The metrics are taken directly from the original papers or obtained by retraining with the open-source code. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash "-". Moreover, fixed dataset splits are adopted for all methods to make a fair comparison: (1) the data splits of Cora, Citeseer, and Pubmed are consistent with those in [22]; (2) the data splits of Wiki-CS, Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics are consistent with those in [27]; (3) the data splits of MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, and IMDB-M are consistent with those in [49]; and (4) the data splits of NCI1 are consistent with those in [25].

After analyzing the results in Table A7 and Table A8, we have the following observations: (1) No single method achieves the best performance on all datasets, and the state-of-the-art method on one dataset may be suboptimal on another. This suggests that existing graph self-supervised algorithms are somewhat dependent on the properties of the graph and are not sufficiently general. (2) Algorithms that achieve excellent performance on small-scale datasets (e.g., Cora and Citeseer), such as Graph-Bert [15] and Subg-Con [19], may perform suboptimally on large-scale datasets (e.g., Coauthor-CS and Coauthor-Physics). This encourages follow-up researchers to evaluate model performance on datasets of different scales, as experimental results on small-scale datasets may be biased and therefore less reliable. (3) Most generative and predictive learning methods are specifically designed for node-level tasks such as node classification, yet they are still not comparable to the state-of-the-art contrastive methods. This suggests that the potential of generative and predictive learning needs to be further explored, and that domain knowledge-based learning may be a promising direction. (4) Partly due to the lack of comparison with contemporaneous work, many state-of-the-art methods achieve roughly comparable performance on some datasets, such as Amazon-Computers and Coauthor-Physics, with differences smaller than one percent. The purpose of this experimental study is to provide a solid foundation for follow-up researchers and to facilitate fairer experimental comparisons for graph self-supervised learning.
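
As a complement, the sketch below shows one common way graph-level accuracies of this kind are obtained in the literature: a downstream SVM is fit on frozen graph embeddings and scored with 10-fold cross-validation (an InfoGraph-style protocol). This is a hedged illustration under that assumption, not the protocol of any specific surveyed method; graph_embeddings and graph_labels are hypothetical placeholders.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def svm_cv_accuracy(graph_embeddings, graph_labels, n_splits=10, seed=0):
    # 10-fold stratified cross-validation of a linear SVM trained on frozen
    # graph embeddings; the mean fold accuracy is the kind of number reported
    # in graph classification tables such as Table A8.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = SVC(C=1.0, kernel="linear")
    scores = cross_val_score(clf, graph_embeddings, graph_labels, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()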

I. Timeline

A complete timeline is provided in Fig. A1 to present a clear development lineage of the various graph SSL methods, listing the publication dates (based on their arXiv submission dates) of key milestones. Besides, we provide inheritance connections between methods to show how they were developed. More importantly, we provide short descriptions of the contributions of some pioneering methods to highlight their importance.

TABLE A3: A summary of implementation details about the surveyed papers, including downstream task level, evaluation metric, and dataset.
Methods Task Level Evaluation Metric Dataset
Graph Completion [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Node Attribute Masking [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Reddit
Edge Attribute Masking [11] Graph Graph Classification (ROC-AUC) MUTAG, PTC, PPI, BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
Node Attribute Denoising [13] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Adjacency Matrix Reconstruction [14] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Graph Bert [15] Node Node Classification (Acc), Node Clustering (NMI) Cora, Citeseer, Pubmed
Pretrain-Recsys [82] Node/Link - ML-1M, MOOCs, Last-FM
GPT-GNN [32] Node/Link Node Classification (F1-score), Link Prediction (ROC-AUC) OAG, Amazon, Reddit
GraphCL [25] Graph Graph Classification (Acc, ROC-AUC) NCI1, PROTEINS, D&D, COLLAB, RDT-B, RDT-M5K, GITHUB, MNIST, CIFAR10, MUTAG, IMDB-B, BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE, PPI
IGSD [33] Graph Graph Classification (Acc) MUTAG, PTC, NCI1, IMDB-B, QM9, COLLAB, IMDB-M
DACL [91] Graph Graph Classification (Acc) MUTAG, PTC, IMDB-B, IMDB-M, RDT-B, RDT-M5K
LCC [45] Graph Graph Classification (Acc) IMDB-B, IMDB-M, COLLAB, MUTAG, PROTEINS, PTC, NCI1, D&D
CSSL [34] Graph Graph Classification (Acc) PROTEINS, D&D, NCI1, NCI109, Mutagenicity
GCC [38] Node/Graph Node Classification (Acc), Graph Classification (Acc) US-Airport, H-index, COLLAB, IMDB-B, IMDB-M, RDT-B, RDT-M5K
GRACE [26] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, DBLP, Reddit, PPI
GCA [40] Node Node Classification (Acc) Amazon-Computers, Amazon-Photo, Coauthor-Physics, Coauthor-CS, Wiki-CS
GROC [43] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Photo, Wiki-CS
SEPT [46] Node/Link - Last-FM, Douban, Yelp
STDGI [28] Node Node Regression (MAE, RMSE, MAPE) METR-LA
GMI [16] Node/Link Node Classification (Acc, Micro-F1), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Reddit, PPI, BlogCatalog, Flickr
KS2L [83] Node/Link Node Classification (Acc), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS
CG3 [92] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS
BGRL [27] Node Node Classification (Acc, Micro-F1) Wiki-CS, Amazon-Computers, Amazon-Photo, PPI, Coauthor-CS, Coauthor-Physics, ogbn-arxiv
SelfGNN [35] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics
HeCo [47] Node Node Classification (ROC-AUC, Micro-F1, Macro-F1), Node Clustering (NMI, ARI) ACM, DBLP, Freebase, AMiner
PT-DGNN [89] Link Link Prediction (ROC-AUC) HepPh, Math Overflow, Super User
COAD [93] Node/Link Node Clustering (Precision, Recall, F1-score), Link Prediction (HitRatio@K, MRR) AMiner, News, LinkedIn
Contrast-Reg [29] Node/Link Node Classification (Acc), Node Clustering (NMI, Acc, Macro-F1), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, Reddit, ogbn-arxiv, Wikipedia, ogbn-products, Amazon-Computers, Amazon-Photo
DGI [9] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, Reddit, PPI
HDMI [30] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (NMI) ACM, IMDB, DBLP, Amazon
DMGI [80] Node Node Clustering (NMI), Node Classification (Acc) ACM, IMDB, DBLP, Amazon
MVGRL [18] Node/Graph Node Classification (Acc), Node Clustering (NMI, ARI), Graph Classification (Acc) Cora, Citeseer, Pubmed, MUTAG, PTC, IMDB-B, IMDB-M, RDT-B
TABLE A4: A summary of implementation details about the surveyed papers, including downstream task level, evaluation metric, and dataset.
Methods Task Level Evaluation Metric Dataset
HDGI [31] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (NMI, ARI) ACM, DBLP, IMDB
Subg-Con [19] Node Node Classification (Acc, Micro-F1) Cora, Citeseer, Pubmed, PPI, Flickr, Reddit
Context Prediction [11] Graph Graph Classification (ROC-AUC) MUTAG, PTC, PPI, BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
GIC [48] Node/Link Node Classification (Acc), Node Clustering (Acc, NMI, ARI), Link Prediction (ROC-AUC, ROC-AP) Cora, Citeseer, Pubmed, Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics
GraphLoG [96] Graph Graph Classification (ROC-AUC) BBBP, Tox21, ToxCast, ClinTox, MUV, HIV, SIDER, BACE
MHCN [98] Node/Link - Last-FM, Douban, Yelp
EGI [36] Node/Link Node Classification (Acc), Link Prediction (ROC-AUC, MRR) YAGO, Airport
MICRO-Graph [39] Graph Graph Classification (ROC-AUC) BBBP, Tox21, ToxCast, ClinTox, HIV, SIDER, BACE
InfoGraph [49] Graph Graph Classification (Acc) MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, QM9, IMDB-M
SUGAR [84] Graph Graph Classification (Acc) MUTAG, PTC, PROTEINS, D&D, NCI1, NCI109
BiGI [37] Link Link Prediction (AUC-ROC, AUC-PR) DBLP, ML-100K, ML-1M, Wikipedia
HTC [99] Graph Graph Classification (Acc) MUTAG, PTC, IMDB-B, IMDB-M, RDT-B, QM9, RDT-M5K
Node Property Prediction [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
S2GRL [59] Node/Link Node Classification (Acc, Micro-F1), Node Clustering (NMI), Link Prediction (ROC-AUC) Cora, Citeseer, Pubmed, PPI, Flickr, BlogCatalog, Reddit
PairwiseDistance [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
PairwiseAttrSim [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Distance2Cluster [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
EdgeMask [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
TopoTER [101] Node/Graph Node Classification (Acc), Graph Classification (Acc) Cora, Citeseer, Pubmed, MUTAG, PTC, RDT-B, RDT-M5K, IMDB-B, IMDB-M
Centrality Score Ranking [17] Node/Link/Graph Node Classification (Micro-F1), Link Prediction (Micro-F1), Graph Classification (Micro-F1) Cora, Pubmed, ML-100K, ML-1M, IMDB-M, IMDB-B
Meta-path Prediction [69] Node/Link Node Classification (F1-score), Link Prediction (ROC-AUC) ACM, IMDB, Last-FM, Book-Crossing
SLiCE [70] Link Link Prediction (ROC-AUC, Micro-F1) Amazon, DBLP, Freebase, Twitter, Healthcare
Distance2Labeled [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed
ContextLabel [10] Node Node Classification (Acc) Cora, Citeseer, Pubmed, Reddit
HCM [106] Node Node Classification (ROC-AUC) ACM, Amazon, Enron, BlogCatalog, Flickr
Contextual Molecular Property Prediction [76] Graph Graph Classification (Acc), Graph Regression (MAE) BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast, ESOL, FreeSolv, Lipo, QM7, QM8
Graph-level Motif Prediction [76] Graph Graph Classification (Acc), Graph Regression (MAE) BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast, ESOL, FreeSolv, Lipo, QM7, QM8
Multi-stage Self-training [71] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Node Clustering [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Graph Partitioning [12] Node Node Classification (Acc) Cora, Citeseer, Pubmed
CAGNN [74] Node Node Classification (Micro-F1, Macro-F1), Node Clustering (Micro-F1, Macro-F1, NMI) Cora, Citeseer, Pubmed
M3S [75] Node Node Classification (Acc) Cora, Citeseer, Pubmed
Cluster Preserving [17] Node/Link/Graph Node Classification (Micro-F1), Link Prediction (Micro-F1), Graph Classification (Micro-F1) Cora, Pubmed, ML-100K, ML-1M, IMDB-M, IMDB-B
TABLE A5: A summary and statistics of common graph datasets for graph self-supervised learning.
Dataset Category #Graph #Node (Avg.) #Edge (Avg.) #Feature #Class Citation
Cora [103] Citation Network 1 2708 5429 1433 7 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [17], [71], [74], [75], [29], [35], [83], [101]
Citeseer [107] Citation Network 1 3327 4732 3703 6 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [71], [74], [75], [29], [35], [83], [101]
Pubmed [108] Citation Network 1 19717 44338 500 3 [12], [10], [13], [14], [15], [26], [43], [16], [92], [9], [18], [19], [48], [59], [17], [71], [74], [75], [29], [35], [83], [101]
Wiki-CS [109] Citation Network 1 11701 216123 300 10 [40], [43], [27]
Coauthor-CS [110] Citation Network 1 18333 81894 6805 15 [40], [92], [48], [27], [35], [83]
Coauthor-Physics [110] Citation Network 1 34493 247962 8415 5 [40], [48], [27], [35]
DBLP (v12) Citation Network 1 4894081 45564149 - - [26], [80], [70], [37], [31], [30], [47]
ogbn-arxiv [111] Citation Network 1 169343 1166243 128 40 [27], [29]
Reddit [8] Social Network 1 232965 11606919 602 41 [10], [32], [26], [16], [9], [19], [59], [29]
BlogCatalog [112] Social Network 1 5196 171743 8189 6 [16], [59], [106]
Flickr [112] Social Network 1 7575 239738 12047 9 [16], [19], [59], [106]
COLLAB [77] Social Networks 5000 74.49 2457.78 - 2 [25], [33], [45], [38]
RDT-B [77] Social Networks 2000 429.63 497.75 - 2 [25], [38], [18], [49], [99], [101], [91]
RDT-M5K [77] Social Networks 4999 508.52 594.87 - 5 [25], [38], [49], [99], [101], [91]
IMDB-B [77] Social Networks 1000 19.77 96.53 - 2 [25], [33], [45], [38], [18], [49], [17], [80], [69], [31], [99], [30], [101], [91]
IMDB-M [77] Social Networks 1500 13.00 65.94 - 3 [33], [45], [38], [18], [49], [17], [80], [69], [31], [99], [30], [101], [91]
ML-100K [113] Social Networks 1 2625 100000 - 5 [17], [37]
ML-1M [113] Social Networks 1 9940 1000209 - 5 [17], [37], [82]
PPI [79] Protein Networks 24 56944 818716 50 121 [11], [25], [26], [16], [9], [19], [59], [27]
D&D [78, 114] Protein Networks 1178 284.32 715.65 82 2 [25], [45], [34], [84]
PROTEINS [78, 115] Protein Networks 1113 39.06 72.81 4 2 [25], [45], [34], [84]
NCI1 [116, 77] Molecule Graphs 4110 29.87 32.30 37 2 [25], [33], [45], [34], [84]
MUTAG [117, 118] Molecule Graphs 188 17.93 19.79 7 2 [11], [25], [33], [45], [18], [49], [84], [99], [101], [91]
QM9 (QM7, QM8) [119] Molecule Graphs 133885 - - - - [33], [49], [10], [99]
BBBP [120, 111] Molecule Graphs 2039 24.05 25.94 - 2 [11], [25], [39], [76], [96]
Tox21 [121, 122, 111] Molecule Graphs 7831 18.51 25.94 - 12 [11], [25], [39], [76], [96]
ToxCast [123, 111] Molecule Graphs 8575 18.78 19.26 - 167 [11], [25], [39], [76], [96]
ClinTox [124] Molecule Graphs 1478 26.13 27.86 - 2 [11], [25], [39], [76], [96]
MUV [125] Molecule Graphs 93087 24.23 26.28 - 17 [11], [25], [96]
HIV [111] Molecule Graphs 41127 25.53 27.48 - 2 [11], [25], [39], [96]
SIDER [126] Molecule Graphs 1427 33.64 35.36 - 27 [11], [25], [39], [76], [96]
BACE [127] Molecule Graphs 1513 34.12 36.89 - 2 [11], [25], [39], [76], [96]
PTC [114] Molecule Graphs 344 14.29 14.69 19 2 [33], [18], [49], [99], [101], [91], [11], [45], [84]
NCI109 [116, 77] Molecule Graphs 4127 29.68 32.13 - 2 [34], [84]
Mutagenicity [128, 129] Molecule Graphs 4337 30.32 30.77 - 2 [34]
MNIST [130] Others (Image) - 70000 - 784 10 [25]
CIFAR10 [131] Others (Image) - 60000 - 1024 10 [25]
METR-LA [132] Others (Traffic) 1 207 1515 2 - [28]
Amazon-Computers [110] Others (Purchase) 1 13752 245861 767 10 [40], [92], [48], [32], [70], [27], [106], [30], [29], [35], [83]
Amazon-Photo [110] Others (Purchase) 1 7650 119081 745 8 [40], [43], [92], [48], [32], [70], [27], [106], [30], [29], [35], [83]
ogbn-products [111] Others (Purchase) 1 2449029 61859140 100 47 [29]
TABLE A6: A summary of open-source codes of the surveyed papers.
Methods Github
Graph Completion [12] https://github.com/Shen-Lab/SS-GCNs
Node Attribute Masking [10] https://github.com/ChandlerBang/SelfTask-GNN
Edge Attribute Masking [11] http://snap.stanford.edu/gnn-pretrain
Node Attribute Denoising [13] N.A.
Adjacency Matrix Reconstruction [14] N.A.
Graph Bert [15] https://github.com/anonymous-sourcecode/Graph-Bert
Pretrain-Recsys [82] https://github.com/jerryhao66/Pretrain-Recsys
GPT-GNN [32] https://github.com/acbull/GPT-GNN
GraphCL [25] https://github.com/Shen-Lab/GraphCL
IGSD [33] N.A.
DACL [91] N.A.
LCC [45] https://github.com/YuxiangRen
CSSL [34] N.A.
GCC [38] https://github.com/THUDM/GCC
GRACE [26] https://github.com/CRIPAC-DIG/GRACE
GCA [40] https://github.com/CRIPAC-DIG/GCA
GROC [43] N.A.
SEPT [46] https://github.com/Coder-Yu/QRec
STDGI [28] N.A.
GMI [16] https://github.com/zpeng27/GMI
KS2L [83] N.A.
CG3 [92] N.A.
BGRL [27] N.A.
SelfGNN [35] https://github.com/zekarias-tilahun/SelfGNN
HeCo [47] https://github.com/liun-online/HeCo
PT-DGNN [89] https://github.com/Mobzhang/PT-DGNN
COAD [93] https://github.com/allanchen95/Expert-Linking
Contrast-Reg [29] N.A.
DGI [9] https://github.com/PetarV-/DGI
HDMI [30] N.A.
DMGI [80] https://github.com/pcy1302/DMGI
MVGRL [18] https://github.com/kavehhassani/mvgrl
HDGI [31] https://github.com/YuxiangRen/Heterogeneous-Deep-Graph-Infomax
Subg-Con [19] https://github.com/yzjiao/Subg-Con
Context Prediction [11] http://snap.stanford.edu/gnn-pretrain
GIC [48] https://github.com/cmavro/Graph-InfoClust-GIC
GraphLoG [96] https://openreview.net/forum?id=DAaaaqPv9-q
MHCN [98] https://github.com/Coder-Yu/RecQ
EGI [36] https://openreview.net/forum?id=J_pvI6ap5Mn
MICRO-Graph [39] https://drive.google.com/file/d/1b751rpnV-SDmUJvKZZI-AvpfEa9eHxo9
InfoGraph [49] https://github.com/fanyun-sun/InfoGraph
SUGAR [84] https://github.com/RingBDStack/SUGAR
BiGI [37] https://github.com/clhchtcjj/BiNE
HTC [99] N.A.
Node Property Prediction [10] https://github.com/ChandlerBang/SelfTask-GNN
S2GRL [59] N.A.
PairwiseDistance [10] https://github.com/ChandlerBang/SelfTask-GNN
PairwiseAttrSim [10] https://github.com/ChandlerBang/SelfTask-GNN
Distance2Cluster [10] https://github.com/ChandlerBang/SelfTask-GNN
EdgeMask [10] https://github.com/ChandlerBang/SelfTask-GNN
TopoTER [101] N.A.
Centrality Score Ranking [17] N.A.
Meta-path Prediction [69] https://github.com/mlvlab/SELAR
SLiCE [70] https://github.com/pnnl/SLICE
Distance2Labeled [10] https://github.com/ChandlerBang/SelfTask-GNN
ContextLabel [10] https://github.com/ChandlerBang/SelfTask-GNN
HCM [106] N.A.
Contextual Molecular Property Prediction [76] https://github.com/tencent-ailab/grover
Graph-level Motif Prediction [76] https://github.com/tencent-ailab/grover
Multi-stage Self-training [71] https://github.com/Davidham3/deeper_insights_into_GCNs
Node Clustering [12] https://github.com/Shen-Lab/SS-GCNs
Graph Partitioning [12] https://github.com/Shen-Lab/SS-GCNs
CAGNN [74] N.A.
M3S [75] https://github.com/datake/M3S
Cluster Preserving [17] N.A.
TABLE A7: A summary of node classification accuracy (%) of different methods on 8 commonly used datasets. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash “-”.
Methods Cora Citeseer Pubmed Wiki-CS Amazon-Computers Amazon-Photo Coauthor-CS Coauthor-Physics
GRACE [26] 80.00 71.70 79.50 78.19 87.46 92.15 92.93 95.26
GMI [16] 83.00 72.40 79.90 74.85 82.21 90.68 - -
Adjacency Matrix Reconstruction [14] 83.80 72.95 81.23 - - - - -
Graph Bert [15] 84.30 71.20 79.30 - 87.57 - 92.99 95.41
CG3 [92] 83.40 73.60 80.20 - 88.06 92.21 92.30 -
S2GRL [59] 83.70 72.10 82.40 78.58 88.49 - 92.66 95.24
PairwiseDistance [10] 83.11 71.90 80.05 - - - - -
PairwiseAttrSim [10] 83.05 71.67 79.45 - - - - -
Contrast-Reg [29] 82.65 72.98 80.10 77.13 84.93 91.09 - 94.38
Distance2Cluster [10] 83.55 71.44 79.88 - - - - -
TopoTER [101] 83.70 71.70 79.10 - - - 92.87 95.22
DGI [9] 82.30 71.80 76.80 75.35 83.95 91.61 92.15 94.51
MVGRL [18] 82.90 72.60 79.40 77.52 87.52 91.74 92.11 95.33
Subg-Con [19] 83.50 73.20 81.00 78.69 88.65 92.42 - -
Distance2Labeled [10] 83.39 71.64 79.51 - - - - -
ContextLabel [10] 82.76 72.59 82.31 - - - - -
GIC [48] 81.70 71.90 77.30 - 84.89 92.11 92.51 94.70
Node Property Prediction [10] 81.94 71.60 79.44 - - - - -
GCA [40] - - - 78.35 88.94 92.53 93.10 95.75
M3S [75] 81.60 71.94 79.28 - - - - -
KS2L [83] 84.60 74.20 83.80 - 86.80 92.40 93.30 -
BGRL [27] - - - 79.36 89.68 92.87 93.21 95.56
SelfGNN [35] 85.30 72.30 82.70 - 88.80 93.80 92.90 95.50
TABLE A8: A summary of graph classification accuracy (%) of different methods on 7 commonly used datasets. When the code is not publicly available, or when it is impractical to run the released code from scratch, we replace the corresponding results with a dash “-”.
Methods MUTAG PTC NCI1 RDT-B RDT-M5K IMDB-B IMDB-M
GraphCL [25] 86.80 - 77.87 89.53 55.99 71.14 51.83
IGSD [33] 90.20 61.40 75.40 - - 74.70 51.50
DACL [91] 87.51 63.59 - 85.11 53.20 73.98 50.78
LCC [45] 90.50 65.90 82.90 - - 76.10 52.40
CSSL [34] 88.95 - 80.09 84.21 53.76 - -
GCC [38] - - - 87.80 53.00 75.60 50.90
MVGRL [18] 89.70 62.50 - 84.50 54.03 74.20 51.20
InfoGraph [49] 89.01 61.65 - 82.50 53.46 73.03 49.69
SUGAR [84] 96.74 77.53 84.39 - - - -
HTC [99] 91.80 65.50 - 91.10 55.70 73.30 50.60
TopoTER [101] 89.25 64.59 - 84.93 55.52 73.46 49.68
Figure A1: Key milestones in the development of graph self-supervised learning. The generative, contrastive, and predictive learning methods are marked with red, orange, and green blocks, respectively. Best viewed in color.

References

  • [1] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” arXiv preprint arXiv:2006.08218, vol. 1, no. 2, 2020.
  • [2] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [4] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
  • [5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” arXiv preprint arXiv:1706.02216, 2017.
  • [9] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax.” in ICLR (Poster), 2019.
  • [10] W. Jin, T. Derr, H. Liu, Y. Wang, S. Wang, Z. Liu, and J. Tang, “Self-supervised learning on graphs: Deep insights and new direction,” arXiv preprint arXiv:2006.10141, 2020.
  • [11] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec, “Strategies for pre-training graph neural networks,” arXiv preprint arXiv:1905.12265, 2019.
  • [12] Y. You, T. Chen, Z. Wang, and Y. Shen, “When does self-supervision help graph convolutional networks?” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 871–10 880.
  • [13] F. Manessi and A. Rozza, “Graph-based neural network models with multiple self-supervised auxiliary tasks,” arXiv preprint arXiv:2011.07267, 2020.
  • [14] Q. Zhu, B. Du, and P. Yan, “Self-supervised training of graph convolutional networks,” arXiv preprint arXiv:2006.02380, 2020.
  • [15] J. Zhang, H. Zhang, C. Xia, and L. Sun, “Graph-bert: Only attention is needed for learning graph representations,” arXiv preprint arXiv:2001.05140, 2020.
  • [16] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, “Graph representation learning via graphical mutual information maximization,” in Proceedings of The Web Conference 2020, 2020, pp. 259–270.
  • [17] Z. Hu, C. Fan, T. Chen, K.-W. Chang, and Y. Sun, “Pre-training graph neural networks for generic structural feature extraction,” arXiv preprint arXiv:1905.13728, 2019.
  • [18] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in International Conference on Machine Learning.   PMLR, 2020, pp. 4116–4126.
  • [19] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu, “Sub-graph contrast for scalable self-supervised graph representation learning,” arXiv preprint arXiv:2009.10273, 2020.
  • [20] Y. Xie, Z. Xu, Z. Wang, and S. Ji, “Self-supervised learning of graph neural networks: A unified review,” arXiv preprint arXiv:2102.10757, 2021.
  • [21] Y. Liu, S. Pan, M. Jin, C. Zhou, F. Xia, and P. S. Yu, “Graph self-supervised learning: A survey,” arXiv preprint arXiv:2103.00111, 2021.
  • [22] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [24] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, 2020.
  • [25] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [26] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Deep graph contrastive representation learning,” arXiv preprint arXiv:2006.04131, 2020.
  • [27] S. Thakoor, C. Tallec, M. G. Azar, R. Munos, P. Veličković, and M. Valko, “Bootstrapped representation learning on graphs,” arXiv preprint arXiv:2102.06514, 2021.
  • [28] F. L. Opolka, A. Solomon, C. Cangea, P. Veličković, P. Liò, and R. D. Hjelm, “Spatio-temporal deep graph infomax,” arXiv preprint arXiv:1904.06316, 2019.
  • [29] K. Ma, H. Yang, H. Yang, T. Jin, P. Chen, Y. Chen, B. F. Kamhoua, and J. Cheng, “Improving graph representation learning by contrastive regularization,” arXiv preprint arXiv:2101.11525, 2021.
  • [30] B. Jing, C. Park, and H. Tong, “Hdmi: High-order deep multiplex infomax,” arXiv preprint arXiv:2102.07810, 2021.
  • [31] Y. Ren, B. Liu, C. Huang, P. Dai, L. Bo, and J. Zhang, “Heterogeneous deep graph infomax,” arXiv preprint arXiv:1911.08538, 2019.
  • [32] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “Gpt-gnn: Generative pre-training of graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1857–1867.
  • [33] H. Zhang, S. Lin, W. Liu, P. Zhou, J. Tang, X. Liang, and E. P. Xing, “Iterative graph self-distillation,” arXiv preprint arXiv:2010.12609, 2020.
  • [34] J. Zeng and P. Xie, “Contrastive self-supervised learning for graph classification,” arXiv preprint arXiv:2009.05923, 2020.
  • [35] Z. T. Kefato and S. Girdzijauskas, “Self-supervised graph neural networks without explicit negative sampling,” arXiv preprint arXiv:2103.14958, 2021.
  • [36] Q. Zhu, Y. Xu, H. Wang, C. Zhang, J. Han, and C. Yang, “Transfer learning of graph neural networks with ego-graph information maximization,” arXiv preprint arXiv:2009.05204, 2020.
  • [37] J. Cao, X. Lin, S. Guo, L. Liu, T. Liu, and B. Wang, “Bipartite graph embedding via mutual information maximization,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 635–643.
  • [38] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, “Gcc: Graph contrastive coding for graph neural network pre-training,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1150–1160.
  • [39] S. Zhang, Z. Hu, A. Subramonian, and Y. Sun, “Motif-driven contrastive learning of graph representations,” arXiv preprint arXiv:2012.12533, 2020.
  • [40] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Graph contrastive learning with adaptive augmentation,” arXiv preprint arXiv:2010.14945, 2020.
  • [41] P. Bonacich, “Power and centrality: A family of measures,” American journal of sociology, vol. 92, no. 5, pp. 1170–1182, 1987.
  • [42] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
  • [43] N. Jovanović, Z. Meng, L. Faber, and R. Wattenhofer, “Towards robust graph contrastive learning,” arXiv preprint arXiv:2102.13085, 2021.
  • [44] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
  • [45] Y. Ren, J. Bai, and J. Zhang, “Label contrastive coding based graph neural network for graph classification,” arXiv preprint arXiv:2101.05486, 2021.
  • [46] J. Yu, H. Yin, M. Gao, X. Xia, X. Zhang, and N. Q. V. Hung, “Socially-aware self-supervised tri-training for recommendation,” arXiv preprint arXiv:2106.03569, 2021.
  • [47] X. Wang, N. Liu, H. Han, and C. Shi, “Self-supervised heterogeneous graph neural network with co-contrastive learning,” arXiv preprint arXiv:2105.09111, 2021.
  • [48] C. Mavromatis and G. Karypis, “Graph infoclust: Leveraging cluster-level node information for unsupervised graph representation learning,” arXiv preprint arXiv:2009.06946, 2020.
  • [49] F.-Y. Sun, J. Hoffmann, V. Verma, and J. Tang, “Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization,” arXiv preprint arXiv:1908.01000, 2019.
  • [50] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in International Conference on Machine Learning.   PMLR, 2018, pp. 531–540.
  • [51] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” arXiv preprint arXiv:1606.00709, 2016.
  • [52] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
  • [53] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1857–1865.
  • [54] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” arXiv preprint arXiv:1907.13625, 2019.
  • [55] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [56] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 403–412.
  • [57] M. Kemertas, L. Pishdad, K. G. Derpanis, and A. Fazly, “Rankmi: A mutual information maximizing ranking loss,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 362–14 371.
  • [58] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [59] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng, “Self-supervised graph representation learning via global context prediction,” arXiv preprint arXiv:2003.01604, 2020.
  • [60] J. Macqueen, “Some methods for classification and analysis of multivariate observations,” in In 5-th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
  • [61] Z. Gao, H. Lin, S. Li et al., “Clustering based on graph of density topology,” arXiv preprint arXiv:2009.11612, 2020.
  • [62] L. Wu, Z. Liu, Z. Zang, J. Xia, S. Li, S. Li et al., “Deep clustering and representation learning that preserves geometric structures,” arXiv preprint arXiv:2009.09590, 2020.
  • [63] S. Z. Li, L. Wu, and Z. Zang, “Consistent representation learning for high dimensional data analysis,” arXiv preprint arXiv:2012.00481, 2020.
  • [64] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu, “Deep spectral clustering using dual autoencoder network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4066–4075.
  • [65] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5147–5156.
  • [66] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in International conference on machine learning, 2016, pp. 478–487.
  • [67] R. McConville, R. Santos-Rodriguez, R. J. Piechocki, and I. Craddock, “N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding,” arXiv preprint arXiv:1908.05968, 2019.
  • [68] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.
  • [69] D. Hwang, J. Park, S. Kwon, K.-M. Kim, J.-W. Ha, and H. J. Kim, “Self-supervised auxiliary learning with meta-paths for heterogeneous graphs,” arXiv preprint arXiv:2007.08294, 2020.
  • [70] P. Wang, K. Agarwal, C. Ham, S. Choudhury, and C. K. Reddy, “Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks,” arXiv preprint arXiv:2007.11192, 2020.
  • [71] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
  • [72] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of statistical mechanics: theory and experiment, vol. 2008, no. 10, p. P10008, 2008.
  • [73] V. A. Traag, L. Waltman, and N. J. Van Eck, “From louvain to leiden: guaranteeing well-connected communities,” Scientific reports, vol. 9, no. 1, pp. 1–12, 2019.
  • [74] Y. Zhu, Y. Xu, F. Yu, S. Wu, and L. Wang, “Cagnn: Cluster-aware graph neural networks for unsupervised graph representation learning,” arXiv preprint arXiv:2009.01674, 2020.
  • [75] K. Sun, Z. Lin, and Z. Zhu, “Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5892–5899.
  • [76] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [77] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [78] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures from non-enzymes without alignments,” Journal of molecular biology, vol. 330, no. 4, pp. 771–783, 2003.
  • [79] M. Zitnik and J. Leskovec, “Predicting multicellular function through multi-layer tissue networks,” Bioinformatics, vol. 33, no. 14, pp. i190–i198, 2017.
  • [80] C. Park, D. Kim, J. Han, and H. Yu, “Unsupervised attributed multiplex network embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5371–5378.
  • [81] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” arXiv preprint arXiv:2103.03230, 2021.
  • [82] B. Hao, J. Zhang, H. Yin, C. Li, and H. Chen, “Pre-training graph neural networks for cold-start users and items representation,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 265–273.
  • [83] L. Yu, S. Pei, C. Zhang, L. Ding, J. Zhou, L. Li, and X. Zhang, “Self-supervised smoothing graph neural networks,” arXiv preprint arXiv:2009.00934, 2020.
  • [84] Q. Sun, H. Peng, J. Li, J. Wu, Y. Ning, P. S. Yu, and L. He, “Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism,” arXiv preprint arXiv:2101.08170, 2021.
  • [85] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [86] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” arXiv preprint arXiv:2010.01028, 2020.
  • [87] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” arXiv preprint arXiv:2010.04592, 2020.
  • [88] J. Shang, T. Ma, C. Xiao, and J. Sun, “Pre-training of graph augmented transformers for medication recommendation,” arXiv preprint arXiv:1906.00346, 2019.
  • [89] J. Zhang, K. Chen, and Y. Wang, “Pre-training on dynamic graph neural networks,” arXiv preprint arXiv:2102.12380, 2021.
  • [90] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
  • [91] V. Verma, M.-T. Luong, K. Kawaguchi, H. Pham, and Q. V. Le, “Towards domain-agnostic contrastive learning,” arXiv preprint arXiv:2011.04419, 2020.
  • [92] S. Wan, S. Pan, J. Yang, and C. Gong, “Contrastive and generative graph convolutional networks for graph-based semi-supervised learning,” arXiv preprint arXiv:2009.07111, 2020.
  • [93] B. Chen, J. Zhang, X. Zhang, X. Tang, L. Cai, H. Chen, C. Li, P. Zhang, and J. Tang, “Coad: Contrastive pre-training with adversarial fine-tuning for zero-shot expert linking,” arXiv preprint arXiv:2012.11336, 2020.
  • [94] T. Kipf, E. van der Pol, and M. Welling, “Contrastive learning of structured world models,” arXiv preprint arXiv:1911.12247, 2019.
  • [95] S. Cheng, L. Zhang, B. Jin, Q. Zhang, and X. Lu, “Drug target prediction using graph representation learning via substructures contrast,” 2021.
  • [96] M. Xu, H. Wang, B. Ni, H. Guo, and J. Tang, “Self-supervised graph-level representation learning with local and global structure,” 2021. [Online]. Available: https://openreview.net/forum?id=DAaaaqPv9-q
  • [97] L. Xu, A. Krzyzak, and E. Oja, “Rival penalized competitive learning for clustering analysis, rbf net, and curve detection,” IEEE Transactions on Neural networks, vol. 4, no. 4, pp. 636–649, 1993.
  • [98] J. Yu, H. Yin, J. Li, Q. Wang, N. Q. V. Hung, and X. Zhang, “Self-supervised multi-channel hypergraph convolutional network for social recommendation,” arXiv preprint arXiv:2101.06448, 2021.
  • [99] C. Wang and Z. Liu, “Graph representation learning by ensemble aggregating subgraphs via mutual information maximization,” arXiv preprint arXiv:2103.13125, 2021.
  • [100] B. Fatemi, L. E. Asri, and S. M. Kazemi, “Slaps: Self-supervision improves structure learning for graph neural networks,” arXiv preprint arXiv:2102.05034, 2021.
  • [101] X. Gao, W. Hu, and G.-J. Qi, “TopoTER: Unsupervised learning of topology transformation equivariant representations,” 2021. [Online]. Available: https://openreview.net/forum?id=9az9VKjOx00
  • [102] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp. 912–919.
  • [103] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, pp. 93–93, 2008.
  • [104] A. Sehanobish, N. G. Ravindra, and D. van Dijk, “Self-supervised edge features for improved graph neural network training,” arXiv preprint arXiv:2007.04777, 2020.
  • [105] M. Yasunaga and P. Liang, “Graph-based, self-supervised program repair from diagnostic feedback,” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 799–10 808.
  • [106] T. Huang, Y. Pei, V. Menkovski, and M. Pechenizkiy, “Hop-count based self-supervised anomaly detection on attributed networks,” arXiv preprint arXiv:2104.07917, 2021.
  • [107] C. L. Giles, K. D. Bollacker, and S. Lawrence, “Citeseer: An automatic citation indexing system,” in Proceedings of the third ACM conference on Digital libraries, 1998, pp. 89–98.
  • [108] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automating the construction of internet portals with machine learning,” Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
  • [109] P. Mernyei and C. Cangea, “Wiki-cs: A wikipedia-based benchmark for graph neural networks,” arXiv preprint arXiv:2007.02901, 2020.
  • [110] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls of graph neural network evaluation,” arXiv preprint arXiv:1811.05868, 2018.
  • [111] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” arXiv preprint arXiv:2005.00687, 2020.
  • [112] J. Li, X. Hu, J. Tang, and H. Liu, “Unsupervised streaming feature selection in social media,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015, pp. 1041–1050.
  • [113] F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” Acm transactions on interactive intelligent systems (tiis), vol. 5, no. 4, pp. 1–19, 2015.
  • [114] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels.” Journal of Machine Learning Research, vol. 12, no. 9, 2011.
  • [115] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel, “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. suppl_1, pp. i47–i56, 2005.
  • [116] N. Wale, I. A. Watson, and G. Karypis, “Comparison of descriptor spaces for chemical compound retrieval and classification,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
  • [117] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity,” Journal of medicinal chemistry, vol. 34, no. 2, pp. 786–797, 1991.
  • [118] N. Kriege and P. Mutzel, “Subgraph matching kernels for attributed graphs,” arXiv preprint arXiv:1206.6483, 2012.
  • [119] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific data, vol. 1, no. 1, pp. 1–7, 2014.
  • [120] I. F. Martins, A. L. Teixeira, L. Pinheiro, and A. O. Falcao, “A bayesian approach to in silico blood-brain barrier penetration modeling,” Journal of chemical information and modeling, vol. 52, no. 6, pp. 1686–1697, 2012.
  • [121] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “Deeptox: toxicity prediction using deep learning,” Frontiers in Environmental Science, vol. 3, p. 80, 2016.
  • [122] R. Huang, M. Xia, D.-T. Nguyen, T. Zhao, S. Sakamuru, J. Zhao, S. A. Shahane, A. Rossoshek, and A. Simeonov, “Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs,” Frontiers in Environmental Science, vol. 3, p. 85, 2016.
  • [123] A. M. Richard, R. S. Judson, K. A. Houck, C. M. Grulke, P. Volarath, I. Thillainadarajah, C. Yang, J. Rathman, M. T. Martin, J. F. Wambaugh et al., “Toxcast chemical landscape: paving the road to 21st century toxicology,” Chemical research in toxicology, vol. 29, no. 8, pp. 1225–1251, 2016.
  • [124] P. A. Novick, O. F. Ortiz, J. Poelman, A. Y. Abdulhay, and V. S. Pande, “Sweetlead: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery,” PloS one, vol. 8, no. 11, p. e79568, 2013.
  • [125] E. J. Gardiner, J. D. Holliday, C. O’Dowd, and P. Willett, “Effectiveness of 2d fingerprints for scaffold hopping,” Future medicinal chemistry, vol. 3, no. 4, pp. 405–414, 2011.
  • [126] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork, “The sider database of drugs and side effects,” Nucleic acids research, vol. 44, no. D1, pp. D1075–D1079, 2016.
  • [127] G. Subramanian, B. Ramsundar, V. Pande, and R. A. Denny, “Computational modeling of β-secretase 1 (bace-1) inhibitors using ligand based approaches,” Journal of chemical information and modeling, vol. 56, no. 10, pp. 1936–1949, 2016.
  • [128] K. Riesen and H. Bunke, “Iam graph database repository for graph based pattern recognition and machine learning,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).   Springer, 2008, pp. 287–297.
  • [129] J. Kazius, R. McGuire, and R. Bursi, “Derivation and validation of toxicophores for mutagenicity prediction,” Journal of medicinal chemistry, vol. 48, no. 1, pp. 312–320, 2005.
  • [130] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [131] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [132] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Communications of the ACM, vol. 57, no. 7, pp. 86–94, 2014.