
Entropy Neural Estimation for Graph Contrastive Learning

Yixuan Ma, Lanzhou University, Lanzhou, China
Xiaolin Zhang, Shenzhen, China
Peng Zhang, College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
Kun Zhan, School of Information Science and Engineering, Lanzhou University, Lanzhou, China
(2023)
Abstract.

Contrastive learning on graphs aims at extracting distinguishable high-level representations of nodes. In this paper, we theoretically illustrate that the entropy of a dataset can be approximated by maximizing the lower bound of the mutual information across different views of a graph, i.e., entropy is estimated by a neural network. Based on this finding, we propose a simple yet effective subset sampling strategy to contrast pairwise representations between views of a dataset. In particular, we randomly sample nodes and edges from a given graph to build the input subset for a view. Two views are fed into a parameter-shared Siamese network to extract the high-dimensional embeddings and estimate the information entropy of the entire graph. For the learning process, we propose to optimize the network using two objectives, simultaneously. Concretely, the input of the contrastive loss function consists of positive and negative pairs. Our selection strategy of pairs is different from previous works and we present a novel strategy to enhance the representation ability of the graph encoder by selecting nodes based on cross-view similarities. We enrich the diversity of the positive and negative pairs by selecting highly similar samples and totally different data with the guidance of cross-view similarity scores, respectively. We also introduce a cross-view consistency constraint on the representations generated from the different views. This objective guarantees the learned representations are consistent across views from the perspective of the entire graph. We conduct extensive experiments on seven graph benchmarks, and the proposed approach achieves competitive performance compared to the current state-of-the-art methods. The source code will be publicly released once this paper is accepted.

graph contrastive learning, unsupervised representation learning, graph neural network

1. Introduction

The goal of unsupervised contrastive learning is to learn a representation of data such that similar instances are close together in the representation space, while dissimilar instances are far apart. This is achieved by maximizing an estimate of the mutual information between different views of the data. Mutual information is a measure of the cross-dependence between two random variables and can be used to quantify how much information one variable contains about another. In the context of unsupervised contrastive learning, maximizing mutual information between different views of the data means that the algorithm tries to optimize the cross-dependency between these views in high-dimensional space. This can help improve the performance of models by allowing them to extract useful information from data.

Unsupervised contrastive learning is usually expressed as maximizing the mutual information between augmented multi-view data (Oord et al., 2018; Belghazi et al., 2018). Given an unsupervised deep graph representation learning task, the estimation process of the mutual information is to optimize the cross-dependency between different views of the data in high-dimensional space (Belghazi et al., 2018; Poole et al., 2019). Practically, a lower bound of the information can be estimated by the contrastive learning methodology.

Figure 1. The architecture of the discriminator of \mathcal{M}-ILBO. Two-view subsets, X^{(1)} and X^{(2)}, are sampled from the raw dataset X in each epoch. Pairwise data points, {\bm{x}}_{i}^{(1)} and {\bm{x}}_{j}^{(2)}, are inputs of the encoder, and their similarity score is d_{ij}.

We propose to simulate view augmentation via subset sampling from raw data and to contrastively maximize the mutual information across subsets. The training process can be viewed as learning the correlations between two subsets of the given graph data. As illustrated in Fig. 1, we randomly sample two subsets of the input graph, i.e., node features and graph edges. Both subsets are then fed into a shared graph encoder to extract high-level representations of the input nodes. Since both subsets are drawn from the same input graph and can be seen as two different observation views of the input, the corresponding representations ought to be consistent across them. Therefore, we employ a discriminator to maintain the cross-view consistency for both subsets, where representations are forced to stay close among highly similar nodes while staying apart between dissimilar nodes. Contrastive learning maximizes the similarity between positive pairs of the same nodes from different views while minimizing the similarity across negative pairs of different nodes. Such a way of selecting positive and negative pairs conflicts with the purpose of clustering features of similar nodes. Instead, we argue that nodes with highly similar representations should be strengthened and treated as positive pairs, while only nodes with totally dissimilar representations should be used as negative pairs. Thus, we propose a simple yet effective standard for selecting positive and negative pairs to enrich the diversity of the constructed self-supervision. Beyond the proposed method, we theoretically show that this approach guarantees the maximization of the Information-entropy Lower BOund (ILBO) for a given dataset. In particular, we use the Jensen-Shannon divergence parameterized by a neural network to maximize ILBO, i.e., we estimate entropy by maximizing ILBO.

The contrastive loss in each epoch comprehensively maximizes ILBO, which is similar to Bagging algorithms, i.e., bootstrapping enhances diversity and the model learns to output representations with good discriminative ability across classes. In particular, the contrastive learning process compacts representations carrying similar information in the encoded high-dimensional manifold. In contrastive learning, positive and negative pairs are included in the contrastive loss to scatter categories. A classical data-pair partition strategy is that positive pairs have the same indices in the two representations, while negative pairs are selected in the same way except that one of the representations is randomly shuffled (Velickovic et al., 2019; Hassani and Khasahmadi, 2020). The pairwise similarity scores are then used to compute the contrastive loss. However, a defect of the classical strategy is that samples belonging to the same category are often treated as negative pairs. This defect results in low accuracy for downstream classification tasks. In contrast to the classical strategy, we further enrich the supervision information of positive and negative pairs with the guidance of the similarity scores. In particular, we add more positive pairs at off-diagonal positions whose similarity scores are high, and negative pairs whose similarity scores are low. Our intuition is based on the fact that nodes with high similarity are more likely to belong to the same category, while nodes with low similarity tend to belong to different categories. Selecting samples of the same category as positive pairs and samples of different categories as negative pairs helps render the learned representations more task-relevant.

Besides the aforementioned contrastive objective, we propose to introduce a cross-view consistency loss to align the two views from a global perspective. The consistency loss is based on error modeling, i.e., we hypothesize that the error of cross-view embeddings is governed by a Gaussian distribution, and the cross-view consistency loss is defined as the negative log-likelihood of the error. In practice, such consistency is realized by positive pairs, i.e., a positive pair consists of two representations of the same node in different views, and this has been verified to be quite useful for unlabeled samples (Sajjadi et al., 2016; Tarvainen and Valpola, 2017; Berthelot et al., 2019; Xie et al., 2020; Zhang et al., 2021a). This cross-view consistency ensures that the predicted representation of the entire graph remains consistent when different subsets are fed as input. Concretely, we choose a simple yet effective approach and constrain the cross-view consistency by the \ell_{2} norm of the difference between embeddings across views.

In a nutshell, we propose a maximizing-ILBO (\mathcal{M}-ILBO) algorithm, in which a novel contrastive loss trains a parameter-shared graph encoder on two sampled subsets of the raw dataset. We argue that the estimation of information entropy can be achieved by gradient descent over neural networks. Further, a cross-view consistency regularization renders different views to produce aligned information. With the contrastive loss and the cross-view consistency, \mathcal{M}-ILBO gathers more and more task-relevant information. Extensive experiments under various conditions are conducted to verify the superiority of \mathcal{M}-ILBO on seven graph datasets. Compared to existing state-of-the-art unsupervised representation learning algorithms, \mathcal{M}-ILBO achieves competitive performance and exhibits satisfactory generalization ability.

The main contributions are summarized as follows:

  • We theoretically show that the entropy of a dataset can be estimated by maximizing ILBO. Based on this, we propose an effective approach that samples subsets of a dataset and maximizes their mutual information so as to maximize ILBO.

  • We propose an efficient graph representation learning framework with contrastive learning. The proposed \mathcal{M}-ILBO framework trains a graph encoder with a novel contrastive loss and a cross-view consistency regularization.

  • We design a novel positive and negative pair selection strategy to improve the representation ability. We use the similarity matrix to select pairs based on their scores. Through data sampling and contrastive learning between positive and negative pairs, the representations continuously acquire more task-relevant information.

2. Entropy Neural Estimation

2.1. Problem formulation

In contrastive learning, the optimization objective is usually formulated as the cross-view mutual information lower bound maximization problem (Oord et al., 2018; Belghazi et al., 2018). This means that the model’s performance is seriously determined by the view augmentation. Inspired by the concept of Bagging (Breiman, 1996) and Dropout (Srivastava et al., 2014), this paper regards subset sampling as view augmentation and reformulates the contrastive learning as a cross-subset mutual information entropy lower bound maximization problem.

Given a certain dataset X, let {\rm E}(X) denote its information entropy. Notably, {\rm E}(X) measures the implicit data distribution structure of X. According to information theory, the entropy of X equals its self mutual information, i.e., {\rm E}(X)={\rm I}(X,X). This can be addressed by neural estimation of the cross-view mutual information in contrastive learning. Unfortunately, the solution is seriously limited by view augmentation. Instead of well-designed view augmentation, we propose to approximate the information entropy by subset re-sampling. In particular, the two sampled subsets, X^{(i)}, \forall\,i\in\{1,2\}, can be seen as two views of the same dataset X, where i indicates the i-th view. Without loss of generality, we have {\rm I}(X,X)\geq{\rm I}(X^{(1)},X^{(2)}). We approach {\rm E}(X) by maximizing the mutual information of pairwise subsets of dataset X. Then, we have

(1)  {\rm E}(X) = {\rm I}(X,X) \geq {\rm I}(X^{(1)},X^{(2)})\,.

A neural estimator of the information entropy can be realized by the accumulation of the mutual information between these re-sampled subsets.

The mutual information, {\rm I}(X^{(1)},X^{(2)}), is defined by

(2)  {\rm I}(X^{(1)},X^{(2)}) = {\rm div}_{\rm kl}\bigl(p({\bm{x}}^{(1)},{\bm{x}}^{(2)}) \,\|\, p({\bm{x}}^{(1)})\,p({\bm{x}}^{(2)})\bigr)

where {\bm{x}}^{(i)} \in X^{(i)}, \forall\,i\in\{1,2\}, denotes a sample in subset X^{(i)}, and {\rm div}_{\rm kl} denotes the Kullback-Leibler divergence.

Following contrastive learning (Oord et al., 2018; Belghazi et al., 2018), the above optimization of the mutual information between two subsets can be turned into a lower bound maximization problem. In practice, we use the Jensen-Shannon divergence instead of the Kullback-Leibler divergence, following Nowozin et al. (2016) and Poole et al. (2019), and the lower bound is thus estimated by

{\rm div}_{\rm js}\bigl(p({\bm{x}}^{(1)},{\bm{x}}^{(2)}) \,\|\, p({\bm{x}}^{(1)})\,p({\bm{x}}^{(2)})\bigr)
\geq \mathbb{E}_{p({\bm{x}}^{(1)},{\bm{x}}^{(2)})}\log\bigl(d({\bm{x}}^{(1)},{\bm{x}}^{(2)}|\theta)\bigr) + \mathbb{E}_{p({\bm{x}}^{(1)})\,p({\bm{x}}^{(2)})}\log\bigl(1-d({\bm{x}}^{(1)},{\bm{x}}^{(2)}|\theta)\bigr)
(3)  = {\rm ILBO} = -\mathcal{L}_{\rm cl}

where {\rm div}_{\rm js} denotes the Jensen-Shannon divergence. Compared with the Kullback-Leibler divergence, the Jensen-Shannon divergence has the advantage of being symmetric: it measures the distance between two distributions independently of their order, which avoids different training results caused by different input orderings.
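For completeness (this standard identity is our addition and is not stated above), the Jensen-Shannon divergence can be written as a symmetrized pair of Kullback-Leibler divergences against the mixture distribution,

{\rm div}_{\rm js}(p \,\|\, q) = \tfrac{1}{2}\,{\rm div}_{\rm kl}\bigl(p \,\big\|\, \tfrac{p+q}{2}\bigr) + \tfrac{1}{2}\,{\rm div}_{\rm kl}\bigl(q \,\big\|\, \tfrac{p+q}{2}\bigr)\,,

which makes the symmetry {\rm div}_{\rm js}(p\|q) = {\rm div}_{\rm js}(q\|p) explicit.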

We argue that the estimation of information entropy can be achieved by gradient descent over neural networks, i.e., we use d({\bm{x}}^{(1)},{\bm{x}}^{(2)}|\theta) as its neural network and define \mathcal{L}_{\rm cl} to be its loss function. It can be seen from Eq. (3) that maximizing ILBO is equivalent to minimizing the loss function \mathcal{L}_{\rm cl}. For a specific contrastive learning task, we can seek a parameterized discriminator d({\bm{x}}^{(1)},{\bm{x}}^{(2)}|\theta) to approach the lower bound of the mutual information {\rm I}(X^{(1)},X^{(2)}). The process of maximizing ILBO defines an entropy neural estimation.

With the aid of Eq. (3), we can exploit the contrastive loss \mathcal{L}_{\rm cl} to guide the discriminator d({\bm{x}}^{(1)},{\bm{x}}^{(2)}|\theta), parameterized by \theta, in estimating the mutual information between the two subsets. That is, the discriminator maximizes the lower bound of the mutual information between subsets X^{(1)} and X^{(2)} under the supervision of \mathcal{L}_{\rm cl}. This will be further discussed in §2.2.

In addition to the contrastive loss, the error between the two representations is also minimized. Suppose {\bm{z}}^{(i)}=f({\bm{x}}^{(i)}|\theta), \forall\,i\in\{1,2\}, denotes the output embedding; we hypothesize that the error \bm{\varepsilon}_{i} of cross-view embeddings is governed by the Gaussian distribution in Eq. (4), i.e.,

(4)  \bm{\varepsilon}_{i} \sim \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{\|\bm{\varepsilon}_{i}\|^{2}_{2}}{2}\Bigr)

Assuming the data points {\bm{x}} \in X are i.i.d., the cross-view consistency loss, \mathcal{L}_{\rm cvc}, is defined by taking the negative log-likelihood of the error,

(5)  \mathcal{L}_{\rm cvc} = \frac{1}{n}\sum_{i=1}^{n} \|\bm{\varepsilon}_{i}\|_{2}^{2} = \frac{1}{n}\sum_{i=1}^{n} \bigl\|{\bm{z}}_{i}^{(1)}-{\bm{z}}_{i}^{(2)}\bigr\|_{2}^{2}\,.

The cross-view consistency is a matching metric between the two output embeddings that measures how similar they are. It encourages the two views of the same data point to have similar embedding features.
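As a minimal sketch, the consistency term of Eq. (5) can be computed directly from the two embedding matrices (the names are illustrative; we assume row-wise node embeddings):

```python
import torch

def cross_view_consistency(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """L_cvc of Eq. (5): mean squared l2 distance between paired node embeddings."""
    # z1, z2: (n, d) embeddings of the same n nodes under the two sampled views.
    return ((z1 - z2) ** 2).sum(dim=1).mean()
```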

Through the above analysis, we combine two different objectives, namely the contrastive loss and the cross-view consistency, into one loss function. The overall objective for the learning process is given by

(6)  \mathcal{L} = \mathcal{L}_{\rm cl} + \lambda\,\mathcal{L}_{\rm cvc}

where \lambda is a hyper-parameter balancing the contrastive term and the cross-view consistency term.

2.2. Discriminator

In our framework, the neural discriminator estimates the mutual information between the two subsets sampled from the raw dataset. According to Eq. (3), minimizing the contrastive loss is equivalent to maximizing the mutual information between the two sampled subsets, and it forces the model to explore the intrinsic structure determined by {\rm E}(X). After multiple rounds of sampling, the distribution of the sampled subsets approaches the distribution of the raw dataset. That is, after training for multiple epochs with different samplings, the discriminator approximately estimates the information entropy of the raw dataset, {\rm E}(X).

In practice, we construct our discriminator by adapting two parameter-shared graph encoders f into a Siamese network (Bromley et al., 1993; Chopra et al., 2005; Hadsell et al., 2006). Given two views X^{(1)} and X^{(2)}, we use the parameter-shared encoder f to generate the pairwise representations of {\bm{x}}_{i}^{(1)} and {\bm{x}}_{j}^{(2)},

(7)  {\bm{z}}_{i}^{(1)} = f({\bm{x}}_{i}^{(1)}|\theta),\quad {\bm{x}}_{i}^{(1)} \in X^{(1)}\,,
(8)  {\bm{z}}_{j}^{(2)} = f({\bm{x}}_{j}^{(2)}|\theta),\quad {\bm{x}}_{j}^{(2)} \in X^{(2)}

where \theta is the parameter of the network f, and {\bm{z}}_{i}^{(1)} and {\bm{z}}_{j}^{(2)} are the latent embeddings corresponding to the inputs {\bm{x}}_{i}^{(1)} and {\bm{x}}_{j}^{(2)}, respectively. The similarity score d_{ij} is calculated by the inner product of the pairwise representations from different views,

D = \bigl(Z^{(1)}\bigr)^{\top} Z^{(2)} = \bigl[d_{ij}({\bm{x}}_{i}^{(1)},{\bm{x}}_{j}^{(2)}|\theta)\bigr]
(13)  = \begin{bmatrix} ({\bm{z}}_{1}^{(1)})^{\top}{\bm{z}}_{1}^{(2)} & ({\bm{z}}_{1}^{(1)})^{\top}{\bm{z}}_{2}^{(2)} & \cdots & ({\bm{z}}_{1}^{(1)})^{\top}{\bm{z}}_{n}^{(2)} \\ ({\bm{z}}_{2}^{(1)})^{\top}{\bm{z}}_{1}^{(2)} & ({\bm{z}}_{2}^{(1)})^{\top}{\bm{z}}_{2}^{(2)} & \cdots & ({\bm{z}}_{2}^{(1)})^{\top}{\bm{z}}_{n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ ({\bm{z}}_{n}^{(1)})^{\top}{\bm{z}}_{1}^{(2)} & ({\bm{z}}_{n}^{(1)})^{\top}{\bm{z}}_{2}^{(2)} & \cdots & ({\bm{z}}_{n}^{(1)})^{\top}{\bm{z}}_{n}^{(2)} \end{bmatrix}\,.

Our discriminator produces a similarity score for two arbitrary inputs. Under the supervision of \mathcal{L}_{\rm cl}, we estimate the mutual information of X^{(1)} and X^{(2)} by maximizing the scores of positive data pairs drawn from the joint distribution p({\bm{x}}_{i}^{(1)},{\bm{x}}_{i}^{(2)}) while minimizing the scores of negative data pairs drawn independently from the marginal distributions p({\bm{x}}_{i}^{(1)})\,p({\bm{x}}_{j}^{(2)}) (Belghazi et al., 2018; Oord et al., 2018). We use \mathcal{P} and \mathcal{N} to denote the sets of positive pairs and negative pairs, respectively.
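A sketch of the score computation is given below. Since the contrastive loss takes logarithms of the scores, we additionally squash the inner products with a sigmoid so that d_{ij} \in (0,1); this squashing is our assumption for the sketch rather than something stated explicitly above.

```python
import torch

def similarity_matrix(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Cross-view score matrix D with entries d_ij for pairs (x_i^(1), x_j^(2))."""
    # z1, z2: (n, d) view embeddings with nodes as rows, so z1 @ z2.T is n x n.
    return torch.sigmoid(z1 @ z2.t())
```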

With the help of the Monte-Carlo method, the contrastive objective \mathcal{L}_{\rm cl} can be explicitly expressed by

\mathcal{L}_{\rm cl} = -\frac{1}{|\mathcal{P}|}\sum_{(i,i)\in\mathcal{P}} \log d_{ii}({\bm{x}}_{i}^{(1)},{\bm{x}}_{i}^{(2)}|\theta)
(14)  \;-\;\frac{1}{|\mathcal{N}|}\sum_{(i,j)\in\mathcal{N}} \log\bigl(1-d_{ij}({\bm{x}}_{i}^{(1)},{\bm{x}}_{j}^{(2)}|\theta)\bigr)

where {\bm{x}}_{i}^{(1)} and {\bm{x}}_{i}^{(2)} form positive pairs (i,i)\in\mathcal{P}, while {\bm{x}}_{i}^{(1)} and {\bm{x}}_{j}^{(2)} form negative pairs (i,j)\in\mathcal{N}, \forall\,i\neq j. After this information ensemble, the upper bound of the mutual information is the information entropy of the raw dataset. Furthermore, because of the input subset sampling and the construction of supervision pairs, our contrastive objective \mathcal{L}_{\rm cl} is more powerful and yields better prediction performance, which is also empirically confirmed in §3.2.
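A possible implementation of Eq. (14) over explicit index sets \mathcal{P} and \mathcal{N} is sketched below (the clamping constant is ours, added for numerical stability):

```python
import torch

def contrastive_loss(d: torch.Tensor, pos_pairs, neg_pairs, eps: float = 1e-8) -> torch.Tensor:
    """L_cl of Eq. (14): -mean log d over P minus mean log(1 - d) over N."""
    pos_i, pos_j = zip(*pos_pairs)   # e.g. [(0, 0), (1, 1), ...] plus selected off-diagonal pairs
    neg_i, neg_j = zip(*neg_pairs)
    pos_scores = d[list(pos_i), list(pos_j)]
    neg_scores = d[list(neg_i), list(neg_j)]
    pos_term = -torch.log(pos_scores.clamp(min=eps)).mean()
    neg_term = -torch.log((1.0 - neg_scores).clamp(min=eps)).mean()
    return pos_term + neg_term
```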

2.3. Maximizing ILBO

The process of maximizing ILBO is to learn the graph encoder f parameterized by \theta in the discriminator. It includes two aspects during the training process, i.e., the input subset sampling and the construction of supervision pairs.

2.3.1. Subset sampling

Nodes and edges are two fundamental elements of a graph. Data sampling is based on bootstrapping, so training with different subsets lets the graph encoder ensemble information from the raw set. The subset sampling is straightforward: for each node feature {\bm{h}}_{i} or each edge a_{ij}, we only have two choices, drop it or keep it.

The designed sampling approach naturally needs to select the input sets of both nodes and edges. Generally, we follow very simple rules to sample them: 1) each node and edge are randomly selected w.r.t. the Bernoulli distribution; 2) to maintain the structure of the input graph, values of the dropped elements are set to zeros.

Particularly, the node feature selection follows Eq. (15)

(15)  {\bm{h}}_{i} = v\,{\bm{h}}_{i},\quad v \sim {\rm Bern}(v\,|\,1-p_{h})

where {\rm Bern}(v\,|\,1-p_{h}) is the Bernoulli distribution with drop rate p_{h}, and the value v is 0 or 1.
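A sketch of Eq. (15), reading the rule as dropping a node's whole feature vector with probability p_{h} (one could equally mask individual feature dimensions):

```python
import torch

def sample_node_features(h: torch.Tensor, p_h: float) -> torch.Tensor:
    """Eq. (15): keep each node's feature vector with probability 1 - p_h."""
    # h: (n, d) node feature matrix; dropped rows are zeroed to preserve the shape.
    keep = torch.bernoulli(torch.full((h.size(0), 1), 1.0 - p_h, device=h.device))
    return h * keep
```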

Analogously, the edges can be selected in the same way, which is also inspired by DropEdge (Rong et al., 2020; Zhu et al., 2020). We consider a direct way of sampling input graphs in which we randomly drop edges with the DropEdge rate p_{a}. After edge sampling, the new adjacency matrix \hat{A}=[\hat{a}_{ij}] is given by

(16)  \hat{a}_{ij} = \begin{cases} 0, & a_{ij}=0\,; \\ u, & a_{ij}=1 \ {\rm and}\ u \sim {\rm Bern}(u\,|\,1-p_{a}) \end{cases}

where {\rm Bern}(u\,|\,1-p_{a}) is the Bernoulli distribution and p_{a}<1. For edge-level sampling, \hat{a}_{ij} is set to zero for the dropped edges following Eq. (16).
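A dense-adjacency sketch of Eq. (16); for an undirected graph one would additionally symmetrize the sampled mask, which is omitted here for brevity:

```python
import torch

def sample_edges(a: torch.Tensor, p_a: float) -> torch.Tensor:
    """Eq. (16): keep each existing edge with probability 1 - p_a."""
    # a: (n, n) dense float adjacency matrix with entries in {0, 1}.
    keep = torch.bernoulli(torch.full_like(a, 1.0 - p_a))
    return a * keep  # zero entries stay zero; existing edges survive with prob. 1 - p_a
```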

2.3.2. Construction of supervision pairs

As illustrated in Eq. (14), the mutual information estimation between views/subsets turns into maximizing the scores of positive pairs while minimizing the scores of negative pairs under the supervision of the contrastive loss. Therefore, it is critical to construct the supervision pairs carefully. In this section, we present a well-designed selection strategy for positive and negative pairs.

In each batch, we feed X^{(1)}=[{\bm{x}}_{1}^{(1)},{\bm{x}}_{2}^{(1)},\ldots,{\bm{x}}_{n}^{(1)}] and X^{(2)}=[{\bm{x}}_{1}^{(2)},{\bm{x}}_{2}^{(2)},\ldots,{\bm{x}}_{n}^{(2)}] into the graph encoder and obtain the corresponding latent embeddings Z^{(1)}=[{\bm{z}}_{1}^{(1)},{\bm{z}}_{2}^{(1)},\ldots,{\bm{z}}_{n}^{(1)}] and Z^{(2)}=[{\bm{z}}_{1}^{(2)},{\bm{z}}_{2}^{(2)},\ldots,{\bm{z}}_{n}^{(2)}], where n is the batch size. Since the encoder network does not change the mapping structure of the graph, items have the same indices in X and Z, i.e., since the encoder is a one-to-one mapping, shuffling X has the same effect as shuffling Z. If we shuffle the indices of the embedding Z, the correspondence between the nodes in the two different views, i.e., Z^{(1)} and Z^{(2)}, is disrupted. In the classic strategy, positive pairs have the same indices in Z^{(1)} and Z^{(2)}, and negative pairs are those with the same index in Z^{(1)} and the shuffled \hat{Z}^{(2)}. Therefore, positive and negative pairs can be conveniently determined.

The two subsets can be seen as two views of the raw dataset, so the scores d_{ij} of positive and negative pairs can be selected from the cross-view similarity matrix D. In order to enlarge the sets \mathcal{P} and \mathcal{N}, we obtain the cross-view similarity matrix D by

(17)  D = [d_{ij}] = \bigl(Z^{(1)}\bigr)^{\top} Z^{(2)} = \bigl[({\bm{z}}^{(1)}_{i})^{\top}{\bm{z}}^{(2)}_{j}\bigr]

where d_{ij}=({\bm{z}}^{(1)}_{i})^{\top}{\bm{z}}^{(2)}_{j} is an element of D. By using the similarity matrix D=[d_{ij}], we modify the contrastive loss to enhance the representation ability of the encoder. As shown in Eq. (17), the diagonal elements are definitely positive pairs. In the off-diagonal part of D, high values indicate that the pairs have a larger probability of belonging to the same class, so we add them to the set \mathcal{P} of positive pairs, while candidate negative pairs have lower scores in the off-diagonal part. The classic positive pairs d_{ii} lie only on the main diagonal of the cross-view similarity matrix; we further add positive pairs at off-diagonal positions with high scores and negative pairs at positions with low scores.

In the off-diagonal part of Eq. (17), we select the top k largest pairs to add to \mathcal{P} and the top l smallest pairs as negatives. A high similarity score d_{ij} implies a high probability of belonging to the same category, so the pair (i,j) can be regarded as a positive pair, whereas negative pairs have low scores. For instance, if node 1 and node 2 have a high similarity score ({\bm{z}}_{1}^{(1)})^{\top}{\bm{z}}_{2}^{(2)}, we add (1,2) to the set \mathcal{P}. If the pair (1,2) actually belongs to the same category, such a selection helps \mathcal{M}-ILBO to improve its representation ability.
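The selection rule can be sketched as follows: every diagonal entry is a positive pair, the k largest off-diagonal scores are added to \mathcal{P}, and the l smallest off-diagonal scores form \mathcal{N} (the masking with ±inf is an implementation convenience of ours):

```python
import torch

def select_pairs(d: torch.Tensor, k: int, l: int):
    """Build the positive set P and the negative set N from the score matrix D."""
    n = d.size(0)
    pos = [(i, i) for i in range(n)]            # classic diagonal positives
    high = d.clone()
    high.fill_diagonal_(float('-inf'))          # exclude the diagonal from the top-k search
    top_idx = torch.topk(high.flatten(), k).indices
    pos += [(i.item() // n, i.item() % n) for i in top_idx]
    low = d.clone()
    low.fill_diagonal_(float('inf'))            # exclude the diagonal from the bottom-l search
    low_idx = torch.topk(low.flatten(), l, largest=False).indices
    neg = [(i.item() // n, i.item() % n) for i in low_idx]
    return pos, neg
```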

As training proceeds, the information between nodes is gradually gathered and the dependency cues between nodes are progressively determined under this supervision, so the representation ability improves over the training procedure. In each batch, the contrastive loss guides the model in estimating the mutual information between the two subsets and updates the model. The upper bound of the mutual information of the subsets is the information entropy of the raw set, since the two subsets come from the same raw set via bootstrapping. Training with different subsets enables the sampled distribution to cover the entire dataset X. The positive and negative sample selection strategy maximizes the mutual information in the direction of the ground truth.

2.3.3. Algorithm

We summarize the proposed \mathcal{M}-ILBO approach in Algorithm 1. The input graph is denoted by X=(H,A), where H=[{\bm{h}}_{i}] is the node feature matrix and A=[a_{ij}] is the adjacency matrix. Specifically, we first sample two subsets as views using the proposed bootstrapping-based graph subset sampling strategy. Then, we feed the features of both views into a graph neural network with shared parameters. A well-designed contrastive loss is utilized to enhance the representation ability of the encoder.

Algorithm 1 The \mathcal{M}-ILBO algorithm.
1:  Input: Graph data X=(H,A), node feature drop rate p_{h}, edge drop rate p_{a}, and parameter \lambda.
2:  Output: Z.
3:  Initialization: epoch=0, epoch_{\max}, and model parameter \theta.
4:  while epoch \leq epoch_{\max} do
5:     Sample X for subset 1: X^{(1)};
6:     Sample X for subset 2: X^{(2)};
7:     Z^{(1)} = f(X^{(1)}|\theta);
8:     Z^{(2)} = f(X^{(2)}|\theta);
9:     D = (Z^{(1)})^{\top} Z^{(2)};
10:    Select \mathcal{P} and \mathcal{N} from D=[d_{ij}];
11:    Compute the loss \mathcal{L} = \mathcal{L}_{\rm cl} + \lambda\,\mathcal{L}_{\rm cvc};
12:    Update parameter \theta by back propagation;
13:    epoch = epoch + 1;
14:  end while
15:  Output: Z = f(X|\theta^{\star}).
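Putting the pieces together, a minimal training loop mirroring Algorithm 1 could look as follows (a sketch reusing the helper functions sketched earlier; the encoder interface, hyper-parameters, and optimizer settings are illustrative):

```python
import torch

def train_m_ilbo(encoder, h, a, p_h, p_a, lam, k, l, epochs=200, lr=1e-3):
    """Sketch of Algorithm 1: subset sampling, contrastive loss, and consistency loss."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        # Lines 5-6: sample two views of the raw graph X = (H, A).
        h1, a1 = sample_node_features(h, p_h), sample_edges(a, p_a)
        h2, a2 = sample_node_features(h, p_h), sample_edges(a, p_a)
        # Lines 7-8: the parameter-shared encoder produces the two view embeddings.
        z1, z2 = encoder(h1, a1), encoder(h2, a2)
        # Lines 9-10: cross-view scores and supervision pairs.
        d = similarity_matrix(z1, z2)
        pos, neg = select_pairs(d, k, l)
        # Line 11: overall loss L = L_cl + lambda * L_cvc (Eq. (6)).
        loss = contrastive_loss(d, pos, neg) + lam * cross_view_consistency(z1, z2)
        # Line 12: update the shared parameters theta.
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return encoder(h, a)  # line 15: final representation Z = f(X | theta*)
```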

Besides, the cross-view consistency constraint renders each view to produce the common information. In order to learn the consensus information from both views, the encoder parameters are shared across views.

As shown in Algorithm 1, we propose a novel genre of graph contrastive learning that simulates view augmentation via subset sampling from the raw data and contrastively maximizes the mutual information across these subsets, rather than using augmented multi-view data. As demonstrated in our theoretical analysis, this approach maximizes ILBO for a given dataset. To estimate the entropy, we utilize the Jensen-Shannon divergence parameterized by a neural network, thereby maximizing ILBO. During training, we randomly sample a large number of node features and graph edges for bootstrapping, similar in effect to Dropout. The graph encoder parameters then gradually learn to maximize ILBO over different subsets. Two parameter-shared graph encoders generate high-dimensional representations of the two subsets, and discrimination scores are calculated based on the similarity of node pairs from both subsets. These scores guide the selection of positive and negative pairs, enriching the diversity of the constructed self-supervision. The training process is thus transformed into learning correlations between two subsets of the given graph data. This process is illustrated in Fig. 1 and Algorithm 1.

Benefiting from both the contrastive loss and the cross-view consistency loss, the \mathcal{M}-ILBO algorithm gathers more and more task-relevant information and obtains better representations.

2.4. Discussion

2.4.1. \mathcal{M}-ILBO

Intuitively, \mathcal{M}-ILBO is rooted in two classical techniques, i.e., Bagging (Breiman, 1996) and Dropout (Srivastava et al., 2014). Here, their connections are discussed and analyzed.

Connection to Bagging: \mathcal{M}-ILBO samples two different subsets as the two views, X^{(1)} and X^{(2)}, to train the model in each epoch. That is, different inputs are fed into one encoder in different epochs, so the model parameters are updated with two input subsets in each epoch, and the updated model serves as the pre-trained model for the next epoch.

In each epoch, because of using different inputs, the model is trained in an ensemble way. Iteratively, one updated model is further updated as the pre-trained model in the next epoch. After multiple iterations, it is equivalent to training multiple models, and each model participates in an ensemble on the final result, which improves the generalization ability of the model. The optimized model can be seen as the fusion of multiple discriminators. With the trained model, the final representation is obtained from the raw dataset XX. The process simulates classical Bagging that generates multiple classifiers by subsets of the training set.

Connection to Dropout: The effect of the information ensemble of our approach is somewhat similar to the effect of Dropout (Srivastava et al., 2014) in deep neural networks, which drops neurons in proportion in each batch. By using Dropout, a sub-network is trained in each batch, while our discriminator is trained with different input subsets.

The ensemble of multiple models generally yields better predictions than a single one. Bagging (Breiman, 1996) has shown great potentiality in improving performance (Bauer and Kohavi, 1999). Bagging is a method for generating multiple classifiers by subsets of the training set. Then, these classifiers are combined by voting, which theoretically reduces the variance of predictions and improves performance (Bühlmann and Yu, 2002).

In this paper, we try to learn in the form of Bagging (Breiman, 1996) and Dropout (Srivastava et al., 2014). Referring to the Dropout effect in deep neural networks (Srivastava et al., 2014), we utilize multiple subsets to train one parameter-shared encoder.

Our discriminator estimates the mutual information of the two subsets of the raw input. In each epoch, the contrastive loss maximizes the mutual information of two subsets of the raw dataset. After multiple sampling, the distribution of the sampled subsets will cover the distribution of the raw dataset. Thus, after multiple training with different inputs of each epoch, the discriminator estimates the information entropy of the raw dataset.

2.4.2. Representation Ability

To improve the representation ability of \mathcal{M}-ILBO, we consider exploiting a graph encoder with a contrastive loss and a cross-view consistency learning loss. For the contrastive loss, a well-designed selection strategy of positive and negative pairs is proposed in this paper. The encoder and two loss functions are useful to improve the representation ability.

A good view encoder f is able to aggregate useful information while removing redundant information (Tishby et al., 2000; Wu et al., 2020), i.e., \min\,{\rm I}(X,Z)-\beta\,{\rm I}(Y,Z), where Y is the ground-truth label indication matrix denoting the target of the downstream task and \beta is a trade-off hyper-parameter. An ideal encoder would compress the entropy of X to the entropy of Y. We use the graph convolutional network (GCN) (Kipf and Welling, 2017) as the graph encoder since GCN has powerful graph data encoding and compression ability. At the beginning, its representation ability is low, so the entropy of Z is far from that of Y; as the encoder is trained well, it tends to approach Y.
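For concreteness, a dense two-layer GCN encoder of the kind used as the backbone here could be sketched as follows (the symmetric normalization with self-loops follows Kipf and Welling (2017); the layer widths are illustrative):

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two-layer GCN f(H, A | theta) producing node embeddings Z."""
    def __init__(self, in_dim: int, hid_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    @staticmethod
    def normalize(a: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}.
        a_hat = a + torch.eye(a.size(0), device=a.device)
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

    def forward(self, h: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        a_norm = self.normalize(a)
        z = torch.relu(a_norm @ self.w1(h))
        return a_norm @ self.w2(z)
```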

With the contrastive loss, the cross-view consistency loss, and the well-designed positive and negative pair selection strategy, \mathcal{M}-ILBO gathers more and more task-relevant information and obtains better representations. The representation ability improves as the number of epochs increases. In each epoch, the contrastive loss guides the encoder in estimating the mutual information between the two subsets. The upper bound of their mutual information is the information entropy of the raw set since the two subsets come from the same raw set. Training with different subsets enables the features to cover the information of the entire dataset. The positive and negative sample selection strategy maximizes the mutual information in the direction of the ground-truth indication matrix Y.

3. Experiments

We conduct various experiments to evaluate the proposed \mathcal{M}-ILBO method on node classification tasks. The focuses of the experiments are to validate the representation ability of the learned features, and the effectiveness of different contrastive objectives. We report the classification accuracy which is the percentage of nodes being correctly predicted.

We use different datasets for experiments. We compare our \mathcal{M}-ILBO approach to different baselines on the citation network datasets and the co-occurrence network datasets.

The three citation networks are Cora (McCallum et al., 2000), Citeseer (Giles et al., 1998), and Pubmed (Namata et al., 2012). In these datasets, nodes represent papers and edges denote citation relationships. The bag-of-words representation of a paper is regarded as the node feature, and labels are academic fields.

The four co-occurrence networks are Computer (Shchur et al., 2018), Photo (Shchur et al., 2018), CS (Shchur et al., 2018), and Physics (Shchur et al., 2018). Specifically, Computer and Photo are two co-purchase graphs from Amazon, in which nodes are products and an edge between two nodes indicates that the two products are purchased together. The sparse bag-of-words attribute vector serves as the node feature. CS and Physics are two co-authorship academic networks from the Microsoft Academic Graph, in which nodes denote authors and edges denote co-author relationships. The bag-of-words representation of the paper keywords is regarded as the node feature, and labels are academic fields.

In order to highlight the outstanding representation learning performance of \mathcal{M}-ILBO, we compare the proposed \mathcal{M}-ILBO method with the unsupervised methods and the supervised methods. Specifically, the unsupervised methods include DeepWalk (Perozzi et al., 2014), GAE (Kipf and Welling, 2016), DGI (Velickovic et al., 2019), MVGRL (Hassani and Khasahmadi, 2020), GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021), and CCA-SSG (Zhang et al., 2021b); and the supervised methods include MLP, Label Propagation (LP) (Zhu et al., 2003), GCN (Kipf and Welling, 2017), and GAT (Veličković et al., 2018).

We first train the model on all nodes of a graph following Algorithm 1. Following the setting in DGI (Velickovic et al., 2019), after the encoder is trained, we feed X into the learned graph encoder to obtain Z. Z is then fed into a downstream classifier, i.e., linear logistic regression, which outputs the predicted label for each node. For the logistic regression classifier, we only use the training set to optimize the classifier and report the classification accuracy on the testing set. The classification accuracies in Tables 1 and 2 are obtained by the logistic regression classifier. The drop rates p_{h} and p_{a} are selected in the range [0, 0.99], and \lambda is selected in the range [0.05, 0.4]. We train \mathcal{M}-ILBO using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10^{-3}. Without loss of generality, we use a two-layer GCN as the backbone.
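The linear evaluation step can be sketched as below, fitting a logistic-regression probe on the frozen embeddings (the embeddings and labels are assumed to be NumPy arrays, e.g., obtained via Z.detach().cpu().numpy(); the split indices are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_evaluate(z, labels, train_idx, test_idx):
    """Fit a logistic-regression classifier on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(z[train_idx], labels[train_idx])
    return accuracy_score(labels[test_idx], clf.predict(z[test_idx]))
```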

Table 1. Accuracy (mean% ± std%) comparison on citation network datasets (H denotes node features, A the adjacency matrix, S the diffusion matrix, and Y the label indication matrix). The proposed method outperforms the baselines. The best values are in bold, while the second-best numbers are underlined.
Methods | Input | Cora | Citeseer | Pubmed
Supervised:
  MLP (Veličković et al., 2018) | H, Y | 55.1 | 46.5 | 71.4
  LP (Zhu et al., 2003) | A, Y | 68.0 | 45.3 | 63.0
  GCN (Kipf and Welling, 2017) | H, A, Y | 81.5 | 70.3 | 79.0
  GAT (Veličković et al., 2018) | H, A, Y | 83.0±0.7 | 72.5±0.7 | 79.0±0.3
Unsupervised:
  Raw Features (Velickovic et al., 2019) | H | 47.9±0.4 | 49.3±0.2 | 69.1±0.3
  DeepWalk (Perozzi et al., 2014) | H, A | 70.7±0.6 | 51.4±0.5 | 74.3±0.9
  GAE (Kipf and Welling, 2016) | H, A | 71.5±0.4 | 65.8±0.4 | 72.1±0.5
  DGI (Velickovic et al., 2019) | H, A | 82.3±0.6 | 71.8±0.7 | 76.8±0.6
  MVGRL (Hassani and Khasahmadi, 2020) | H, A, S | 83.5±0.4 | 73.3±0.5 | 80.1±0.7
  GRACE (Zhu et al., 2020) | H, A | 81.9±0.4 | 71.2±0.5 | 80.6±0.4
  CCA-SSG (Zhang et al., 2021b) | H, A | 84.2±0.4 | 73.1±0.3 | 81.6±0.4
  \mathcal{M}-ILBO | H, A | 85.7±0.3 | 74.2±0.7 | 81.8±0.3
Table 2. Accuracy (mean% ± std%) comparison on co-occurrence network datasets. The proposed method achieves competitive performance compared to the baselines. The best values are in bold, while the second-best numbers are underlined.
Methods | Input | Computer | Photo | CS | Physics
Supervised:
  GCN (Kipf and Welling, 2017) | H, A, Y | 86.51±0.54 | 92.42±0.22 | 93.03±0.31 | 95.65±0.16
  GAT (Veličković et al., 2018) | H, A, Y | 86.93±0.29 | 92.56±0.35 | 92.31±0.24 | 95.47±0.15
Raw Features (Velickovic et al., 2019) | H | 73.81±0.00 | 78.53±0.00 | 90.37±0.00 | 93.58±0.00
DeepWalk (Perozzi et al., 2014) | H, A | 86.28±0.07 | 90.05±0.08 | 87.70±0.04 | 94.90±0.09
GAE (Kipf and Welling, 2016) | H, A | 85.27±0.19 | 91.62±0.13 | 90.01±0.71 | 94.92±0.07
DGI (Velickovic et al., 2019) | H, A | 83.95±0.47 | 91.61±0.22 | 92.15±0.63 | 94.51±0.52
MVGRL (Hassani and Khasahmadi, 2020) | H, A, S | 87.52±0.11 | 91.74±0.07 | 92.11±0.12 | 95.33±0.03
GRACE (Zhu et al., 2020) | H, A | 86.25±0.25 | 92.15±0.24 | 92.93±0.01 | 95.26±0.02
GCA (Zhu et al., 2021) | H, A | 87.85±0.31 | 92.49±0.09 | 93.10±0.01 | 95.68±0.05
CCA-SSG (Zhang et al., 2021b) | H, A | 88.74±0.28 | 93.14±0.14 | 93.31±0.22 | 95.38±0.06
\mathcal{M}-ILBO | H, A | 89.16±0.09 | 93.73±0.27 | 93.23±0.09 | 95.43±0.09

3.1. Comparison with the state-of-the-arts

In Tables 1 and 2, we separately report the performance of the proposed method compared to a variety of state-of-the-art node classification results on the citation network datasets and the co-occurrence network datasets.

We have the following observations: (1) Among all self-supervised learning strategies, \mathcal{M}-ILBO achieves competitive performance on the seven datasets, showing state-of-the-art results on most of them. Specifically, previous works such as DeepWalk (Perozzi et al., 2014) and DGI (Velickovic et al., 2019) merely leverage a single raw dataset. In contrast, \mathcal{M}-ILBO is trained with an information ensemble and successfully leverages the ensemble of multiple subsets of the raw dataset to yield better predictions. (2) The reconstruction-based method, e.g., GAE, has lower accuracy than \mathcal{M}-ILBO; \mathcal{M}-ILBO emphasizes generating representations that include useful information via contrastive learning. (3) Even without the supervision of ground-truth labels, our results are higher than those of the supervised methods. \mathcal{M}-ILBO assembles rich predictive information from the raw dataset so that the learned representations are more robust and expressive, leading to better performance on the node classification task. Training with different subsets enables the features to cover the information of the entire dataset. The positive and negative sample selection strategy maximizes the mutual information in the direction of task correlation, which contributes to the efficiency of \mathcal{M}-ILBO.

Table 3. Ablation study of node classification. The best values are in bold.
Methods | Input | Cora | Citeseer | Pubmed
only consistency | H, A | 82.0 | 73.2 | 70.1
shuffling strategy | H, A | 83.9 | 72.9 | 75.8
\mathcal{M}-ILBO | H, A | 85.5 | 73.7 | 82.2

3.2. Ablation study

In order to demonstrate the merits of our strategy for supervision pairs, we compare it with the classic shuffling strategy. We also leave out the contrastive loss and use only the consistency loss to show that contrastive learning is useful for improving the representation ability. We name the three experiments "\mathcal{M}-ILBO strategy", "shuffling strategy", and "only consistency". We conduct the ablation study on the citation datasets, and the accuracy is shown in Table 3.

From Table 3, we have three observations: (1) The accuracy of the "\mathcal{M}-ILBO strategy" is higher than the others, indicating the effectiveness and necessity of the modified contrastive loss. We believe this indicates that \mathcal{M}-ILBO comprehensively explores the underlying semantic information and important connective structural information, rather than just using samples with the same index in different subsets as positive pairs. (2) The accuracy of the "shuffling strategy" is 83.9% on the Cora dataset, while the "\mathcal{M}-ILBO strategy" shows a 1.6% performance gain. This gap indicates a substantial benefit from the selection strategy for positive and negative pairs. (3) The accuracy of "only consistency" is lower than the other two on Cora and Pubmed, which indicates that it is crucial to contrast negative pairs. With the consistency loss, the distributions of the learned two-view representations are close to each other, and the representations tend to capture consensus semantic information.

The representation ability of the discriminator is boosted by the modified contrastive loss, and it does learn more discriminative representations than “shuffling strategy” and “only consistency”.

3.3. Parameter sensitivity analysis

For unsupervised contrastive learning methods, parameter insensitivity is vital for stability. Since \lambda is an important parameter in the experiments, we design a sensitivity experiment for \lambda on the citation network datasets, as shown in Fig. 2(a). It can be seen from Fig. 2(a) that the test accuracy is stable when its value varies in the range [0.1, 1] with intervals of 0.1.

Figs. 2(b)-(d) show the sensitivity of p_{h} and p_{a} on the citation datasets. Specifically, we set p_{h} and p_{a} to \{0.01, 0.1, \ldots, 0.9\} and report the test accuracy. It can be observed from Figs. 2(b)-(d) that competitive accuracy is obtained over a wide range of p_{h} and p_{a}.

Figure 2. Parameter sensitivity experiments for \lambda, p_{h}, and p_{a}.

4. Related Work

A typical contrastive learning model usually augments a raw dataset to construct multiple views and contrasts pairwise representations from different views (Oord et al., 2018). Contrastive learning aims to maximize the similarity scores of positive pairs and minimize those of negative pairs. Since maximizing the mutual information between different representations is the theoretical basis of contrastive learning, InfoNCE estimates the mutual information by maximizing its lower bound (Oord et al., 2018). The contrastive loss can be given by Eq. (14). Assuming that positive pairs are drawn from the joint distribution and negative pairs from the product of the marginal distributions, Eq. (14) effectively maximizes the mutual information between X^{(1)} and X^{(2)} (Oord et al., 2018).

In graph contrastive learning, Deep Graph InfoMax (DGI) (Velickovic et al., 2019) maximizes the mutual information between a global feature and a local feature. DGI is inspired by the previous success of the Deep InfoMax (DIM) (Hjelm et al., 2019). DIM maximizes the mutual information between the image patch and corresponding learned high-level representations.

MVGRL (Hassani and Khasahmadi, 2020) expands DGI and exploits two graphs. Both DGI and MVGRL maximize the mutual information between a local feature and a global feature. The global feature is aggregated from all node features of the whole graph, which is different from DIM. In DIM, a single instance is an image, while in DGI it is a node. An image, its patch, and its representation share the same semantic information, and DIM is used for image classification. However, a node and the whole graph do not share similar semantic information in DGI. Later, some graph contrastive learning methods (Peng et al., 2020; Zhu et al., 2020; Wan et al., 2021) contrast node representations from different views.

In this paper, we concentrate on contrastive learning and maximize the agreement of different views. Most existing methods focus on feature information while ignoring the importance of category information, and we select positive pairs by further selecting samples of the same class in addition to different representations from the same instance.

Consistency regularization follows the continuity assumption (Bachman et al., 2014), which argues that similar data points should have essentially consistent outputs after mapping (Tarvainen and Valpola, 2017). In other words, for unlabeled data, whether the model or the data is perturbed, the corresponding predicted results should be consistent.

Following this assumption, many works incorporate consistency into their methods. The most widely used and most basic consistency loss is the \ell_{2} regularization. The Ladder network (Rasmus et al., 2015) evaluates each data point with and without noise and then applies a consistency cost between the two predictions to learn the denoising function, which maps the corrupted output onto the denoising targets. Mean Teacher (Tarvainen and Valpola, 2017) defines the consistency cost as the expected distance between the prediction of the student model and that of the teacher model. The \Pi-model (Laine and Aila, 2017) encourages consistent network outputs under two different dropout conditions.

In the field of graph contrastive learning, most methods (Peng et al., 2020; Hassani and Khasahmadi, 2020; Zhu et al., 2020; You et al., 2020) generate different views by perturbing graph structures. The intuition is that the data augmentation schemes force the model to learn representations that are insensitive to perturbations at unimportant nodes and edges. GRACE (Zhu et al., 2020) and GraphCL (You et al., 2020) generate views with different types of graph augmentations and learn node representations by maximizing the agreement of node representations in the two views. Inherited from consistency theory, we maximize the agreement between the outputs of two subsets of the input.

5. Conclusion

We introduce the concept of ILBO. Like ELBO (the Evidence Lower BOund), ILBO is determined by the raw dataset and is maximized to find the model weights. The dataset structure is determined by the entropy of the dataset, and we introduce an entropy neural estimation by maximizing ILBO (\mathcal{M}-ILBO) algorithm for unsupervised contrastive learning. Contrastive learning has demonstrated the success of neural estimation of the mutual information between high-dimensional data, and contrastive learning algorithms maximize the mutual information of two-view features. We use a novel contrastive loss to train a parameter-shared graph encoder on two subsets sampled from the raw dataset in every epoch, thereby maximizing ILBO, and a cross-view consistency loss renders each view to obtain the common information. In each epoch of the training process, the contrastive loss guides the encoder in estimating the mutual information between the two subsets. The upper bound of their mutual information is the entropy of the raw dataset, since the two subsets are sampled from the same raw dataset.

The method for estimating information entropy is to train with subsets of the raw dataset. The abundant experimental results indicate that the proposed \mathcal{M}-ILBO achieves superior performance on various datasets for graph representation learning compared to state-of-the-art methods. As mentioned above, constructing the supervision pairs has higher memory requirements than the shuffling strategy, which can restrict its applicability to large datasets; this aspect is time-consuming and deserves improvement. Extensive experiments under various conditions verify the performance of \mathcal{M}-ILBO on several graph datasets. Compared to existing state-of-the-art unsupervised representation learning algorithms, \mathcal{M}-ILBO achieves impressive performance and exhibits satisfactory generalization ability.

References

  • Bachman et al. (2014) Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with Pseudo-Ensembles. In NeurIPS. 3365–3373.
  • Bauer and Kohavi (1999) Eric Bauer and Ron Kohavi. 1999. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36, 1 (1999), 105–139.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. 2018. Mutual information neural estimation. In ICML. 531–540.
  • Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, Vol. 32.
  • Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (1996), 123–140.
  • Bromley et al. (1993) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature Verification Using a “Siamese” Time Delay Neural Network. In NeurIPS. 737–744.
  • Bühlmann and Yu (2002) Peter Bühlmann and Bin Yu. 2002. Analyzing bagging. The Annals of Statistics 30, 4 (2002), 927–961.
  • Chopra et al. (2005) S. Chopra, R. Hadsell, and Y. LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR. 539–546.
  • Giles et al. (1998) C Lee Giles, Kurt D Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital Libraries. 89–98.
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR. 1735–1742.
  • Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In ICML. 4116–4126.
  • Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Laine and Aila (2017) Samuli Laine and Timo Aila. 2017. Temporal Ensembling for Semi-Supervised Learning. In ICLR.
  • McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3, 2 (2000), 127–163.
  • Namata et al. (2012) Galileo Namata, Ben London, Lise Getoor, Bert Huang, and U Edu. 2012. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, Vol. 8. 1.
  • Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS. 271–279.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Peng et al. (2020) Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph representation learning via graphical mutual information maximization. In WWW. 259–270.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In KDD. 701–710.
  • Poole et al. (2019) Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. 2019. On variational bounds of mutual information. In ICML. 5171–5180.
  • Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-Supervised Learning with Ladder Network. In NeurIPS. 3546–3554.
  • Rong et al. (2020) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards deep graph convolutional networks on node classification. In ICLR.
  • Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS. 1171–1179.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057 (2000).
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
  • Velickovic et al. (2019) Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep graph infomax. In ICLR.
  • Wan et al. (2021) Sheng Wan, Shirui Pan, Jian Yang, and Chen Gong. 2021. Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning. In AAAI. 10049–10057.
  • Wu et al. (2020) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. 2020. Graph information bottleneck. In NeurIPS, Vol. 33. 20437–20448.
  • Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. In NeurIPS. 6256–6268.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. In NeurIPS. 5812–5823.
  • Zhang et al. (2021a) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021a. FlexMatch: Boosting Semi-supervised Learning with Curriculum Pseudo Labeling. In NeurIPS.
  • Zhang et al. (2021b) Hengrui Zhang, Qitian Wu, Junchi Yan, David Wipf, and S Yu Philip. 2021b. From canonical correlation analysis to self-supervised graph neural networks. In NeurIPS.
  • Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using gaussian fields and harmonic functions. In ICML. 912–919.
  • Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. In ICML Workshop on Graph Representation Learning and Beyond.
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In WWW. 2069–2080.