
Perfect Alignment May be Poisonous to Graph Contrastive Learning

Jingyu Liu    Huayi Tang    Yong Liu
Abstract

Graph Contrastive Learning (GCL) aims to learn node representations by aligning positive pairs and separating negative ones. However, few researchers have examined the principles behind the specific augmentations used in graph-based learning: what kind of augmentation helps downstream performance, how contrastive learning actually influences downstream tasks, and why the magnitude of augmentation matters so much. This paper seeks to address these questions by establishing a connection between augmentation and downstream performance. Our findings reveal that GCL contributes to downstream tasks mainly by separating different classes rather than gathering nodes of the same class, so perfect alignment and augmentation overlap, which draw all intra-class samples together, cannot fully explain the success of contrastive learning. To understand how augmentation aids the contrastive learning process, we investigate generalization further, finding that perfect alignment, which draws positive pairs to identical embeddings, can help the contrastive loss but is poisonous to generalization. As a result, perfect alignment may not lead to the best downstream performance, and specifically designed augmentation is needed to achieve an appropriate level of alignment and improve downstream accuracy. We further analyze the result through information theory and graph spectrum theory and propose two simple but effective methods to verify the theories. The two methods can easily be applied to various GCL algorithms, and extensive experiments are conducted to prove their effectiveness. The code is available at https://github.com/somebodyhh1/GRACEIS

Machine Learning, ICML

1 Introduction

Graph Neural Networks (GNNs) have been successfully applied in various fields (Yang et al., 2022a), such as recommendation systems (He et al., 2020), drug discovery (Liu et al., 2018), and traffic analysis (Wu et al., 2019), among others (Yang et al., 2023, 2024; Tang & Liu, 2023). However, most GNNs require labeled data for training, which may not always be available. To address this issue, Graph Contrastive Learning (GCL), which does not rely on labels, has gained popularity as a new approach to graph representation learning (Veličković et al., 2018; You et al., 2020).

GCL often generates new graph views through data augmentation (Chen et al., 2020; Zhu et al., 2020; Jia & Zhang, 2022; Mumuni et al., 2024). GCL treats nodes augmented from the same node as positive samples and all others as negative samples. The model then tries to maximize similarity between positive samples and minimize similarity between negative ones (Wang & Isola, 2020; Hassani & Khasahmadi, 2020) to learn better representations. Data augmentation thus plays a vital role in graph contrastive learning, and it can be categorized into three types (Zhao et al., 2022): random augmentation (Veličković et al., 2018; Zhu et al., 2020), rule-based augmentation (Zhu et al., 2021; Wei et al., 2023; Liu et al., 2022), and learning-based augmentation (Suresh et al., 2021; Jiang et al., 2019). For instance, GRACE (Zhu et al., 2020) randomly masks node attributes and edges to obtain augmented graphs; GCA (Zhu et al., 2021) uses node degree to measure importance and masks unimportant nodes with higher probability; and AD-GCL (Suresh et al., 2021) uses a model to learn the best augmentation and remove as much irrelevant information as possible. However, most data augmentation algorithms are designed heuristically, and there is a lack of theoretical analysis of how these methods influence downstream performance.

Some researchers have explored the generalization ability of contrastive learning (Arora et al., 2019; Wang & Isola, 2020; Huang et al., 2021). They propose that contrastive learning works by gathering positive pairs and separating negative samples uniformly. Wang et al. (2022b) argue that perfect alignment and uniformity alone cannot guarantee optimal performance. They propose that with stronger augmentation there will be support overlap between different intra-class samples, which is called augmentation overlap (Saunshi et al., 2022; Huang et al., 2021). Augmentation overlap between two nodes means that their corresponding augmented nodes could coincide, so aligning each anchor node with its augmented node also aligns the two anchor nodes. Thus, the alignment of positive samples also clusters all intra-class samples together, and due to the limited inter-class overlap, inter-class nodes will not be gathered. However, Saunshi et al. (2022) point out that augmentation overlap may be relatively rare despite the excellent performance of contrastive learning methods. Hence, the success of contrastive learning may not be solely attributable to alignment and augmentation overlap. It is of vital importance to figure out how augmentation works in the contrastive learning process, why the magnitude of augmentation matters so much, and how to perform better augmentation. As data augmentation on graphs can be highly customized and its magnitude can be clearly represented by the number of modified edges/nodes (You et al., 2020), we mainly study augmentation on graphs.

In this paper, we provide a new understanding of Graph Contrastive Learning and take a theoretical approach to analyzing the impact of augmentation on the contrastive learning process. We find that with stronger augmentation, the model performs better mainly because of inter-class separation rather than the intra-class gathering brought by augmentation overlap. This aligns with the finding that augmentation overlap is actually quite rare in contrastive learning (Saunshi et al., 2022). Wang et al. (2022b) propose that stronger augmentation helps because it creates more augmentation overlap, so that more intra-class nodes are gathered through alignment. However, stronger augmentation naturally conflicts with better alignment, so whether stronger augmentation helps intra-class gathering remains questionable. Moreover, stronger augmentation leads to better performance even as alignment weakens, so we also question whether perfect alignment actually helps contrastive learning.

To further analyze these phenomena, we formulate a relationship between downstream accuracy, contrastive learning loss, and alignment performance, and find that the weak alignment caused by stronger augmentation can benefit generalization. This explains why stronger augmentation leads to better performance and reveals that perfect alignment is not the key to success: it may help decrease the contrastive loss, but it also enlarges the gap between contrastive learning and the downstream task. A specifically designed augmentation strategy is therefore needed to achieve appropriate alignment and the best downstream accuracy. This is why augmentation matters so much in contrastive learning.

Then, aiming for better downstream accuracy, we need to figure out how to perform augmentation so as to strike a better balance between contrastive loss and generalization. We therefore further analyze the contrastive process through information theory and graph spectrum theory. From the information theory perspective, we find that augmentation should be strong while keeping enough information, a principle widely adopted, explicitly or implicitly, by designed algorithms (Zhu et al., 2021, 2020; Suresh et al., 2021). From the graph spectrum perspective, we analyze how the graph spectrum affects the contrastive loss and generalization (Liu et al., 2022), finding that a non-smooth spectral distribution has a negative impact on generalization. We then propose two methods based on these theories to verify our findings.

Our main contributions are as follows. (1) We reveal that when stronger augmentation is applied, contrastive learning benefits more from inter-class separation than intra-class gathering, and better alignment may not help as it conflicts with stronger augmentation. (2) We establish the relationship between downstream performance, contrastive learning loss, and alignment performance, finding that better alignment can weaken generalization, which shows why stronger augmentation helps; we then analyze the result from the perspectives of information theory and graph spectrum theory. (3) Based on the theoretical results, we provide two simple but effective algorithms to verify the correctness of the theory. These algorithms can be extended to various contrastive learning methods to enhance their performance. (4) Extensive experiments are conducted on different contrastive learning algorithms and datasets using our proposed methods to demonstrate their effectiveness and verify our theory.

2 Augmentation and Generalization

2.1 Preliminaries

A graph can be represented as ${\mathcal{G}}=({\mathcal{V}},{\mathcal{E}})$, where ${\mathcal{V}}$ is the set of $N$ nodes and ${\mathcal{E}}\subseteq{\mathcal{V}}\times{\mathcal{V}}$ is the edge set. The feature matrix and the adjacency matrix are denoted ${\bm{X}}\in\mathbb{R}^{N\times F}$ and ${\bm{A}}\in\{0,1\}^{N\times N}$, where $F$ is the dimension of the input features, ${\bm{x}}_i\in\mathbb{R}^{F}$ is the feature of node $v_i$, and ${\bm{A}}_{i,j}=1$ iff $(v_i,v_j)\in{\mathcal{E}}$. The node degree matrix is ${\bm{D}}=\mathrm{diag}(d_1,d_2,\dots,d_N)$, where $d_i$ is the degree of node $v_i$.

In contrastive learning, data augmentation is used to create new graphs ${\mathcal{G}}^1,{\mathcal{G}}^2\in{\mathbb{G}}^{\text{aug}}$, with corresponding nodes, edges, and adjacency matrices ${\mathcal{V}}^1,{\mathcal{E}}^1,{\bm{A}}^1$ and ${\mathcal{V}}^2,{\mathcal{E}}^2,{\bm{A}}^2$. In the rest of the paper, $v$ denotes any node, original or augmented; $v_i^+$ denotes the augmented nodes, covering both $v_i^1$ and $v_i^2$; and $v_i^0$ denotes the original node only.

Nodes augmented from the same node, such as $(v_i^1,v_i^2)$, are considered positive pairs, while others are considered negative pairs. Note that a negative pair can come from the same graph: for node $v_i^1$, its negative samples are $v_i^-\in\{v_j^+\,|\,j\neq i\}$. Graph Contrastive Learning (GCL) learns an encoder that draws the embeddings of positive pairs together and pushes those of negative pairs apart (Chen et al., 2020; Wang & Isola, 2020). The encoder computes the embedding of node $v_i$ as $f(v_i)$, and we assume $\|f(v_i)\|=1$.

2.2 How Does Augmentation Affect Downstream Performance

Previous work (Wang & Isola, 2020) proposes that effective contrastive learning should satisfy alignment and uniformity, meaning that positive samples should have similar embeddings, i.e., $f(v_i^1)\approx f(v_i^2)$, and features should be uniformly distributed on the unit hypersphere. However, Wang et al. (2022b) pointed out that perfect alignment and uniformity do not ensure good performance. For instance, when $\{f(v_i^0)\}_{i=1}^{N}$ are uniformly distributed and $f(v_i^0)=f(v_i^+)$, the model may converge to a trivial solution that projects only very similar features to the same embedding and projects other features randomly; it then performs random classification on downstream tasks despite achieving perfect alignment and uniformity.

Wang et al. (2022b) argue that perfect alignment together with intra-class augmentation overlap would be the best solution. Augmentation overlap means support overlap between different intra-class samples, and stronger augmentation is likely to produce more of it. If two intra-class samples have augmentation overlap, the best solution is to project the two samples and their augmentations to the same embedding, which is called perfect alignment. For example, if two different nodes $v_i^0$, $v_j^0$ share an augmentation $v^+$, the best solution to contrastive learning is $f(v_i^0)=f(v^+)=f(v_j^0)$. As intra-class nodes are naturally closer, augmentation overlap occurs mostly between intra-class nodes, so perfect alignment and augmentation overlap could aid intra-class gathering.

However, Saunshi et al. (2022) propose that augmentation overlap is actually quite rare in practice, even with strong augmentation. Moreover, augmentation overlap requires strong augmentation, which makes alignment harder and conflicts with perfect alignment, so the success of contrastive learning may not be attributable to the intra-class gathering brought by augmentation overlap. It is therefore important to understand how contrastive learning works and why stronger augmentation helps. More related work is introduced in Appendix E.

To begin with, we make an assumption on the label consistency of positive samples, i.e., that the class label does not change after augmentation.

Assumption 2.1 (View Invariance).

For node $v_i^0$, the corresponding augmented nodes $v_i^+$ have consistent labels; i.e., we assume the labels are deterministic (one-hot) and consistent: $p(y|v_i^0)=p(y|v_i^+)$.

This assumption is widely adopted (Arora et al., 2019; Wang et al., 2022b; Saunshi et al., 2022) and reasonable: if the augmentation keeps the basic structure and most of the feature information, the class label is unlikely to change. If instead the augmentation destroys the basic label information, the model tends to learn a trivial solution; this situation is uninformative, so we do not discuss it. Graph data maintains strong label consistency under strong augmentation, as discussed in Appendix C.3.

To further understand how data augmentation works in contrastive learning, we use graph edit distance (GED) to denote the magnitude of augmentation. Trivedi et al. (2022) propose that all allowable augmentations can be expressed via GED, defined as the minimum cost of a graph edit sequence (node insertion, node deletion, edge deletion, feature transformation) transforming graph ${\mathcal{G}}^0$ into ${\mathcal{G}}^+$. A stronger augmentation is thus one with a larger graph edit distance.
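To make the notion of augmentation magnitude concrete, the following sketch (our illustration, not the paper's code) performs random edge/feature dropout and measures a simple GED proxy as the number of edge and feature edits; the dropout rates directly control the edit distance:

```python
import numpy as np

def augment(adj, feat, p_edge, p_feat, rng):
    """Random augmentation whose magnitude (graph edit distance) grows with the dropout rates.
    adj: (N, N) symmetric 0/1 adjacency matrix; feat: (N, F) feature matrix."""
    # drop each existing edge with probability p_edge (mask upper triangle, then mirror)
    keep = rng.random(adj.shape) >= p_edge
    keep = np.triu(keep, 1)
    keep = keep | keep.T
    new_adj = adj * keep
    # zero out each feature entry independently with probability p_feat
    new_feat = feat * (rng.random(feat.shape) >= p_feat)
    return new_adj, new_feat

def edit_distance(adj0, adj1, feat0, feat1):
    """A simple GED proxy: number of edge edits plus number of feature edits."""
    edge_edits = int(np.abs(adj0 - adj1).sum() // 2)  # undirected: each edge counted twice
    feat_edits = int((feat0 != feat1).sum())
    return edge_edits + feat_edits
```

With larger dropout rates, the expected edit distance grows, matching the idea that the dropout rate represents the strength of augmentation.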

Figure 1: PCD means positive center distance ($\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_y\|$), NCD means negative center distance ($\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_{y^-}\|$), and accuracy is the downstream performance. The x-axis is the dropout rate for both edges and features.
Assumption 2.2 (Augmentation Distance and Augmentation).

While Assumption 2.1 holds, i.e., $p(y|v_i^0)=p(y|v_i^+)$, as the augmentation gets stronger, the augmentation distance $\delta_{aug}^2=\mathbb{E}_{p(v_i^0,v_i^+)}\|f(v_i^0)-f(v_i^+)\|^2$ increases, i.e., $\delta_{aug}\propto\mathrm{GED}({\mathcal{G}}^0,{\mathcal{G}}^+)$, where $\mathrm{GED}({\mathcal{G}}^0,{\mathcal{G}}^+)$ is the graph edit distance between ${\mathcal{G}}^0$ and ${\mathcal{G}}^+$.

This is a natural assumption that is likely to hold because a bigger difference in input leads to a bigger difference in output. Note also that $\delta_{aug}$ is the distance between the anchor node and its augmented node, so $\delta_{aug}$ naturally measures alignment performance: a smaller $\delta_{aug}$ means better alignment. Assumption 2.2 thus states that stronger augmentation leads to larger $\delta_{aug}$, i.e., worse alignment. This phenomenon is common in practice, as shown in Appendix C.3.
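The quantity in Assumption 2.2 can be estimated directly from view embeddings; a minimal sketch (our illustration), assuming L2-normalized embedding matrices for the original view and the two augmented views:

```python
import numpy as np

def augmentation_distance(z0, z1, z2):
    """delta_aug^2 = E ||f(v_i^0) - f(v_i^+)||^2, averaged over both augmented views.
    z0, z1, z2: (N, d) L2-normalized embeddings of the original and two augmented views.
    Returns delta_aug^2 (take a square root for delta_aug itself)."""
    d1 = np.sum((z0 - z1) ** 2, axis=1)  # distances to view 1
    d2 = np.sum((z0 - z2) ** 2, axis=1)  # distances to view 2
    return float(np.mean(np.concatenate([d1, d2])))
```

Tracking this estimate while increasing the dropout rate is one way to check empirically that stronger augmentation worsens alignment.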

Definition 2.3.

The class center is the expectation over all nodes belonging to the class, i.e., $\mu_y=\mathbb{E}_{p(v,y)}\left[f(v_y)\right]$. We use $\delta_{y^+}$ and $\delta_{y^-}$ to denote the intra-class and inter-class divergence, respectively:

$$\delta_{y^+}^2=\mathbb{E}_{p(y,i,j)}\|f(v^0_{y,i})-f(v^0_{y,j})\|^2,\qquad\delta_{y^-}^2=\mathbb{E}_{p(y,y^-,i,j)}\|f(v^0_{y,i})-f(v^0_{y^-,j})\|^2,$$

where $y^-$ stands for a class different from $y$.
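These divergences can be estimated empirically from embeddings and labels; a small sketch (hypothetical helper, not from the paper) averaging squared distances over intra-class and inter-class pairs:

```python
import numpy as np

def class_divergences(z, y):
    """Estimate delta_{y+} (intra-class) and delta_{y-} (inter-class) of Definition 2.3.
    z: (N, d) embeddings of original-view nodes; y: (N,) integer labels."""
    n = len(y)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.sum((z[i] - z[j]) ** 2)
            (intra if y[i] == y[j] else inter).append(d)
    return np.sqrt(np.mean(intra)), np.sqrt(np.mean(inter))
```

On well-clustered embeddings the inter-class divergence dominates the intra-class one, as the theory expects.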

Note that we calculate the class center $\mu_y$ by averaging nodes from both the original and the augmented views. Augmentation on graphs is highly biased (Zhang et al., 2022), i.e., the mean of the augmented nodes differs from the original node, so the class center shifts. Moreover, contrastive learning actually learns embeddings on the augmented views, so class clustering is largely affected by the augmentation; it is therefore more appropriate to include the augmented nodes when calculating the class center.

With the assumptions, we can get the theorem below:

Theorem 2.4 (Augmentation and Classification).

If Assumption 2.1 holds, we know that:

$$\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_y\|\leq\delta_{y^+}+\tfrac{2}{3}\delta_{aug},\qquad(1)$$
$$\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_{y^-}\|\leq\delta_{y^-}+\tfrac{2}{3}\delta_{aug}.\qquad(2)$$

The proof can be found in Appendix A.1. This shows that the distance between a node and a class center can be bounded by the augmentation distance $\delta_{aug}$ and the intra-class/inter-class divergences $\delta_{y^+}$, $\delta_{y^-}$. We refer to $\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_y\|$ and $\mathbb{E}_{p(v_y^0|y)}\|f(v_y^0)-\mu_{y^-}\|$ as the positive and negative center distance, respectively.
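The positive and negative center distances (plotted as PCD/NCD in Figure 1) can be computed as follows; for simplicity this sketch estimates class centers from original-view embeddings only, whereas the paper's centers also average over augmented views:

```python
import numpy as np

def center_distances(z, y, K):
    """Positive/negative center distances of Theorem 2.4.
    z: (N, d) embeddings; y: (N,) integer labels in [0, K)."""
    centers = np.stack([z[y == k].mean(axis=0) for k in range(K)])
    # PCD: mean distance of each node to its own class center
    pcd = np.mean([np.linalg.norm(z[i] - centers[y[i]]) for i in range(len(y))])
    # NCD: mean distance of each node to the centers of all other classes
    ncd = np.mean([np.linalg.norm(z[i] - centers[k])
                   for i in range(len(y)) for k in range(K) if k != y[i]])
    return pcd, ncd
```

Monitoring these two quantities across dropout rates reproduces the kind of curves shown in Figure 1.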

As assumed in Assumption 2.2, when the augmentation becomes stronger, the augmentation distance $\delta_{aug}$ increases. Both the positive and negative center distances are positively related to $\delta_{aug}$, so stronger augmentation separates both inter-class and intra-class nodes: it helps inter-class separation but hinders intra-class gathering. Yet downstream performance tends to improve with stronger augmentation (Wang et al., 2022b; Zhu et al., 2020), so the performance gain is likely driven by inter-class separation more than intra-class gathering.

The experiment shown in Figure 1 confirms this suspicion. We use dropout on edges and features to perform augmentation; the dropout rate naturally represents the magnitude of augmentation, i.e., the graph edit distance. We plot the positive/negative center distances and downstream accuracy. Figure 1 shows that initially, as the dropout rate increases, the positive center distance does not decrease, yet downstream performance improves as the negative center distance increases sharply. The improvement thus correlates with inter-class separation, while intra-class nodes may not be gathered.

We show the results on more datasets including shopping graph, graph with heterophily and coauthor network in Appendix C.2. From those experiments, we can conclude that contrastive learning mainly contributes to downstream tasks by separating nodes of different classes (increasing negative center distance) rather than gathering nodes of the same class (non-decreasing positive center distance). This explains why contrastive learning can achieve satisfying performance with limited augmentation overlap and relatively weak alignment (Saunshi et al., 2022).

We can also understand this phenomenon intuitively. The InfoNCE loss ${\mathcal{L}}_{\mathrm{NCE}}$ can be written as:

$${\mathcal{L}}_{\mathrm{NCE}}=\mathbb{E}_{p(v_i^1,v_i^2)}\,\mathbb{E}_{p(v_i^-)}\left[-\log\frac{\exp(f(v_i^1)^{T}f(v_i^2))}{\sum\exp(f(v_i^1)^{T}f(v_i^-))}\right].$$

The numerator captures positive pair similarity, so stronger augmentation makes positive pairs dissimilar and the numerator harder to maximize. GCL then pays more attention to minimizing the denominator, as shown in Appendix C.5. Minimizing the denominator separates negative samples, which effectively separates inter-class nodes since most negative samples come from different classes. In contrast, under stronger augmentation, augmentation overlap remains rare and positive pairs are harder to align, so intra-class nodes are hard to gather, and the existence of intra-class negative samples further weakens intra-class gathering. As a result, intra-class nodes may not move closer during contrastive learning. We can also observe in Figure 1 that when too many edges/features are dropped, downstream performance decreases sharply and both the positive and negative center similarities increase: too much information is lost, the basic assumption $p(y|v_i^0)=p(y|v_i^+)$ no longer holds, and a trivial solution is learned.
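A minimal implementation of this InfoNCE objective, with positives on the diagonal and all cross-view nodes serving as negatives (practical methods such as GRACE additionally use intra-view negatives; the temperature `tau` is a common extension we add here):

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE over two augmented views: (z1[i], z2[i]) is the positive pair and
    every other cross-view node is a negative. z1, z2: (N, d) normalized embeddings."""
    sim = (z1 @ z2.T) / tau
    # row-wise log-softmax with the positive on the diagonal (stabilized)
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Perfectly aligned, well-separated views give a low loss, while misaligned positives drive it up, which is exactly the numerator/denominator trade-off discussed above.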

2.3 Augmentation and Generalization

Although GCL with stronger augmentation may improve downstream performance, why it works remains unclear. We need to formalize the relationship between augmentation distance, contrastive loss, and downstream performance to further guide algorithm design. We first define the mean cross-entropy (CE) loss below and use it to represent downstream performance.

Definition 2.5 (Mean CE loss).

For an encoder $f$ and downstream labels $y\in\left[1,K\right]$, we use the mean CE loss $\hat{{\mathcal{L}}}_{\mathrm{CE}}=\mathbb{E}_{p(v^0,y)}\left[-\log\frac{\exp(f(v^0)^T\mu_y)}{\sum_{j=1}^{K}\exp(f(v^0)^T\mu_j)}\right]$ to evaluate downstream performance, where $\mu_j=\mathbb{E}_{p(v|y=j)}\left[f(v)\right]$.

The mean CE loss indicates downstream performance, as it requires nodes to be similar to their own class center and dissimilar from other class centers. It is also an upper bound of the CE loss ${\mathcal{L}}_{\mathrm{CE}}=\mathbb{E}_{p(v^0,y)}\left[-\log\frac{\exp(f(v^0)^T\omega_y)}{\sum_{i=1}^{K}\exp(f(v^0)^T\omega_i)}\right]$, where $\omega$ are the parameters of a linear classifier $g(z)=Wz$, $W=[\omega_1,\omega_2,\dots,\omega_k]$. Arora et al. (2019) showed that the mean classifier can achieve performance comparable to learned weights, so we analyze the mean CE loss instead of the CE loss in this paper.
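Definition 2.5 can be evaluated directly once the class centers are computed; a sketch (our illustration) of the mean CE loss with inner-product logits against class centers:

```python
import numpy as np

def class_centers(z, y, K):
    """mu_j = E[f(v) | y = j], estimated over the given embeddings."""
    return np.stack([z[y == j].mean(axis=0) for j in range(K)])

def mean_ce_loss(z, y, centers):
    """Mean CE loss of Definition 2.5: logits are inner products with the class centers."""
    logits = z @ centers.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_prob[np.arange(len(y)), y]))
```

When embeddings are well separated by class, the loss falls well below $\log K$, the value of a trivially uniform classifier.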

Theorem 2.6 (Generalization and Augmentation Distance).

If Assumption 2.1 holds and $\mathrm{ReLU}$ is applied as the activation, then the relationship between downstream performance and the $\mathrm{InfoNCE}$ loss can be represented as:

$$\hat{{\mathcal{L}}}_{\mathrm{CE}}\geq{\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^2-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}\operatorname{Var}(f(v^+)|y)-\sqrt{\operatorname{Var}(f(v^0)|y)}-e\operatorname{Var}(\mu_y)-O(M^{-\frac{1}{2}}),$$

where $M$ is the number of negative samples and $K$ is the number of classes. (Footnote: the generalization terms are correlated with $-\log M-O(M^{-\frac{1}{2}})$, which decreases as $M$ increases when $M$ is large, so the theorem encourages many negative samples.)

The proof can be found in Appendix A.2. Theorem 2.6 gives a lower bound on the mean CE loss; when we perform stronger augmentation, this lower bound becomes smaller. A smaller lower bound does not enforce better performance, but it admits a potentially better solution. For example, given two models with $\hat{{\mathcal{L}}}_{\mathrm{CE}}\geq 0.7$ and $\hat{{\mathcal{L}}}_{\mathrm{CE}}\geq 0.3$ respectively, the latter is preferred as it is more likely to perform better, while the former can never achieve a loss below 0.7. An upper bound, by contrast, describes the worst case: a smaller upper bound means the model performs better in the worst case. The experimental results in Appendix C.2 show that downstream performance tends to improve with stronger augmentation, matching the decreasing lower bound, so the model is powerful enough to stay close to the lower bound. Therefore, a smaller lower bound can lead to better performance.

Theorem 2.6 indicates a gap between $\hat{{\mathcal{L}}}_{\mathrm{CE}}$ and ${\mathcal{L}}_{\mathrm{NCE}}$: encoders that minimize ${\mathcal{L}}_{\mathrm{NCE}}$ may not yield optimal downstream performance. Furthermore, a higher augmentation distance $\delta_{aug}$ makes the bound smaller, enhancing generalization and improving downstream performance. This aligns with the previous finding that stronger augmentation helps downstream performance. Inequality (1) also shows that the positive center distance is positively related to $\delta_{aug}$, so better generalization correlates with a higher positive center distance, consistent with the earlier experiments in which better downstream performance came with a high positive center distance.

Theorem 2.6 also highlights the significance of augmentation magnitude in graph contrastive learning algorithms such as GRACE (Zhu et al., 2020). Weak augmentation yields better alignment but weak generalization: the InfoNCE loss may be low while downstream performance is poor (Saunshi et al., 2022). As augmentation gets stronger, perfect alignment cannot be achieved, but generalization improves and downstream performance potentially increases. When augmentation is too strong, minimizing the InfoNCE loss becomes challenging (Li et al., 2022), leading to poorer downstream performance. It is therefore crucial to determine the magnitude and manner of augmentation, as it directly affects contrastive performance and generalization.

3 Finding Better Augmentation

Previous sections revealed that perfect alignment may not help downstream performance; instead, stronger augmentation leading to larger $\delta_{aug}$ benefits generalization while weakening the contrastive learning process. We therefore need to determine how to perform augmentation to strike a better balance between augmentation distance and contrastive loss, yielding better downstream performance.

3.1 Information Theory Perspective

Due to the inherent connection between contrastive learning and information theory, we analyze the process from an information-theoretic perspective. As shown by Oord et al. (2018), ${\mathcal{L}}_{\mathrm{NCE}}$ is a lower bound of mutual information. Moreover, $\operatorname{Var}(f(v^0)|y)$, $\operatorname{Var}(f(v^+)|y)$, and $\operatorname{Var}(\mu_y)$ can be expressed through inherent properties of the graph and the augmentation distance $\delta_{aug}$. We can thus understand the process through information and augmentation, and reformulate Theorem 2.6 as follows:

Corollary 3.1 (CE with Mutual Information).

If Assumption 2.1 holds, the relationship between downstream performance, the mutual information between views, and the augmentation distance can be represented as:

$$\hat{{\mathcal{L}}}_{\mathrm{CE}}\geq\log(K)-I(v^1;v^2)-g(\delta_{aug})-O(M^{-\frac{1}{2}}),$$

where $I(v^1;v^2)$ stands for the mutual information between $v^1$ and $v^2$, and $g(\delta_{aug})$ is monotonically increasing in $\delta_{aug}$ and is defined in Appendix A.3.

The proof can be found in Appendix A.3. Corollary 3.1 suggests that the best augmentation maximizes both the mutual information and the augmentation distance. Tian et al. (2020) propose that a good augmentation should minimize $I(v^1;v^2)$ while preserving as much downstream-related information as possible, i.e., $I(v^1;y)=I(v^2;y)=I(v^0;y)$. However, the downstream task is unknown during pretraining, so this is actually impossible to achieve. Our theory indicates that augmentation should be strong while preserving as much information as possible; the best augmentation satisfies InfoMin, discarding all useless information while keeping what is relevant downstream. InfoMin thus describes an ideal augmentation that cannot be achieved, whereas we propose a practical target for training a better model.

To verify our theory, we propose a simple but effective method. We first identify important nodes, features, and edges, and leave them unchanged during augmentation to preserve mutual information. For the unimportant ones, we perform stronger augmentation to increase the augmentation distance.

We utilize gradients to identify which features of node $v$ are relatively important and carry more information. The importance of a feature is obtained by averaging its importance across all nodes, the importance of node $v$ by averaging the importance of its features, and the importance of an edge by averaging the importance of its two endpoints:

$$\alpha_{v,p}=\frac{\partial{\mathcal{L}}_{\mathrm{NCE}}}{\partial x_{v,p}},\quad\alpha_{p}=\mathrm{ReLU}\Big(\frac{1}{|V'|}\sum_{v}\alpha_{v,p}\Big),\quad\alpha_{v}=\mathrm{ReLU}\Big(\frac{1}{|P'|}\sum_{p}\alpha_{v,p}\Big),\quad\alpha_{e_{i,j}}=\left(\alpha_{v_i}+\alpha_{v_j}\right)/2,$$

where $\alpha_{v,p}$ is the importance of the $p^{th}$ feature of node $v$, $\alpha_p$ is the importance of the $p^{th}$ feature, $\alpha_v$ is the importance of node $v$, and $\alpha_{e_{i,j}}$ is the importance of edge $(v_i,v_j)$.
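A sketch of these importance scores: since this illustration avoids committing to a specific autograd framework, it approximates $\partial{\mathcal{L}}/\partial x_{v,p}$ with central finite differences for an arbitrary loss function (a real implementation would backpropagate through the encoder instead):

```python
import numpy as np

def importance_scores(loss_fn, X, eps=1e-4):
    """Feature and node importance mirroring alpha_{v,p} = dL/dx_{v,p}.
    loss_fn: maps a feature matrix X to a scalar loss (stand-in for L_NCE)."""
    grad = np.zeros_like(X)
    for v in range(X.shape[0]):
        for p in range(X.shape[1]):
            Xp = X.copy(); Xp[v, p] += eps
            Xm = X.copy(); Xm[v, p] -= eps
            grad[v, p] = (loss_fn(Xp) - loss_fn(Xm)) / (2 * eps)
    alpha_p = np.maximum(grad.mean(axis=0), 0.0)  # feature importance: ReLU of column mean
    alpha_v = np.maximum(grad.mean(axis=1), 0.0)  # node importance: ReLU of row mean
    return alpha_p, alpha_v

def edge_importance(alpha_v, edges):
    """alpha_{e_{ij}} = (alpha_{v_i} + alpha_{v_j}) / 2 for each edge (i, j)."""
    return np.array([(alpha_v[i] + alpha_v[j]) / 2 for i, j in edges])
```

Features whose perturbation changes the loss most receive the highest scores, matching the gradient definition above.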

For edges/features with high importance, we keep them unchanged during augmentation. For those with relatively low importance, we can freely mask them, but we ensure that the number of masked edges/features exceeds the number of retained ones so that $\delta_{aug}$ does not decrease. The process can be described by the following equation:

$$\tilde{{\bm{A}}}={\bm{A}}\ast({\bm{M}}_e\vee{\bm{S}}_e\wedge{\bm{D}}_e),\qquad\tilde{{\bm{F}}}={\bm{F}}\ast({\bm{M}}_f\vee{\bm{S}}_f\wedge{\bm{D}}_f),$$

where $\ast$ is the Hadamard product, $\vee$ stands for logical OR, and $\wedge$ stands for logical AND. ${\bm{M}}_e$, ${\bm{M}}_f$ are random mask matrices, which can be generated by any masking method. ${\bm{S}}_e$, ${\bm{S}}_f$ are importance-based retain matrices marking which edges/features are important and should be retained: for the top $\xi$ important edges/features, we set the corresponding entries of ${\bm{S}}_e$, ${\bm{S}}_f$ to $1$ with probability 50% and to $0$ otherwise. ${\bm{D}}_e$, ${\bm{D}}_f$ mark edges/features to be deleted to increase $\delta_{aug}$: for the least important $2\xi$ edges/features, we set the entries of ${\bm{D}}_e$, ${\bm{D}}_f$ to $0$ with probability 50% and to $1$ otherwise.
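The masking equation can be sketched as follows (our illustration); we read the operator combination as (M OR S) AND D, matching the described semantics that S retains important entries regardless of the random mask while D=0 force-deletes unimportant ones:

```python
import numpy as np

def compose_mask(M, S, D):
    """Combine random mask M, importance-based retain matrix S, and delete matrix D
    into the final keep-mask: entries with S=1 survive even if randomly masked,
    and entries with D=0 are deleted regardless of M."""
    M, S, D = (np.asarray(x, dtype=bool) for x in (M, S, D))
    return ((M | S) & D).astype(int)

def augment_adjacency(A, M, S, D):
    """A_tilde = A * keep_mask (Hadamard product)."""
    return A * compose_mask(M, S, D)
```

The same composition applies verbatim to the feature matrix with $({\bm{M}}_f,{\bm{S}}_f,{\bm{D}}_f)$.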

This is a simple method, and the importance measure can be replaced by any other method. It can be easily integrated into any graph contrastive learning method that performs edge/feature augmentation. Many details could be optimized, such as how to select edges/features for deletion and how many to delete; however, since this algorithm is primarily intended for theoretical verification, we simply select edges to delete at random and set the number deleted to twice the number kept.

In fact, most graph contrastive learning methods follow a similar framework as discussed in Appendix B.1.

Table 1: Quantitative results on node classification. Algorithm+I stands for the algorithm with information augmentation, and algorithm+S for the algorithm with spectrum augmentation. Error bars are shown in Figure 12.
| Group | Method | Cora | CiteSeer | PubMed | DBLP | Photo | Computers | Mean | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | Supervised GCN | 83.31±0.07 | 69.81±0.98 | 85.36±0.09 | 81.26±0.01 | 93.28±0.03 | 88.11±0.14 | 81.93 | - |
| Baseline | Supervised GAT | 83.83±0.30 | 70.31±0.65 | 84.04±0.40 | 81.92±0.03 | 93.17±0.05 | 86.82±0.09 | 81.57 | - |
| Baseline | GRACE+SpCo | 83.45±0.79 | 69.9±1.24 | OOM | 83.61±0.14 | 91.56±0.19 | 83.37±0.38 | 81.01 | - |
| Baseline | GCS | 83.39±0.54 | 68.73±1.68 | 84.92±0.19 | 83.38±0.37 | 90.15±0.24 | 86.54±0.26 | 81.97 | - |
| Unsupervised | GRACE | 82.52±0.75 | 70.44±1.49 | 84.97±0.17 | 84.01±0.34 | 91.17±0.15 | 86.36±0.32 | 83.25 | - |
| Unsupervised | GRACE+I | 83.78±1.08 | 72.89±0.97 | 84.97±0.14 | 84.80±0.17 | 91.64±0.21 | 87.57±0.53 | 84.28 | 0.155 |
| Unsupervised | GRACE+S | 83.61±0.85 | 72.83±0.47 | 85.45±0.25 | 84.83±0.18 | 91.99±0.35 | 87.67±0.33 | 84.40 | 0.003 |
| Unsupervised | GCA | 83.74±0.79 | 71.09±1.29 | 85.38±0.20 | 83.99±0.21 | 91.67±0.38 | 86.77±0.31 | 83.77 | - |
| Unsupervised | GCA+I | 84.93±0.81 | 72.74±1.05 | 85.73±0.13 | 84.79±0.28 | 91.94±0.13 | 86.60±0.29 | 84.46 | 0.089 |
| Unsupervised | GCA+S | 84.51±0.89 | 72.38±0.86 | 85.35±0.09 | 84.49±0.24 | 92.02±0.34 | 86.97±0.40 | 84.29 | 0.147 |
| Unsupervised | AD-GCL | 81.68±0.80 | 70.01±0.97 | 84.77±0.15 | 83.14±0.16 | 91.34±0.33 | 84.80±0.51 | 82.62 | - |
| Unsupervised | AD-GCL+I | 83.46±1.06 | 71.06±0.91 | 85.52±0.33 | 84.76±0.09 | 91.71±0.78 | 86.02±0.53 | 83.76 | 0.003 |
| Unsupervised | AD-GCL+S | 82.96±0.53 | 71.35±0.47 | 85.38±0.30 | 84.45±0.19 | 91.79±0.33 | 86.49±0.26 | 83.74 | 0.006 |

3.2 Graph Spectrum Perspective

In this section, we analyze the InfoNCE loss and the augmentation distance from the graph spectrum perspective, since graphs and GNNs are deeply connected to spectral theory. We start by representing both quantities using the spectrum of the adjacency matrix ${\bm{A}}$.

Theorem 3.2 (Theorem 1 of Liu et al. (2022) Restated).

Given the adjacency matrix ${\bm{A}}$ and the generated augmentations ${\bm{A}}^{\prime},{\bm{A}}^{\prime\prime}$, let $\lambda_{i}^{\prime},\lambda_{i}^{\prime\prime}$ be the $i^{th}$ eigenvalues of ${\bm{A}}^{\prime}$ and ${\bm{A}}^{\prime\prime}$, respectively. The following lower bound is established:

NCENlogN(N+1)iθiλiλi′′,\begin{split}{\mathcal{L}}_{\mathrm{NCE}}&\geq N\log N-(N+1)\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime},\\ \end{split} (3)

where $\theta_{i}$ is the adaptive weight of the $i^{th}$ term; the details of $\theta_{i}$ are discussed in Appendix C.1.

Corollary 3.3 (Spectral Representation of δaug\delta_{aug}).

If Assumption 2.1 holds, and λi,λi′′\lambda_{i}^{\prime},\lambda_{i}^{\prime\prime} are ithi^{th} eigenvalues of 𝐀{\bm{A}}^{\prime} and 𝐀′′{\bm{A}}^{\prime\prime}, respectively, then:

2δaug𝔼p(vi1,vi2)f(vi1)f(vi2)22Niθiλiλi′′.2\delta_{aug}\geq\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||\geq\sqrt{2-\frac{2}{N}\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}}. (4)

Theorem 2.6 suggests that we should make ${\mathcal{L}}_{\mathrm{NCE}}$ small while increasing $\delta_{aug}$, but the two goals are somewhat mutually exclusive. As shown in Theorem 3.2 and Corollary 3.3 (proved in Appendix A.4), when $\theta_{i}$ is positive, a small ${\mathcal{L}}_{\mathrm{NCE}}$ requires a large $|\lambda_{i}|$ while a large $\delta_{aug}$ requires a small $|\lambda_{i}|$; the same conflict arises when $\theta_{i}$ is negative. Since contrastive learning is trained to minimize ${\mathcal{L}}_{\mathrm{NCE}}$, the $\theta_{i}$ tend to increase during training, so we can assume they are positive; the detailed discussion and exact definition of $\theta_{i}$ can be found in Appendix C.1. Therefore, to achieve a better trade-off, we should decrease $|\lambda_{i}|$ while keeping the InfoNCE loss decreasing.
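The tension can be made concrete with a small numerical sketch. Treating every $\theta_{i}$ as the constant $1$ (purely an illustrative assumption; the actual $\theta_{i}$ are adaptive), stronger edge dropout shrinks the eigenvalue products $\sum_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}$, which raises the lower bound on ${\mathcal{L}}_{\mathrm{NCE}}$ in (3) at the same time as it raises the lower bound on $\delta_{aug}$ in (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
A = np.triu((rng.random((n, n)) < 0.15).astype(float), 1)
A = A + A.T                                    # symmetric Erdos-Renyi-style adjacency

def dropped_view(drop_p):
    """Eigenvalues (ascending) of a view with each edge dropped independently w.p. drop_p."""
    keep = np.triu(rng.random((n, n)) >= drop_p, 1)
    Av = A * (keep + keep.T)
    return np.linalg.eigvalsh(Av)

for p in (0.1, 0.6):
    lam1, lam2 = dropped_view(p), dropped_view(p)   # two independent views
    s = float(np.sum(lam1 * lam2))                  # sum_i lam'_i lam''_i with theta_i = 1
    print(f"drop_p={p}: sum of eigenvalue products = {s:.1f}")
```

Stronger dropout (larger `drop_p`) produces a noticeably smaller product sum, i.e. a smaller-magnitude spectrum.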

Figure 2: Augmentation distance and InfoNCE. GRACE+I stands for GRACE with information augmentation, and GRACE+S for GRACE with spectrum augmentation. GRACE+x_MI denotes the mutual information between the two views after training, and GRACE+x_$\delta_{aug}$ the augmentation distance induced by the method.
Figure 3: Accuracy on downstream tasks with different numbers of layers.

In fact, reducing $|\lambda_{i}|$ decreases the positive $\lambda_{i}$ and increases the negative $\lambda_{i}$, which smooths the graph spectrum and narrows the gaps within it. As suggested by Yang et al. (2022b), a graph convolution operation with an unsmooth spectrum produces signals correlated with the eigenvectors of larger-magnitude eigenvalues and orthogonal to the eigenvectors of smaller-magnitude ones. So with enough graph convolution operations, if $|\lambda_{i}|>|\lambda_{j}|$, the embedding $f(v)$ satisfies $\mathrm{sim}(f(v),e_{i})\gg\mathrm{sim}(f(v),e_{j})$, where $e_{i}$ denotes the eigenvector corresponding to $\lambda_{i}$, making all representations similar to $e_{i}$. Therefore, an unsmooth spectrum may lead to similar representations and result in over-smoothing. This can also be observed from Inequality (4), where a higher $|\lambda_{i}|$ draws $f(v_{i}^{1})$ and $f(v_{i}^{2})$ closer together.
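This collapse toward the dominant eigenvector can be checked numerically. The sketch below replaces a GNN layer with the plain linear update $H\leftarrow AH$ (an illustrative simplification of graph convolution, our own setup): after enough layers, every feature channel becomes almost parallel to $e_{1}$, the eigenvector of the largest-magnitude eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = np.triu((rng.random((n, n)) < 0.3).astype(float), 1)
A = A + A.T                                   # symmetric adjacency

w, V = np.linalg.eigh(A)
e1 = V[:, np.argmax(np.abs(w))]               # eigenvector of the largest-magnitude eigenvalue

H = rng.standard_normal((n, 8))               # random initial node features
for _ in range(30):                           # 30 linear "convolution" layers
    H = A @ H
    H = H / np.linalg.norm(H, axis=0)         # normalise each channel for numerical stability

cos = np.abs(e1 @ H)                          # |cosine| of each feature channel with e1
print(cos.min())                              # every channel is nearly parallel to e1
```

This is exactly power iteration, so the over-smoothing rate is governed by the ratio of the two largest eigenvalue magnitudes; a smoother spectrum (smaller gap) slows the collapse.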

We now know that smoothing the graph spectrum helps graph contrastive learning; the question is how to smooth it appropriately. We propose a simple method. As training minimizes ${\mathcal{L}}_{\mathrm{NCE}}$, the weights $\theta_{i}$ are expected to increase, so $\theta_{i}$ can serve as an indicator of whether the model is being trained correctly. While $\theta_{i}$ gradually increases, we can decrease $\lambda_{i}$ as needed. However, when $\theta_{i}$ starts to decrease, the change to the spectrum is likely too drastic and we should take a step back. The process is described as follows:

λi=λi+directioniλiα,directioni={1,cur(θi)pre(θi)ϵ1,cur(θi)pre(θi)ϵ,0,otherwise\begin{split}\lambda_{i}&=\lambda_{i}+\mathrm{direction}_{i}*\lambda_{i}*\alpha,\\ \mathrm{direction}_{i}&=\begin{cases}-1,&\text{$\mathrm{cur}(\theta_{i})-\mathrm{pre}(\theta_{i})\geq\epsilon$}\\ 1,&\text{$\mathrm{cur}(\theta_{i})-\mathrm{pre}(\theta_{i})\leq-\epsilon$},\\ 0,&\text{otherwise}\end{cases}\end{split}

where $\alpha$ is a hyperparameter that determines by how much we decrease/increase $\lambda_{i}$, and $\epsilon$ determines whether $\theta_{i}$ is increasing, decreasing, or staying steady. $\mathrm{cur}(\theta_{i})$ and $\mathrm{pre}(\theta_{i})$ represent the current and previous $\theta_{i}$.
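One step of this update rule can be sketched as below (the helper name `adjust_eigvals` and its defaults are our own, not the released code). Note that because the step is multiplicative in $\lambda_{i}$, a direction of $-1$ shrinks the magnitude $|\lambda_{i}|$ for both positive and negative eigenvalues.

```python
import numpy as np

def adjust_eigvals(lam, theta_cur, theta_pre, alpha=0.05, eps=1e-3):
    """One step of lambda_i <- lambda_i + direction_i * lambda_i * alpha,
    with direction_i = -1 / +1 / 0 depending on how theta_i changed."""
    delta = theta_cur - theta_pre
    direction = np.where(delta >= eps, -1.0,           # theta rising: shrink |lambda_i|
                np.where(delta <= -eps, 1.0, 0.0))     # theta falling: step back; else hold
    return lam + direction * lam * alpha
```

For example, when every $\theta_{i}$ increases, all eigenvalue magnitudes shrink by the factor $1-\alpha$.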

In this way, contrastive training increases $\theta_{i}$ and yields a lower ${\mathcal{L}}_{\mathrm{NCE}}$, while we adjust $\lambda_{i}$ to achieve a larger augmentation distance, which promises better generalization ability. Moreover, some spectral augmentations implicitly decrease the $|\lambda_{i}|$, as shown in Appendix B.2.

4 Experiments

In this section, we evaluate the proposed methods on six datasets: Cora, CiteSeer, PubMed, DBLP, Amazon-Photo, and Amazon-Computers. We select three graph contrastive learning methods, GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021), and AD-GCL (Suresh et al., 2021), and integrate them with our proposed methods to verify their applicability and the correctness of the theory. Details of the datasets and baselines are given in Appendix D.1, and the results are summarized in Table 1. We further investigate the positive/negative center distance in Appendix D.4, study hyperparameter sensitivity in Appendix D.5, and show how $\theta$ and the spectrum change during training in Appendix D.3.

Table 1 shows that GRACE+I (GRACE with information augmentation) and GRACE+S (GRACE with spectrum augmentation) both improve downstream performance. The improvement is significant for GRACE, which primarily performs random dropout and thus loses valuable information. For GCA, the gain is relatively weak, since GCA already drops unimportant edges/features with higher probability and can capture sufficient information. AD-GCL aggressively drops as much information as possible, including some important parts, so our methods help greatly. Overall, our methods improve the original algorithms on downstream tasks, and the p-values on the averaged performance in Table 1 also indicate that our methods are effective. We further discuss the two methods and their combination in Appendix D.7, and discuss some augmentation-free methods in Appendix C.4.

4.1 Augmentation Distance

Figure 2 shows that for all three algorithms, our augmentation methods conduct stronger augmentation while preserving similar mutual information. They thus achieve a higher augmentation distance while capturing similar information from the original view, attaining a similar contrastive loss with better generalization and, consequently, improved downstream performance.

4.2 Over-smooth

By reducing $|\lambda_{i}|$, we obtain a graph with a smoother spectrum, which alleviates over-smoothing by preventing node representations from becoming too similar to the eigenvector of the largest-magnitude eigenvalue. This enables relatively more complex models, which we verify by simply stacking more layers. As shown in Figure 3, with spectrum augmentation the model tends to outperform the original algorithm, especially with more layers, and the best performance may occur at a larger depth. This indicates that more complex models can be applied and that our method successfully relieves over-smoothing.

5 Conclusion

In this paper, we study the impact of contrastive learning on downstream tasks and show that perfect alignment does not necessarily lead to better performance. Instead, a relatively large augmentation distance is more beneficial to generalization, as it enlarges the distance between inter-class nodes. We further study how augmentation influences contrastive learning through information theory and graph spectral theory, and propose two effective methods.

Impact Statement

This work studies the theory and algorithm of Graph Contrastive Learning, which does not present any foreseeable societal consequence.

Acknowledgements

We thank the anonymous reviewers for their valuable and constructive suggestions and comments. This work is supported by the Beijing Natural Science Foundation (No.4222029); the National Natural Science Foundation of China (No.62076234); the National Key Research and Development Project (No.2022YFB2703102); the “Intelligent Social Governance Interdisciplinary Platform, Major Innovation & Planning Interdisciplinary Platform for the ‘Double-First Class’ Initiative, Renmin University of China”; the Beijing Outstanding Young Scientist Program (No.BJJWZYJH012019100020098); the Public Computing Cloud, Renmin University of China; the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (No.2021030199); the Huawei-Renmin University joint program on Information Retrieval; the Unicom Innovation Ecological Cooperation Plan; and the CCF-Huawei Populus Grove Fund.

References

  • Arora et al. (2019) Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Ash et al. (2021) Ash, J. T., Goel, S., Krishnamurthy, A., and Misra, D. Investigating the role of negatives in contrastive representation learning. arXiv preprint arXiv:2106.09943, 2021.
  • Bojchevski & Günnemann (2017) Bojchevski, A. and Günnemann, S. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815, 2017.
  • Budimir et al. (2000) Budimir, I., Dragomir, S. S., and Pecaric, J. Further reverse results for jensen’s discrete inequality and applications in information theory. RGMIA research report collection, 3(1), 2000.
  • Chen et al. (2020) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp.  1597–1607. PMLR, 2020.
  • Guo et al. (2023) Guo, X., Wang, Y., Wei, Z., and Wang, Y. Architecture matters: Uncovering implicit mechanisms in graph contrastive learning. arXiv preprint arXiv:2311.02687, 2023.
  • Hassani & Khasahmadi (2020) Hassani, K. and Khasahmadi, A. H. Contrastive multi-view representation learning on graphs. In International conference on machine learning, pp.  4116–4126. PMLR, 2020.
  • He et al. (2020) He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., and Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp.  639–648, 2020.
  • Huang et al. (2021) Huang, W., Yi, M., and Zhao, X. Towards the generalization of contrastive self-supervised learning. arXiv preprint arXiv:2111.00743, 2021.
  • Jia & Zhang (2022) Jia, B.-B. and Zhang, M.-L. Multi-dimensional classification via selective feature augmentation. Machine Intelligence Research, 19(1):38–51, 2022.
  • Jiang et al. (2019) Jiang, B., Zhang, Z., Lin, D., Tang, J., and Luo, B. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  11313–11320, 2019.
  • Khosla et al. (2020) Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
  • Li et al. (2023) Li, H., Cao, J., Zhu, J., Luo, Q., He, S., and Wang, X. Augmentation-free graph contrastive learning of invariant-discriminative representations. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • Li et al. (2022) Li, S., Wang, X., Zhang, A., Wu, Y., He, X., and Chua, T.-S. Let invariant rationale discovery inspire graph contrastive learning. In International conference on machine learning, pp.  13052–13065. PMLR, 2022.
  • Lin et al. (2022) Lin, L., Chen, J., and Wang, H. Spectral augmentation for self-supervised learning on graphs. arXiv preprint arXiv:2210.00643, 2022.
  • Liu et al. (2022) Liu, N., Wang, X., Bo, D., Shi, C., and Pei, J. Revisiting graph contrastive learning from the perspective of graph spectrum. Advances in Neural Information Processing Systems, 35:2972–2983, 2022.
  • Liu et al. (2018) Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. Advances in neural information processing systems, 31, 2018.
  • Mo et al. (2022) Mo, Y., Peng, L., Xu, J., Shi, X., and Zhu, X. Simple unsupervised graph representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  7797–7805, 2022.
  • Mumuni et al. (2024) Mumuni, A., Mumuni, F., and Gerrar, N. K. A survey of synthetic data augmentation methods in machine vision. Machine Intelligence Research, pp.  1–39, 2024.
  • Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Peng et al. (2020) Peng, Z., Huang, W., Luo, M., Zheng, Q., Rong, Y., Xu, T., and Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp.  259–270, 2020.
  • Saunshi et al. (2022) Saunshi, N., Ash, J., Goel, S., Misra, D., Zhang, C., Arora, S., Kakade, S., and Krishnamurthy, A. Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning, pp.  19250–19286. PMLR, 2022.
  • Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Shchur et al. (2018) Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • Stewart (1990) Stewart, G. W. Matrix perturbation theory. 1990.
  • Suresh et al. (2021) Suresh, S., Li, P., Hao, C., and Neville, J. Adversarial graph augmentation to improve graph contrastive learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  15920–15933. Curran Associates, Inc., 2021.
  • Tang & Liu (2023) Tang, H. and Liu, Y. Towards understanding generalization of graph neural networks. In International Conference on Machine Learning, pp.  33674–33719. PMLR, 2023.
  • Tian et al. (2020) Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning? Advances in neural information processing systems, 33:6827–6839, 2020.
  • Trivedi et al. (2022) Trivedi, P., Lubana, E. S., and Koutra, D. Understanding self-supervised graph representation learning from a data-centric perspective. 2022.
  • Veličković et al. (2018) Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
  • Wang et al. (2022a) Wang, H., Guo, X., Deng, Z.-H., and Lu, Y. Rethinking minimal sufficient representation in contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16041–16050, 2022a.
  • Wang & Isola (2020) Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp.  9929–9939. PMLR, 2020.
  • Wang et al. (2022b) Wang, Y., Zhang, Q., Wang, Y., Yang, J., and Lin, Z. Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap. arXiv preprint arXiv:2203.13457, 2022b.
  • Wei et al. (2023) Wei, C., Wang, Y., Bai, B., Ni, K., Brady, D., and Fang, L. Boosting graph contrastive learning via graph contrastive saliency. In International Conference on Machine Learning, pp.  36839–36855. PMLR, 2023.
  • Wu et al. (2019) Wu, Z., Pan, S., Long, G., Jiang, J., and Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
  • Xu et al. (2021) Xu, D., Cheng, W., Luo, D., Chen, H., and Zhang, X. Infogcl: Information-aware graph contrastive learning, 2021.
  • Yang et al. (2022a) Yang, L., Kang, L., Zhang, Q., Li, M., He, D., Wang, Z., Wang, C., Cao, X., Guo, Y., et al. Open: Orthogonal propagation with ego-network modeling. Advances in Neural Information Processing Systems, 35:9249–9261, 2022a.
  • Yang et al. (2023) Yang, L., Zhang, Q., Shi, R., Zhou, W., Niu, B., Wang, C., Cao, X., He, D., Wang, Z., and Guo, Y. Graph neural networks without propagation. In Proceedings of the ACM Web Conference 2023, pp.  469–477, 2023.
  • Yang et al. (2024) Yang, L., Shi, R., Zhang, Q., Wang, Z., Cao, X., Wang, C., et al. Self-supervised graph neural networks via low-rank decomposition. Advances in Neural Information Processing Systems, 36, 2024.
  • Yang et al. (2022b) Yang, M., Shen, Y., Li, R., Qi, H., Zhang, Q., and Yin, B. A new perspective on the effects of spectrum in graph neural networks. In International Conference on Machine Learning, pp.  25261–25279. PMLR, 2022b.
  • Yang et al. (2016) Yang, Z., Cohen, W., and Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp.  40–48. PMLR, 2016.
  • You et al. (2020) You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
  • Yuan et al. (2022) Yuan, Y., Xu, B., Shen, H., Cao, Q., Cen, K., Zheng, W., and Cheng, X. Towards generalizable graph contrastive learning: An information theory perspective. arXiv preprint arXiv:2211.10929, 2022.
  • Zhang et al. (2022) Zhang, Y., Zhu, H., Song, Z., Koniusz, P., and King, I. Costa: covariance-preserving feature augmentation for graph contrastive learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  2524–2534, 2022.
  • Zhao et al. (2022) Zhao, T., Jin, W., Liu, Y., Wang, Y., Liu, G., Günnemann, S., Shah, N., and Jiang, M. Graph data augmentation for graph machine learning: A survey. arXiv preprint arXiv:2202.08871, 2022.
  • Zhu et al. (2020) Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
  • Zhu et al. (2021) Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pp.  2069–2080, 2021.

Appendix A Theoretical Proof

A.1 Proof of Theorem 2.4

If we set $\delta_{y^{+}}^{2}=\mathbb{E}_{p(y,i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y,j})||^{2}$ and $\delta_{y^{-}}^{2}=\mathbb{E}_{p(y,y^{\prime},i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y^{\prime},j})||^{2}$, then with Assumption 2.1 and Jensen's inequality, we know that $\mathbb{E}_{p(v_{i})}||f(v_{i}^{0})-f(v_{i}^{+})||^{2}\leq\delta_{aug}^{2}$, $\mathbb{E}_{p(v_{i})}||f(v_{i}^{0})-f(v_{i}^{+})||\leq\delta_{aug}$, $\mathbb{E}_{p(y,i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y,j})||\leq\delta_{y^{+}}$, and $\mathbb{E}_{p(y,y^{\prime},i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y^{\prime},j})||\leq\delta_{y^{-}}$. Therefore, we can get the inequality below:

𝔼p(vy,i,vy,j|y)f(vy,i+)f(vy,j0)2\displaystyle\mathbb{E}_{p(v_{y,i},v_{y,j}|y)}||f(v^{+}_{y,i})-f(v^{0}_{y,j})||^{2} 𝔼p(vy,i,vy,j|y)f(vy,i+)f(vy,i0)2+𝔼p(vy,i,vy,j|y)f(vy,i0)f(vy,j0)2\displaystyle\leq\mathbb{E}_{p(v_{y,i},v_{y,j}|y)}||f(v^{+}_{y,i})-f(v^{0}_{y,i})||^{2}+\mathbb{E}_{p(v_{y,i},v_{y,j}|y)}||f(v^{0}_{y,i})-f(v^{0}_{y,j})||^{2}
\displaystyle\hskip 15.00002pt+2\mathbb{E}_{p(v_{y,i},v_{y,j}|y)}||f(v^{+}_{y,i})-f(v^{0}_{y,i})||\cdot||f(v^{0}_{y,i})-f(v^{0}_{y,j})||
δaug2+δy+2+2δaugδy+\displaystyle\leq\delta_{aug}^{2}+\delta_{y^{+}}^{2}+2\delta_{aug}\delta_{y^{+}}
=(δaug+δy+)2.\displaystyle=(\delta_{aug}+\delta_{y^{+}})^{2}.

As μy=𝔼p(vy|y)[f(vy)]=13𝔼p(vy0|y)f(vy0)+23𝔼p(vy+|y)f(vy+)\mu_{y}=\mathbb{E}_{p(v_{y}|y)}[f(v_{y})]=\frac{1}{3}\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})+\frac{2}{3}\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+}), we know that,

𝔼p(vy0|y)f(vy0)μy=𝔼p(vy0|y)f(vy0)13𝔼p(vy0|y)f(vy0)23𝔼p(vy+|y)f(vy+)=𝔼p(vy0|y)13(f(vy0)𝔼p(vy0|y)f(vy0))+23(f(vy0)𝔼p(vy+|y)f(vy+))𝔼p(vy0|y)[13(f(vy0)𝔼p(vy0|y)f(vy0))+23(f(vy0)𝔼p(vy+|y)f(vy+))]𝔼p(vy0|y)𝔼p(vy0|y)13(f(vy0)f(vy0))+𝔼p(vy0|y)𝔼p(vy+|y)23f(vy0)f(vy+)13δy++23(δaug+δy+)=δy++δaug\begin{split}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}||f(v_{y}^{0^{\prime}})-\mu_{y}||&=\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}||f(v_{y}^{0^{\prime}})-\frac{1}{3}\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})-\frac{2}{3}\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+})||\\ &=\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\left\|\frac{1}{3}\left(f(v_{y}^{0^{\prime}})-\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})\right)+\frac{2}{3}\left(f(v_{y}^{0^{\prime}})-\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+})\right)\right\|\\ &\leq\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\left[\left\|\frac{1}{3}\left(f(v_{y}^{0^{\prime}})-\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})\right)\right\|+\left\|\frac{2}{3}\left(f(v_{y}^{0^{\prime}})-\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+})\right)\right\|\right]\\ &\leq\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{0}|y)}\frac{1}{3}||(f(v_{y}^{0^{\prime}})-f(v_{y}^{0}))||+\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{+}|y)}\frac{2}{3}||f(v_{y}^{0^{\prime}})-f(v_{y}^{+})||\\ &\leq\frac{1}{3}\delta_{y^{+}}+\frac{2}{3}(\delta_{aug}+\delta_{y^{+}})\\ &=\delta_{y^{+}}+\delta_{aug}\end{split}

Similarly, we know that 𝔼p(vy0|y)f(vy0)μyδy+δaug\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}||f(v_{y}^{0^{\prime}})-\mu_{y^{-}}||\leq\delta_{y^{-}}+\delta_{aug}

Next we prove a bound on $\mathbb{E}_{p(v_{y}^{0},y^{-}|y)}f(v_{y}^{0})^{T}\mu_{y^{-}}$ for later use. As $\mu_{y}=\mathbb{E}_{p(v_{y}|y)}[f(v_{y})]=\frac{1}{3}\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})+\frac{2}{3}\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+})$, we know that,

\displaystyle\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}f(v_{y}^{0^{\prime}})^{T}\mu_{y}=\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}f(v_{y}^{0^{\prime}})^{T}(\frac{1}{3}\mathbb{E}_{p(v_{y}^{0}|y)}f(v_{y}^{0})+\frac{2}{3}\mathbb{E}_{p(v_{y}^{+}|y)}f(v_{y}^{+}))
=13𝔼p(vy0|y)𝔼p(vy0|y)f(vy0)Tf(vy0)+23𝔼p(vy0|y)𝔼p(vy+|y)f(vy0)Tf(vy+).\displaystyle=\frac{1}{3}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{0}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{0})+\frac{2}{3}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{+}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{+}).

Assume that $\mathbb{E}_{p(a,b)}||a-b||^{2}\leq c^{2}$ and $||a||=||b||=1$; then

𝔼p(a,b)(aTbT)(ab)c2\displaystyle\mathbb{E}_{p(a,b)}(a^{T}-b^{T})(a-b)\leq c^{2}
𝔼p(a,b)[aTaaTbbTa+bTb]c2\displaystyle\mathbb{E}_{p(a,b)}[a^{T}a-a^{T}b-b^{T}a+b^{T}b]\leq c^{2}
𝔼p(a,b)[22aTb]c2\displaystyle\mathbb{E}_{p(a,b)}[2-2a^{T}b]\leq c^{2}
𝔼p(a,b)aTb2c22=1c22.\displaystyle\mathbb{E}_{p(a,b)}a^{T}b\geq\frac{2-c^{2}}{2}=1-\frac{c^{2}}{2}.
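The derivation above relies only on the identity $||a-b||^{2}=2-2a^{T}b$ for unit vectors, which can be verified numerically (the vectors below are synthetic stand-ins of our own, not model embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.standard_normal((1000, 16))
b = a + 0.1 * rng.standard_normal((1000, 16))   # positive-pair-like perturbations
a /= np.linalg.norm(a, axis=1, keepdims=True)   # project both onto the unit sphere
b /= np.linalg.norm(b, axis=1, keepdims=True)

c2 = np.sum((a - b) ** 2, axis=1).mean()        # E||a - b||^2, playing the role of c^2
inner = np.sum(a * b, axis=1).mean()            # E a^T b
assert abs(inner - (1.0 - c2 / 2.0)) < 1e-9     # E a^T b = 1 - E||a-b||^2 / 2 >= 1 - c^2/2
```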

As we already know that $\mathbb{E}_{p(y,i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y,j})||^{2}\leq\delta_{y^{+}}^{2}$ and $\mathbb{E}_{p(v_{y,i},v_{y,j}|y)}||f(v^{+}_{y,i})-f(v^{0}_{y,j})||^{2}\leq(\delta_{aug}+\delta_{y^{+}})^{2}$, we have $\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{0}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{0})\geq 1-\frac{\delta_{y^{+}}^{2}}{2}$ and $\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{+}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{+})\geq 1-\frac{(\delta_{aug}+\delta_{y^{+}})^{2}}{2}$.

Then, we can calculate 𝔼p(vy0|y)f(vy0)Tμy\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}f(v_{y}^{0^{\prime}})^{T}\mu_{y} as below:

𝔼p(vy0|y)f(vy0)Tμy=13𝔼p(vy0|y)𝔼p(vy0|y)f(vy0)Tf(vy0)+23𝔼p(vy0|y)𝔼p(vy+|y)f(vy0)Tf(vy+)1δaug232δaugδy+3δy+22.\begin{split}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}f(v_{y}^{0^{\prime}})^{T}\mu_{y}&=\frac{1}{3}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{0}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{0})+\frac{2}{3}\mathbb{E}_{p(v_{y}^{0^{\prime}}|y)}\mathbb{E}_{p(v_{y}^{+}|y)}f(v^{0^{\prime}}_{y})^{T}f(v_{y}^{+})\\ &\geq 1-\frac{\delta_{aug}^{2}}{3}-\frac{2\delta_{aug}\delta_{y^{+}}}{3}-\frac{\delta_{y^{+}}^{2}}{2}.\end{split} (5)

Similarly, we know that 𝔼p(vy0,y|y)f(vy0)Tμy1δaug232δaugδy3δy22\mathbb{E}_{p(v_{y}^{0},y^{-}|y)}f(v_{y}^{0})^{T}\mu_{y^{-}}\geq 1-\frac{\delta_{aug}^{2}}{3}-\frac{2\delta_{aug}\delta_{y^{-}}}{3}-\frac{\delta_{y^{-}}^{2}}{2}.

A.2 Proof of Theorem 2.6

\hat{{\mathcal{L}}}_{\mathrm{CE}}=\underbrace{-\mathbb{E}_{p(v_{i}^{0},y)}f(v_{i}^{0})^{T}\mu_{y}}_{\Lambda_{1}}+\underbrace{\mathbb{E}_{p(v_{i}^{0})}\log\sum_{j=1}^{K}\exp(f(v_{i}^{0})^{T}\mu_{j})}_{\Lambda_{2}}.
Λ1\displaystyle\Lambda_{1} =𝔼p(vi0,y)f(vi0)Tμy\displaystyle=-\mathbb{E}_{p(v_{i}^{0},y)}f(v_{i}^{0})^{T}\mu_{y}
=𝔼p(vi0,y)[f(vi0)Tf(vi+)+f(vi0)T(μyf(vi+))]\displaystyle=-\mathbb{E}_{p(v_{i}^{0},y)}\left[f(v_{i}^{0})^{T}f(v_{i}^{+})+f(v_{i}^{0})^{T}(\mu_{y}-f(v_{i}^{+}))\right]
(a)𝔼p(vi0,vi+,y)f(vi0)Tf(vi+)𝔼p(vi+,y)f(vi+)μy\displaystyle\overset{(a)}{\geq}-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}f(v_{i}^{0})^{T}f(v_{i}^{+})-\mathbb{E}_{p(v_{i}^{+},y)}||f(v_{i}^{+})-\mu_{y}||
𝔼p(vi0,vi+,y)f(vi0)Tf(vi+)𝔼p(vi0,vi+,y)f(vy+)f(vy0)𝔼p(vi0,vi+,y)f(vy0)μy\displaystyle\geq-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}f(v_{i}^{0})^{T}f(v_{i}^{+})-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}||f(v_{y}^{+})-f(v_{y}^{0})||-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}||f(v^{0}_{y})-\mu_{y}||
(b)𝔼p(vi0,vi+,y)f(vi1)Tf(vi2)3δaug2δaug𝔼p(vi0,vi+,y)f(vy0)μy.\displaystyle\overset{(b)}{\geq}-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}f(v_{i}^{1})^{T}f(v_{i}^{2})-3\delta_{aug}^{2}-\delta_{aug}-\mathbb{E}_{p(v_{i}^{0},v_{i}^{+},y)}||f(v^{0}_{y})-\mu_{y}||.

(a) f(vi0)T(μyf(vi+))(μyf(vi+)μyf(vi+))T(μyf(vi+))=μyf(vi+)f(v_{i}^{0})^{T}(\mu_{y}-f(v_{i}^{+}))\leq(\frac{\mu_{y}-f(v_{i}^{+})}{||\mu_{y}-f(v_{i}^{+})||})^{T}(\mu_{y}-f(v_{i}^{+}))=||\mu_{y}-f(v_{i}^{+})||.

(b) 𝔼p(vi0,vi+)f(vi0)f(vi+)2δaug2\mathbb{E}_{p(v_{i}^{0},v_{i}^{+})}||f(v_{i}^{0})-f(v_{i}^{+})||^{2}\leq\delta_{aug}^{2}, then:

δaug2\displaystyle\delta_{aug}^{2} 𝔼p(vi0,vi1)(f(vi0)f(vi1))T(f(vi0)f(vi1))\displaystyle\geq\mathbb{E}_{p(v_{i}^{0},v_{i}^{1})}(f(v_{i}^{0})-f(v_{i}^{1}))^{T}\cdot(f(v_{i}^{0})-f(v_{i}^{1}))
=𝔼p(vi0,vi1,vi2)(f(vi0)f(vi1))T(f(vi0)f(vi1)+f(vi2)f(vi2))\displaystyle=\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}(f(v_{i}^{0})-f(v_{i}^{1}))^{T}\cdot(f(v_{i}^{0})-f(v_{i}^{1})+f(v_{i}^{2})-f(v_{i}^{2}))
=𝔼p(vi0,vi1,vi2)f(vi0)Tf(vi0)f(vi0)Tf(vi1)+f(vi0)Tf(vi2)f(vi0)Tf(vi2)\displaystyle=\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{0})^{T}f(v_{i}^{0})-f(v_{i}^{0})^{T}f(v_{i}^{1})+f(v_{i}^{0})^{T}f(v_{i}^{2})-f(v_{i}^{0})^{T}f(v_{i}^{2})
f(vi1)Tf(vi0)+f(vi1)Tf(vi1)f(vi1)Tf(vi2)+f(vi1)Tf(vi2)\displaystyle\hskip 50.00008pt-f(v_{i}^{1})^{T}f(v_{i}^{0})+f(v_{i}^{1})^{T}f(v_{i}^{1})-f(v_{i}^{1})^{T}f(v_{i}^{2})+f(v_{i}^{1})^{T}f(v_{i}^{2})
=2+𝔼p(vi0,vi1,vi2)[2f(vi0)Tf(vi1)+f(vi0)Tf(vi2)f(vi0)Tf(vi2)f(vi1)Tf(vi2)+f(vi1)Tf(vi2)]\displaystyle=2+\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}\left[-2f(v_{i}^{0})^{T}f(v_{i}^{1})+f(v_{i}^{0})^{T}f(v_{i}^{2})-f(v_{i}^{0})^{T}f(v_{i}^{2})-f(v_{i}^{1})^{T}f(v_{i}^{2})+f(v_{i}^{1})^{T}f(v_{i}^{2})\right]
(c)22+𝔼p(vi0,vi1,vi2)[f(vi0)Tf(vi2)1f(vi1)Tf(vi2)+12δaug2]\displaystyle\overset{(c)}{\geq}2-2+\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}\left[f(v_{i}^{0})^{T}f(v_{i}^{2})-1-f(v_{i}^{1})^{T}f(v_{i}^{2})+1-2\delta_{aug}^{2}\right]
=𝔼p(vi0,vi1,vi2)[f(vi0)Tf(vi2)f(vi1)Tf(vi2)]2δaug2.\displaystyle=\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}\left[f(v_{i}^{0})^{T}f(v_{i}^{2})-f(v_{i}^{1})^{T}f(v_{i}^{2})\right]-2\delta_{aug}^{2}.

So, we can relate $\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{0})^{T}f(v_{i}^{2})$ to $\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})$ as below:

δaug2𝔼p(vi0,vi1,vi2)f(vi0)Tf(vi2)𝔼p(vi0,vi1,vi2)f(vi1)Tf(vi2)2δaug2,\displaystyle\delta_{aug}^{2}\geq\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{0})^{T}f(v_{i}^{2})-\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})-2\delta_{aug}^{2},
𝔼p(vi0,vi1,vi2)f(vi0)Tf(vi2)𝔼p(vi0,vi1,vi2)f(vi1)Tf(vi2)+3δaug2.\displaystyle\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{0})^{T}f(v_{i}^{2})\leq\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})+3\delta_{aug}^{2}.

As vi2v_{i}^{2} is an augmented node, we can get that,

𝔼p(vi0,vi+)f(vi0)Tf(vi+)𝔼p(vi0,vi1,vi2)f(vi1)Tf(vi2)+3δaug2.\mathbb{E}_{p(v_{i}^{0},v_{i}^{+})}f(v_{i}^{0})^{T}f(v_{i}^{+})\leq\mathbb{E}_{p(v_{i}^{0},v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})+3\delta_{aug}^{2}.

(c) f(vi0)Tf(vi1)1f(v_{i}^{0})^{T}f(v_{i}^{1})\leq 1, f(vi0)Tf(vi2)1f(v_{i}^{0})^{T}f(v_{i}^{2})\leq 1, and 𝔼p(vi1,vi2)f(vi1)Tf(vi2)2𝔼p(vi1,vi2)f(vi1)f(vi2)221𝔼p(vi1,vi2)(f(vi1)f(vi0)+f(vi0)f(vi2))2212δaug2\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})\geq\frac{2-\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||^{2}}{2}\geq 1-\frac{\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}(||f(v_{i}^{1})-f(v_{i}^{0})||+||f(v_{i}^{0})-f(v_{i}^{2})||)^{2}}{2}\geq 1-2\delta_{aug}^{2}.

Lemma A.1 ((Budimir et al., 2000) Corollary 3.5 (restated)).

Let g:mg:\mathbb{R}^{m}\rightarrow\mathbb{R} be a differentiable convex mapping and znz\in\mathbb{R}^{n}. Suppose that gg is LL- smooth with the constant L>0L>0, i.e. x,ym,g(x)g(y)Lxy\forall x,y\in\mathcal{R}^{m},\|\nabla g(x)-\nabla g(y)\|\leq L\|x-y\|. Then we have

0𝔼p(z)g(z)g(𝔼p(z)z)L[𝔼p(z)z2𝔼p(z)z2]=Lj=1nVar(z(j)),\displaystyle 0\leq\mathbb{E}_{p(z)}g(z)-g\left(\mathbb{E}_{p(z)}z\right)\leq L\left[\mathbb{E}_{p(z)}\lVert z\rVert^{2}-\|\mathbb{E}_{p(z)}z\|^{2}\right]=L\sum_{j=1}^{n}\operatorname*{Var}(z^{(j)}),

where z(j)z^{(j)} denotes the jj-th dimension of zz.

Lemma A.2 ((Wang et al., 2022b) Lemma A.2. restated).

For LSE:=log𝔼p(z)exp(f(v)g(z)){\rm LSE}:=\log\mathbb{E}_{p(z)}\exp(f(v)^{\top}g(z)), we denote its (biased) Monte Carlo estimate with MM random samples zip(z),i=1,,Mz_{i}\sim p(z),i=1,\dots,M as LSE^M=log1Mi=1Mexp(f(v)g(zi))\widehat{\rm LSE}_{M}=\log\frac{1}{M}\sum_{i=1}^{M}\exp(f(v)^{\top}g(z_{i})). Then the approximation error A(M)A(M) can be upper bounded in expectation as

A(M):=𝔼p(v,zi)|LSE^(M)LSE|𝒪(M1/2).A(M):=\mathbb{E}_{p(v,z_{i})}|\widehat{\rm LSE}(M)-{\rm LSE}|\leq\mathcal{O}(M^{-1/2}).

We can see that the approximation error converges to zero in the order of M1/2M^{-1/2}.
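This convergence can be checked empirically. Below is a small sketch where we take g(z)g(z) to be random unit vectors (our own choice; `lse_mc_error` is a hypothetical helper name) and measure the mean absolute error of the MM-sample estimate against a large-sample proxy for the true LSE:

```python
import numpy as np

def lse_mc_error(M, trials=500, dim=8, seed=0):
    """Mean absolute error of the M-sample Monte Carlo estimate of LSE."""
    rng = np.random.default_rng(seed)
    f = rng.normal(size=dim); f /= np.linalg.norm(f)
    # Large-sample proxy for the exact LSE = log E_z exp(f^T g(z)).
    z = rng.normal(size=(100_000, dim))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    lse_true = np.log(np.mean(np.exp(z @ f)))
    errs = []
    for _ in range(trials):
        zi = rng.normal(size=(M, dim))
        zi /= np.linalg.norm(zi, axis=1, keepdims=True)
        errs.append(abs(np.log(np.mean(np.exp(zi @ f))) - lse_true))
    return float(np.mean(errs))
```

Quadrupling MM should roughly halve the error, consistent with the 𝒪(M1/2)\mathcal{O}(M^{-1/2}) rate.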

Λ2\displaystyle\Lambda_{2} =𝔼p(vi0)logj=1Kexp(f(vi0)Tμyj)\displaystyle=\mathbb{E}_{p(v_{i}^{0})}\log\sum_{j=1}^{K}\exp(f(v_{i}^{0})^{T}\mu_{y_{j}})
=𝔼p(vi0)log1Ki=jKexp(f(vi0)Tμyj)+logK\displaystyle=\mathbb{E}_{p(v_{i}^{0})}\log\frac{1}{K}\sum_{i=j}^{K}\exp(f(v_{i}^{0})^{T}\mu_{y_{j}})+\log K
=𝔼p(vi0)log𝔼p(yj)exp(f(vi0)Tμyj)+logK\displaystyle=\mathbb{E}_{p(v_{i}^{0})}\log\mathbb{E}_{p(y_{j})}\exp(f(v_{i}^{0})^{T}\mu_{y_{j}})+\log K
(d)𝔼p(vi1)log𝔼p(yj)exp(f(vi1)Tμyj)δaugej=1nVar(μj)+logK\displaystyle\overset{(d)}{\geq}\mathbb{E}_{p(v_{i}^{1})}\log\mathbb{E}_{p(y_{j})}\exp(f(v_{i}^{1})^{T}\mu_{y_{j}})-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})+\log K
(e)𝔼p(vi1)𝔼p(yi)log1Mj=1Mexp(f(vi1)Tμyj)A(M)+logKδaugej=1nVar(μj)\displaystyle\overset{(e)}{\geq}\mathbb{E}_{p(v_{i}^{1})}\mathbb{E}_{p(y_{i})}\log\frac{1}{M}\sum_{j=1}^{M}\exp(f(v_{i}^{1})^{T}\mu_{y_{j}})-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
=𝔼p(vi1)𝔼p(yi)log1Mj=1Mexp(𝔼p(vi|yi)f(vi1)Tf(vi))A(M)+logKδaugej=1nVar(μj)\displaystyle=\mathbb{E}_{p(v_{i}^{1})}\mathbb{E}_{p(y_{i})}\log\frac{1}{M}\sum_{j=1}^{M}\exp(\mathbb{E}_{p(v_{i}^{-}|y_{i}^{-})}f(v_{i}^{1})^{T}f(v_{i}^{-}))-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
(f)𝔼p(vi1)𝔼p(yi)𝔼p(vi|yi)log1Mi=1Mexp(f(vi1)Tf(vi))\displaystyle\overset{(f)}{\geq}\mathbb{E}_{p(v_{i}^{1})}\mathbb{E}_{p(y_{i})}\mathbb{E}_{p(v_{i}^{-}|y_{i}^{-})}\log\frac{1}{M}\sum_{i=1}^{M}\exp(f(v_{i}^{1})^{T}f(v_{i}^{-}))
12j=1mVar(fj(v|y))A(M)+logKδaugej=1nVar(μj)\displaystyle\hskip 15.00002pt-\frac{1}{2}\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
=𝔼p(vi1)𝔼p(yi)𝔼p(vi|yi)logi=1Mexp(f(vi1)Tf(vi))\displaystyle=\mathbb{E}_{p(v_{i}^{1})}\mathbb{E}_{p(y_{i})}\mathbb{E}_{p(v_{i}^{-}|y_{i}^{-})}\log\sum_{i=1}^{M}\exp(f(v_{i}^{1})^{T}f(v_{i}^{-}))
logM12j=1mVar(fj(v|y))A(M)+logKδaugej=1nVar(μj).\displaystyle\hskip 15.00002pt-\log M-\frac{1}{2}\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j}).

(d) We can show that exp(f(v)Tμyj)\exp(f(v)^{T}\mu_{y_{j}}) is convex in μyj\mu_{y_{j}}, and that it is ee-smooth:

exp(f(v)Ta)aexp(f(v)Tb)b\displaystyle\hskip 15.00002pt||\frac{\partial\exp(f(v)^{T}a)}{\partial a}-\frac{\partial\exp(f(v)^{T}b)}{\partial b}||
=||exp(f(v)Ta)f(v)exp(f(v)Tb)f(v))||\displaystyle=||\exp(f(v)^{T}a)f(v)-\exp(f(v)^{T}b)f(v))||
=|exp(f(v)Ta)exp(f(v)Tb)|f(v)\displaystyle=|\exp(f(v)^{T}a)-\exp(f(v)^{T}b)|\cdot||f(v)||
|exp(f(v)Ta)exp(f(v)Tb)|\displaystyle\leq|\exp(f(v)^{T}a)-\exp(f(v)^{T}b)|
e||(f(v)T)(ab)||(f(v)Ta,f(v)Tb1,so the biggest slope ise)\displaystyle\leq e||(f(v)^{T})(a-b)||\quad(f(v)^{T}a,f(v)^{T}b\leq 1,\text{so the biggest slope is}\ e)
eab.\displaystyle\leq e||a-b||.

So according to Lemma A.1, we get,

𝔼p(yj)exp([f(vi1)Tμyj])exp([f(vi1)T𝔼p(yj)μyj])+ej=1nVar(μj)=exp(f(vi1)Tμ)+ej=1nVar(μj).\begin{split}\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{1})^{T}\mu_{y_{j}}\right])&\leq\exp(\left[f(v_{i}^{1})^{T}\mathbb{E}_{p(y_{j})}\mu_{y_{j}}\right])+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})\\ &=\exp(f(v_{i}^{1})^{T}\mu)+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j}).\end{split}

Then, we can calculate the difference between log𝔼p(yj)exp([f(vi0)Tμyj])\log\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{0})^{T}\mu_{y_{j}}\right]) and log𝔼p(yj)exp([f(vi1)Tμyj])\log\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{1})^{T}\mu_{y_{j}}\right]) by applying reversed Jensen and Jensen inequality, respectively.

log𝔼p(yj)exp([f(vi1)Tμyj])log𝔼p(yj)exp([f(vi0)Tμyj])\displaystyle\hskip 15.00002pt\log\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{1})^{T}\mu_{y_{j}}\right])-\log\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{0})^{T}\mu_{y_{j}}\right])
log𝔼p(yj)exp([f(vi1)Tμyj])[f(vi0)Tμ]\displaystyle\leq\log\mathbb{E}_{p(y_{j})}\exp(\left[f(v_{i}^{1})^{T}\mu_{y_{j}}\right])-\left[f(v_{i}^{0})^{T}\mu\right]
log[exp(f(vi1)Tμ)+ej=1nVar(μj)][f(vi0)Tμ]\displaystyle\leq\log\left[\exp(f(v_{i}^{1})^{T}\mu)+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})\right]-\left[f(v_{i}^{0})^{T}\mu\right]
=log[exp(f(vi1)Tμ)]+log[1+ej=1nVar(μj)exp(f(vi1)Tμ)][f(vi0)Tμ]\displaystyle=\log\left[\exp(f(v_{i}^{1})^{T}\mu)\right]+\log\left[1+\frac{e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})}{\exp(f(v_{i}^{1})^{T}\mu)}\right]-\left[f(v_{i}^{0})^{T}\mu\right]
f(vi1)Tμf(vi0)Tμ+log[1+ej=1nVar(μj)](e2j=1nVar(μj), if not ReLU)\displaystyle\leq f(v_{i}^{1})^{T}\mu-f(v_{i}^{0})^{T}\mu+\log\left[1+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})\right]\quad(e^{2}\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})\text{, if not ReLU)}
(f(vi1)Tf(vi0)T)μ+ej=1nVar(μj)\displaystyle\leq(f(v_{i}^{1})^{T}-f(v_{i}^{0})^{T})\mu+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
(f(vi1)f(vi0))Tμf(vi1)f(vi0)(f(vi1)f(vi0))+ej=1nVar(μj)\displaystyle\leq(f(v_{i}^{1})-f(v_{i}^{0}))^{T}\frac{||\mu||}{||f(v_{i}^{1})-f(v_{i}^{0})||}(f(v_{i}^{1})-f(v_{i}^{0}))+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
(f(vi1)f(vi0))T1f(vi1)f(vi0)(f(vi1)f(vi0))+ej=1nVar(μj)\displaystyle\leq(f(v_{i}^{1})-f(v_{i}^{0}))^{T}\frac{1}{||f(v_{i}^{1})-f(v_{i}^{0})||}(f(v_{i}^{1})-f(v_{i}^{0}))+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
δaug+ej=1nVar(μj).\displaystyle\leq\delta_{aug}+e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j}).

(e) We adopt a Monte Carlo estimation with MM samples from p(y)p(y) and bound the approximation error with Lemma A.2.

(f) We again use Lemma A.1; as proved by Wang et al. (2022b), logsumexp is LL-smooth with L=12L=\frac{1}{2}.

CE\displaystyle{\mathcal{L}}_{\mathrm{CE}} =Λ1+Λ2\displaystyle=\Lambda_{1}+\Lambda_{2}
𝔼p(v,y)f(vi1)Tf(vi2)3δaug2δaug𝔼p(v0,y)f(vy0)μy\displaystyle\geq-\mathbb{E}_{p(v,y)}f(v_{i}^{1})^{T}f(v_{i}^{2})-3\delta_{aug}^{2}-\delta_{aug}-\mathbb{E}_{p(v^{0},y)}||f(v_{y}^{0})-\mu_{y}||
+𝔼p(vi1)𝔼p(yi)𝔼p(vi|yi)logi=1Mexp(f(vi1)Tf(vi))\displaystyle\hskip 15.00002pt+\mathbb{E}_{p(v_{i}^{1})}\mathbb{E}_{p(y_{i})}\mathbb{E}_{p(v_{i}^{-}|y_{i})}\log\sum_{i=1}^{M}\exp(f(v_{i}^{1})^{T}f(v_{i}^{-}))
logM12j=1mVar(fj(v|y))A(M)+logKδaugej=1nVar(μj)\displaystyle\hskip 15.00002pt-\log M-\frac{1}{2}\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
=[𝔼p(vi1,vi2)f(vi1)Tf(vi2)+𝔼p(vi)logi=1Mexp(f(vi1)f(vi))]3δaug2δaug𝔼p(v0,y)f(vy0)μy\displaystyle=\left[-\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}f(v_{i}^{1})^{T}f(v_{i}^{2})+\mathbb{E}_{p(v_{i}^{-})}\log\sum_{i=1}^{M}\exp(f(v_{i}^{1})f(v_{i}^{-}))\right]-3\delta_{aug}^{2}-\delta_{aug}-\mathbb{E}_{p(v^{0},y)}||f(v_{y}^{0})-\mu_{y}||
logM12j=1mVar(fj(v|y))A(M)+logKδaugej=1nVar(μj)\displaystyle\hskip 15.00002pt-\log M-\frac{1}{2}\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))-A(M)+\log K-\delta_{aug}-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})
=NCE3δaug22δauglogMK12j=1mVar(fj(v|y))A(M)ej=1nVar(μj)𝔼p(v0,y)f(vy0)μy\displaystyle={\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^{2}-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))-A(M)-e\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})-\mathbb{E}_{p(v^{0},y)}||f(v_{y}^{0})-\mu_{y}||
(g)NCE3δaug22δauglogMK12Var(f(v+)|y)Var(f(v0)|y)O(M12)eVar(μy).\displaystyle\overset{(g)}{\geq}{\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^{2}-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}\operatorname*{Var}(f(v^{+})|y)-\sqrt{\operatorname*{Var}(f(v^{0})|y)}-O(M^{-\frac{1}{2}})-e\operatorname*{Var}(\mu_{y}).

(g) This holds because vv^{-} is randomly selected from v+v^{+}, and

j=1mVar(fj(v|y))\displaystyle\hskip 15.00002pt\sum_{j=1}^{m}\operatorname*{Var}(f_{j}(v^{-}|y))
=j=1m𝔼p(y)𝔼p(v|y)(fj(v+)𝔼p(v|y)fj(v))2\displaystyle=\sum_{j=1}^{m}\mathbb{E}_{p(y)}\mathbb{E}_{p(v|y)}(f_{j}(v^{+})-\mathbb{E}_{p(v^{\prime}|y)}f_{j}(v^{\prime}))^{2}
=𝔼p(y)𝔼p(v|y)j=1m(fj(v+)𝔼vfj(v))2\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v|y)}\sum_{j=1}^{m}(f_{j}(v^{+})-\mathbb{E}_{v^{\prime}}f_{j}(v^{\prime}))^{2}
=𝔼p(y)𝔼p(v|y)f(v)𝔼vf(v)2\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v|y)}||f(v)-\mathbb{E}_{v^{\prime}}f(v^{\prime})||^{2}
=Var(f(v+)|y).\displaystyle=\operatorname*{Var}(f(v^{+})|y).

And similarly, we can get j=1nVar(μj)=Var(μy)\sum_{j=1}^{n}\operatorname*{Var}(\mu_{j})=\operatorname*{Var}(\mu_{y}). So the lower bound is proved.
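The identity between the summed per-dimension variances and the expected squared deviation from the mean embedding can be checked numerically (a minimal sketch; the function name is ours):

```python
import numpy as np

def variance_identity_gap(n=4000, m=6, seed=0):
    """|sum_j Var(f_j(v)) - E||f(v) - E f(v)||^2| for sampled embeddings."""
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(n, m))                       # rows: f(v) within one class
    per_dim = F.var(axis=0).sum()                     # sum of per-dimension variances
    vector = np.mean(np.sum((F - F.mean(axis=0)) ** 2, axis=1))  # E||f - Ef||^2
    return float(abs(per_dim - vector))
```

The gap is zero up to floating-point error, since both sides expand the same sum over dimensions.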

A.3 Proof of Corollary 3.1

Var(f(vy0|y))\operatorname*{Var}(f(v_{y}^{0}|y)) can be expressed in terms of the augmentation distance and intrinsic properties of the model and data.

Var(f(vy0|y))\displaystyle\operatorname*{Var}(f(v_{y}^{0}|y)) =𝔼p(y)𝔼p(vy0|y)f(vy0)μy2\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}||f(v_{y}^{0})-\mu_{y}||^{2}
=𝔼p(y)𝔼p(vy0|y)[(f(vy0)μy)T(f(vy0)μy))]\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}\left[(f(v_{y}^{0})-\mu_{y})^{T}(f(v_{y}^{0})-\mu_{y}))\right]
=𝔼p(y)𝔼p(vy0|y)[f(vy0)Tf(vy0)+μyTμy2f(vy0)Tμy]\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}\left[f(v_{y}^{0})^{T}f(v_{y}^{0})+\mu_{y}^{T}\mu_{y}-2f(v_{y}^{0})^{T}\mu_{y}\right]
𝔼p(y)𝔼p(vy0|y)[22f(vy0)Tμy]\displaystyle\leq\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}\left[2-2f(v_{y}^{0})^{T}\mu_{y}\right]
(h)𝔼p(y)𝔼p(vy0|y)[22(113δaug223δaugδy+12δy+2)]\displaystyle\overset{(h)}{\leq}\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}\left[2-2(1-\frac{1}{3}\delta_{aug}^{2}-\frac{2}{3}\delta_{aug}\delta_{y^{+}}-\frac{1}{2}\delta_{y^{+}}^{2})\right]
=𝔼p(y)𝔼p(vy0|y)[23δaug2+43δaugδy++δy+2)]\displaystyle=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{0}|y)}\left[\frac{2}{3}\delta_{aug}^{2}+\frac{4}{3}\delta_{aug}\delta_{y^{+}}+\delta_{y^{+}}^{2})\right]
23δaug2+43δaugLϵ0+L2ϵ02,\displaystyle\leq\frac{2}{3}\delta_{aug}^{2}+\frac{4}{3}\delta_{aug}L\epsilon_{0}+L^{2}\epsilon_{0}^{2},

where ϵ0=𝔼p(y)𝔼p(vi0,vj0|y)vi0vj0\epsilon_{0}=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{i}^{0},v_{j}^{0}|y)}||v_{i}^{0}-v_{j}^{0}|| and LL is the Lipschitz constant, so δy+2=𝔼p(y,i,j)f(vy,i0)f(vy,j0)2(Lϵ0)2\delta_{y^{+}}^{2}=\mathbb{E}_{p(y,i,j)}||f(v^{0}_{y,i})-f(v^{0}_{y,j})||^{2}\leq(L\epsilon_{0})^{2}.

Then we can easily get that,

\begin{split}\operatorname*{Var}(f(v_{y}^{+})|y)&=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{+}|y)}||f(v_{y}^{+})-\mu_{y}||^{2}\\ &\leq\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{+}|y)}(||f(v_{y}^{+})-f(v_{y}^{0})||+||f(v_{y}^{0})-\mu_{y}||)^{2}\\ &=\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{+}|y)}||f(v_{y}^{+})-f(v_{y}^{0})||^{2}+\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{+}|y)}||f(v_{y}^{0})-\mu_{y}||^{2}\\ &\hskip 15.00002pt+2\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}^{+}|y)}||f(v_{y}^{+})-f(v_{y}^{0})||\cdot||f(v_{y}^{0})-\mu_{y}||\\ &\leq\delta_{aug}^{2}+\operatorname*{Var}(f(v_{y}^{0})|y)+2\delta_{aug}\sqrt{\operatorname*{Var}(f(v_{y}^{0})|y)}\\ &=(\delta_{aug}+\sqrt{\operatorname*{Var}(f(v_{y}^{0})|y)})^{2}.\end{split}

(h) We use Theorem 2.4.

And Var(μy)\operatorname*{Var}(\mu_{y}) can also be expressed by intrinsic properties.

Var(μy)\displaystyle\operatorname*{Var}(\mu_{y}) =𝔼p(y)μyμ2\displaystyle=\mathbb{E}_{p(y)}||\mu_{y}-\mu||^{2}
=𝔼p(y)μyf(vy)+f(vy)μ2\displaystyle=\mathbb{E}_{p(y)}||\mu_{y}-f(v_{y}^{*})+f(v_{y}^{*})-\mu||^{2}
𝔼p(y)(μyf(vy)+f(vy)μ)2\displaystyle\leq\mathbb{E}_{p(y)}(||\mu_{y}-f(v_{y}^{*})||+||f(v_{y}^{*})-\mu||)^{2}
=𝔼p(y)𝔼p(vy|y)f(vy)f(vy)2+𝔼p(y)f(vy)𝔼p(v)f(v)2\displaystyle=\mathbb{E}_{p(y)}||\mathbb{E}_{p(v_{y}|y)}f(v_{y})-f(v_{y}^{*})||^{2}+\mathbb{E}_{p(y)}||f(v_{y}^{*})-\mathbb{E}_{p(v)}f(v)||^{2}
+2𝔼p(y)(𝔼p(vy|y)f(vy)f(vy)f(vy)𝔼p(v)f(v))\displaystyle\hskip 15.00002pt+2\mathbb{E}_{p(y)}(||\mathbb{E}_{p(v_{y}|y)}f(v_{y})-f(v_{y}^{*})||\cdot||f(v_{y}^{*})-\mathbb{E}_{p(v)}f(v)||)
=𝔼p(y)𝔼p(vy|y)[f(vy)f(vy)]2+𝔼p(y)𝔼p(v)[f(vy)f(v)]2\displaystyle=\mathbb{E}_{p(y)}||\mathbb{E}_{p(v_{y}|y)}[f(v_{y})-f(v_{y}^{*})]||^{2}+\mathbb{E}_{p(y)}||\mathbb{E}_{p(v)}[f(v_{y}^{*})-f(v)]||^{2}
+2𝔼p(y)(𝔼p(vy|y)[f(vy)f(vy)]𝔼p(v)[f(vy)f(v)])\displaystyle\hskip 15.00002pt+2\mathbb{E}_{p(y)}(||\mathbb{E}_{p(v_{y}|y)}[f(v_{y})-f(v_{y}^{*})]||\cdot||\mathbb{E}_{p(v)}[f(v_{y}^{*})-f(v)]||)
𝔼p(y)𝔼p(vy|y)f(vy)f(vy)2+𝔼p(y)𝔼p(v)f(vy)f(v)2\displaystyle\leq\mathbb{E}_{p(y)}\mathbb{E}_{p(v_{y}|y)}||f(v_{y})-f(v_{y}^{*})||^{2}+\mathbb{E}_{p(y)}\mathbb{E}_{p(v)}||f(v_{y}^{*})-f(v)||^{2}
+2𝔼p(y)(𝔼p(vy|y)f(vy)f(vy)f(vy)f(v))\displaystyle\hskip 15.00002pt+2\mathbb{E}_{p(y)}(\mathbb{E}_{p(v_{y}|y)}||f(v_{y})-f(v_{y}^{*})||\cdot||f(v_{y}^{*})-f(v)||)
L2ϵ12+L2ϵ22+2L2ϵ1ϵ2\displaystyle\leq L^{2}\epsilon_{1}^{2}+L^{2}\epsilon_{2}^{2}+2L^{2}\epsilon_{1}\epsilon_{2}
=L2(ϵ1+ϵ2)2,\displaystyle=L^{2}(\epsilon_{1}+\epsilon_{2})^{2},

where vyv_{y}^{*} could be any node of class yy, and ϵ1=𝔼p(v,y)vyvy\epsilon_{1}=\mathbb{E}_{p(v,y)}||v_{y}-v_{y}^{*}||, ϵ2=𝔼p(y)𝔼p(v)vvy\epsilon_{2}=\mathbb{E}_{p(y)}\mathbb{E}_{p(v)}||v-v_{y}^{*}||.

^CE\displaystyle\hat{{\mathcal{L}}}_{\mathrm{CE}} NCE3δaug22δauglogMK12Var(f(v)|y)Var(f(v0)|y)O(M12)eVar(μy)\displaystyle\geq{\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^{2}-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}\operatorname*{Var}(f(v^{-})|y)-\sqrt{\operatorname*{Var}(f(v^{0})|y)}-O(M^{-\frac{1}{2}})-e\operatorname*{Var}(\mu_{y})
NCE3δaug22δauglogMK12(δaug+Var(f(vy0)|y))2Var(f(vy0)|y)O(M12)eL2(ϵ1+ϵ2)2\displaystyle\geq{\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^{2}-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}(\delta_{aug}+\sqrt{\operatorname*{Var}(f(v_{y}^{0})|y)})^{2}-\sqrt{\operatorname*{Var}(f(v_{y}^{0})|y)}-O(M^{-\frac{1}{2}})-eL^{2}(\epsilon_{1}+\epsilon_{2})^{2}
=NCE3δaug22δauglogMK12δaug2(δaug+1)Var(f(vy0)|y)\displaystyle={\mathcal{L}}_{\mathrm{NCE}}-3\delta_{aug}^{2}-2\delta_{aug}-\log\frac{M}{K}-\frac{1}{2}\delta_{aug}^{2}-(\delta_{aug}+1)\sqrt{\operatorname*{Var}(f(v_{y}^{0})|y)}
12Var(f(vy0)|y)O(M12)eL2(ϵ1+ϵ2)2\displaystyle\hskip 15.00002pt-\frac{1}{2}\operatorname*{Var}(f(v_{y}^{0})|y)-O(M^{-\frac{1}{2}})-eL^{2}(\epsilon_{1}+\epsilon_{2})^{2}
=NCEg(δaug)logMKO(M12),\displaystyle={\mathcal{L}}_{\mathrm{NCE}}-g(\delta_{aug})-\log\frac{M}{K}-O(M^{-\frac{1}{2}}),

where g(δaug)=236δaug2+12L2ϵ02+eL2(ϵ1+ϵ2)2+2δaug+23δaugLϵ0+(δaug+1)23δaug2+43δaugLϵ0+L2ϵ02g(\delta_{aug})=\frac{23}{6}\delta_{aug}^{2}+\frac{1}{2}L^{2}\epsilon_{0}^{2}+eL^{2}(\epsilon_{1}+\epsilon_{2})^{2}+2\delta_{aug}+\frac{2}{3}\delta_{aug}L\epsilon_{0}+(\delta_{aug}+1)\sqrt{\frac{2}{3}\delta_{aug}^{2}+\frac{4}{3}\delta_{aug}L\epsilon_{0}+L^{2}\epsilon_{0}^{2}}.

According to Oord et al. (2018), we get,

I(f(vi1);f(vi2))log(M)NCE,NCElog(M)I(f(vi1);f(vi2)).\begin{split}&I(f(v_{i}^{1});f(v_{i}^{2}))\geq\log(M)-{\mathcal{L}}_{\mathrm{NCE}},\\ &{\mathcal{L}}_{\mathrm{NCE}}\geq\log(M)-I(f(v_{i}^{1});f(v_{i}^{2})).\end{split}

Therefore, we can reformulate Theorem 2.6 as below:

^CElog(M)I(f(vi1);f(vi2))g(δaug)logMKO(M12)=log(K)I(f(vi1);f(vi2))g(δaug)O(M12).\begin{split}\hat{{\mathcal{L}}}_{\mathrm{CE}}&\geq\log(M)-I(f(v_{i}^{1});f(v_{i}^{2}))-g(\delta_{aug})-\log\frac{M}{K}-O(M^{-\frac{1}{2}})\\ &=\log(K)-I(f(v_{i}^{1});f(v_{i}^{2}))-g(\delta_{aug})-O(M^{-\frac{1}{2}}).\end{split}

A.4 Proof of Corollary 3.3

Corollary 3.3 can be proved as follows:

𝔼p(vi1,vi2)f(vi1)f(vi2)2=𝔼p(vi1,vi2)[(f(vi1)Tf(vi2)T)(f(vi1)f(vi2))]=𝔼p(vi1,vi2)[22f(vi1)Tf(vi2)]=22Ntr((H1)TH2)=(1)22Niθiλiλi′′.\begin{split}\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||^{2}&=\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}[(f(v_{i}^{1})^{T}-f(v_{i}^{2})^{T})(f(v_{i}^{1})-f(v_{i}^{2}))]\\ &=\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}[2-2f(v_{i}^{1})^{T}f(v_{i}^{2})]\\ &=2-\frac{2}{N}tr((H^{1})^{T}H^{2})\\ &\overset{(1)}{=}2-\frac{2}{N}\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}.\\ \end{split}

So (𝔼p(vi1,vi2)f(vi1)f(vi2))2𝔼p(vi1,vi2)f(vi1)f(vi2)2=22Niθiλiλi′′(\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||)^{2}\leq\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||^{2}=2-\frac{2}{N}\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}, then 𝔼p(vi1,vi2)f(vi1)f(vi2)22Niθiλiλi′′\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||\leq\sqrt{2-\frac{2}{N}\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}}.
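A small numerical check of the trace identity and the Jensen step above, with row-normalized random embedding matrices standing in for H1H^{1} and H2H^{2} (the function name is ours):

```python
import numpy as np

def check_trace_identity(N=128, d=16, seed=0):
    """Check E||f(v^1)-f(v^2)||^2 = 2 - (2/N) tr((H^1)^T H^2) and the Jensen step."""
    rng = np.random.default_rng(seed)
    H1 = rng.normal(size=(N, d)); H1 /= np.linalg.norm(H1, axis=1, keepdims=True)
    H2 = rng.normal(size=(N, d)); H2 /= np.linalg.norm(H2, axis=1, keepdims=True)
    mean_sq = np.mean(np.sum((H1 - H2) ** 2, axis=1))   # E||f(v^1) - f(v^2)||^2
    trace_form = 2 - 2 / N * np.trace(H1.T @ H2)
    # Jensen: (E||f(v^1) - f(v^2)||)^2 <= E||f(v^1) - f(v^2)||^2.
    jensen = np.mean(np.linalg.norm(H1 - H2, axis=1)) ** 2 <= mean_sq + 1e-12
    return bool(abs(mean_sq - trace_form) < 1e-9 and jensen)
```

The identity holds exactly because each row of H1H^{1} and H2H^{2} has unit norm, so the squared-distance expansion leaves only the cross term 2Ntr((H1)TH2)\frac{2}{N}tr((H^{1})^{T}H^{2}).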

Step (1) follows Liu et al. (2022), who show that tr((H1)TH2)tr((H^{1})^{T}H^{2}) can be represented as iθiλiλi′′\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}.

As we know that,

𝔼p(vi1,vi2)f(vi1)f(vi2)𝔼p(vi1,vi2)(f(vi1)f(vi0)+f(vi0)f(vi2))2δaug.\begin{split}\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||\leq\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}(||f(v_{i}^{1})-f(v_{i}^{0})||+||f(v_{i}^{0})-f(v_{i}^{2})||)\leq 2\delta_{aug}.\end{split}

Then, we can get:

2δaug𝔼p(vi1,vi2)f(vi1)f(vi2)22Niθiλiλi′′.2\delta_{aug}\geq\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}||f(v_{i}^{1})-f(v_{i}^{2})||\geq\sqrt{2-\frac{2}{N}\sum_{i}\theta_{i}\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}}. (6)

A.5 Proof of Lemma B.1

From Stewart (1990), we know the following equation:

Δλi=λiλi=uiTΔAuiλiuiTΔDui+O(ΔA).\Delta\lambda_{i}=\lambda_{i}^{{}^{\prime}}-\lambda_{i}=u_{i}^{T}\Delta Au_{i}-\lambda_{i}u_{i}^{T}\Delta Du_{i}+O(||\Delta A||).

So we can calculate the difference between λi,λi′′\lambda_{i}^{\prime},\lambda_{i}^{\prime\prime} and λi\lambda_{i},

Δλi\displaystyle\Delta\lambda_{i} =m(nui[n]ΔA[m][n])ui[m]λimui[m]ΔD[m]ui[m]+O(ΔA)\displaystyle=\sum_{m}(\sum_{n}u_{i}\left[n\right]\Delta A\left[m\right]\left[n\right])u_{i}\left[m\right]-\lambda_{i}\sum_{m}u_{i}\left[m\right]\Delta D\left[m\right]u_{i}\left[m\right]+O(||\Delta A||)
=m,nui[m]ui[n]ΔA[m][n]λim,nui[m]ΔA[m][n]ui[m]+O(ΔA).\displaystyle=\sum_{m,n}u_{i}\left[m\right]u_{i}\left[n\right]\Delta A\left[m\right]\left[n\right]-\lambda_{i}\sum_{m,n}u_{i}\left[m\right]\Delta A\left[m\right]\left[n\right]u_{i}\left[m\right]+O(||\Delta A||).

And we can directly calculate λiλi′′\lambda_{i}^{\prime}-\lambda_{i}^{\prime\prime} as below:

λiλi′′\displaystyle\lambda_{i}^{\prime}-\lambda_{i}^{\prime\prime} =ΔλiΔλi′′\displaystyle=\Delta\lambda_{i}^{\prime}-\Delta\lambda_{i}^{\prime\prime}
=m,nui[m]ui[n]ΔA^[m][n]λim,nui[m]ΔA^[m][n]ui[m]\displaystyle=\sum_{m,n}u_{i}\left[m\right]u_{i}\left[n\right]\Delta\widehat{A}\left[m\right]\left[n\right]-\lambda_{i}\sum_{m,n}u_{i}\left[m\right]\Delta\widehat{A}\left[m\right]\left[n\right]u_{i}\left[m\right]
=m,nui[m]ΔA^[m][n](ui[n]λiui[m]).\displaystyle=\sum_{m,n}u_{i}\left[m\right]\Delta\widehat{A}\left[m\right]\left[n\right](u_{i}\left[n\right]-\lambda_{i}u_{i}\left[m\right]).
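As a numerical sanity check of the first-order perturbation formula, here specialized to a plain symmetric matrix without the degree term (the function name and test setup are ours):

```python
import numpy as np

def first_order_error(n=30, eps=1e-5, seed=0):
    """Max gap between true eigenvalues of A + dA and the first-order prediction."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n)); A = (A + A.T) / 2
    dA = rng.normal(size=(n, n)); dA = (dA + dA.T) / 2 * eps
    lam, U = np.linalg.eigh(A)
    # First-order perturbation: delta_lambda_i ~= u_i^T dA u_i.
    pred = lam + np.einsum('ji,jk,ki->i', U, dA, U)
    return float(np.max(np.abs(np.linalg.eigvalsh(A + dA) - pred)))
```

The residual is second order in the perturbation, which is why the O(ΔA)O(||\Delta A||) remainder in the lemma is negligible for small augmentations.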

Appendix B GCL Methods with Spatial and Spectral Augmentation

B.1 Spatial Augmentation

Most augmentation methods explicitly or implicitly increase mutual information while maintaining a high augmentation distance. GRACE adjusts this simply by changing the drop rate of features and edges. AD-GCL (Suresh et al., 2021) directly uses the optimization objective min{aug}max{fF}I(f(v);f(aug(v)))\min_{\{aug\}}\max_{\{f\in F\}}I(f(v);f(aug(v))) to search for a stronger augmentation.
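For concreteness, the random feature/edge dropping used by GRACE-style methods can be sketched as below (a minimal NumPy sketch; the function names and the toy graph are ours, not GRACE's actual implementation):

```python
import numpy as np

def drop_edges(edge_index, p_e, rng):
    # Keep each edge independently with probability 1 - p_e.
    keep = rng.random(edge_index.shape[1]) >= p_e
    return edge_index[:, keep]

def drop_features(X, p_f, rng):
    # Zero out whole feature dimensions with probability p_f (feature masking).
    mask = rng.random(X.shape[1]) >= p_f
    return X * mask

# Two views of a toy 4-node cycle, drawn with different drop rates;
# larger drop rates mean stronger augmentation.
rng = np.random.default_rng(0)
edge_index = np.array([[0, 1, 2, 3], [1, 2, 3, 0]])
X = np.ones((4, 8))
view1 = (drop_edges(edge_index, 0.2, rng), drop_features(X, 0.2, rng))
view2 = (drop_edges(edge_index, 0.4, rng), drop_features(X, 0.4, rng))
```

Raising the drop rates decreases the mutual information between the two views while increasing the augmentation distance, which is exactly the trade-off discussed above.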

Refer to caption
Figure 4: influence of pτp_{\tau} on Cora (all the data are normalized for better visualization)

GCA (Zhu et al., 2021) consistently performs better than random dropping. This is mainly because GCA calculates node importance and masks the unimportant parts to increase mutual information. It also uses pτp_{\tau} as a cut-off probability, so all of the least important features/edges share the same drop probability pτp_{\tau}. By setting a large pτp_{\tau}, GCA can reduce the drop probability for the least important features/edges and drop more of the relatively important ones, achieving a trade-off between mutual information and augmentation distance.

From Figure 4, we can clearly see that as pτp_{\tau} increases, the augmentation distance and NCE{\mathcal{L}}_{\mathrm{NCE}} increase, which leads to better downstream performance; but when pτp_{\tau} becomes too large, we get a trivial solution. In the code provided by the authors, pτp_{\tau} is set to 0.7. So GCA performs well on downstream tasks not only because of its adaptive augmentation, but also because of its modification of the augmentation distance.
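A rough sketch of this truncated, importance-weighted dropping is below; the log-importance normalization is our simplification and may differ in detail from GCA's exact scheme:

```python
import numpy as np

def gca_drop_probs(importance, p_e=0.3, p_tau=0.7):
    # Less important edges/features get higher drop probability; the cut-off
    # p_tau caps the probability so nothing is dropped with certainty.
    s = np.log(importance + 1e-12)
    p = (s.max() - s) / (s.max() - s.mean() + 1e-12) * p_e
    return np.minimum(p, p_tau)
```

With a large pτp_{\tau}, only the probabilities of the very least important items are truncated, so relatively important items are dropped more often, which is the trade-off described above.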

B.2 Spectral Augmentation

Furthermore, we can demonstrate that many spectral augmentations follow this scheme to improve downstream performance. Liu et al. (2022) propose that dropping more high-frequency components leads to better performance. This is because the high-frequency parts are associated with larger coefficients λi\lambda_{i}, so dropping more of them increases δaug\delta_{aug} more strongly, resulting in better performance.
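Such a spectral augmentation can be sketched as shrinking the largest-magnitude eigenvalues of a symmetric adjacency matrix (an illustrative sketch under our own simplifications; the function name and the shrink-by-scaling choice are ours):

```python
import numpy as np

def shrink_high_frequency(A, frac=0.2, scale=0.5):
    """Shrink the largest-magnitude eigenvalues of a symmetric adjacency matrix."""
    lam, U = np.linalg.eigh(A)
    k = max(1, int(len(lam) * frac))
    idx = np.argsort(np.abs(lam))[-k:]   # indices of the largest |lambda_i|
    lam = lam.copy()
    lam[idx] *= scale                    # drop/shrink the high-frequency part
    return U @ np.diag(lam) @ U.T
```

Because the shrunk components carry the largest |λi||\lambda_{i}|, this perturbation changes the spectrum (and hence δaug\delta_{aug}) more than shrinking the same number of small-magnitude components would.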

Lemma B.1 (Change of Spectrum).

Assume that A=A+ΔA1A^{\prime}=A+\Delta A_{1} and A′′=A+ΔA2A^{\prime\prime}=A+\Delta A_{2}, where λi\lambda_{i}^{\prime} and λi′′\lambda_{i}^{\prime\prime} are the ithi^{th} eigenvalues of AA^{\prime} and A′′A^{\prime\prime}, respectively, ΔA^=AA′′\Delta\hat{A}=A^{\prime}-A^{\prime\prime}, and uiu_{i} is the corresponding eigenvector. Then

λiλi′′=m,nui[m]ΔA^[m][n](ui[n]λiui[m]).\begin{split}\lambda_{i}^{\prime}-\lambda_{i}^{\prime\prime}=\sum_{m,n}u_{i}\left[m\right]\Delta\widehat{A}\left[m\right]\left[n\right](u_{i}\left[n\right]-\lambda_{i}u_{i}\left[m\right]).\end{split}

Lemma B.1 is proved in Appendix A.5. Lin et al. (2022) propose to maximize the spectral difference between the two views, but Lemma B.1 shows that the difference between spectra is highly correlated with the original magnitude, so this actually encourages larger differences in the large-magnitude |λi||\lambda_{i}|. Rather than simply dropping information, they increase the spectrum of one view and decrease that of the other. If we simply assume λi=λi+n\lambda_{i}^{\prime}=\lambda_{i}+n and λi′′=λin\lambda_{i}^{\prime\prime}=\lambda_{i}-n, then λiλi′′=λi2n2λi2\lambda_{i}^{\prime}\lambda_{i}^{\prime\prime}=\lambda_{i}^{2}-n^{2}\leq\lambda_{i}^{2}, so this too can be explained by an increase in augmentation distance.

Appendix C Further Explanation

C.1 Value of θ\thetas

As defined by Liu et al. (2022), the θ\thetas are linear combinations of the eigenvalues of the adjacency matrix 𝑨{\bm{A}}. To demonstrate what the θ\thetas actually are, we first focus on the assumption below.

Assumption C.1 (High-order Proximity).

𝑴=w0+w1𝑨+w2𝑨𝟐++wq𝑨𝒒\bm{M}=w_{0}+w_{1}\bm{A}+w_{2}\bm{A^{2}}+\dots+w_{q}\bm{A^{q}}, where 𝑴=X1WWT(X2)T\bm{M}=X^{1}W\cdot W^{T}(X^{2})^{T}, 𝑨𝒊\bm{A^{i}} means matrix multiplications between ii 𝑨\bm{A}s, and wiw_{i} is the weight of that term.

Here X1,X2X^{1},X^{2} denote the feature matrices of graphs 𝒢1,𝒢2{\mathcal{G}}^{1},{\mathcal{G}}^{2} and WW stands for the parameters of the model, so 𝑴=X1WWT(X2)T\bm{M}=X^{1}W\cdot W^{T}(X^{2})^{T} is the embedding similarity between the two views, which can be roughly represented by a weighted sum of different powers of 𝑨\bm{A}. Furthermore, we have:

{w0+w1λ1++wqλ1q=θ1w0+w1λ2++wqλ2q=θ2w0+w1λN++wqλNq=θN,\left\{\begin{aligned} w_{0}+w_{1}\lambda_{1}+&\dots+w_{q}\lambda_{1}^{q}=\theta_{1}\\ w_{0}+w_{1}\lambda_{2}+&\dots+w_{q}\lambda_{2}^{q}=\theta_{2}\\ &\dots\\ w_{0}+w_{1}\lambda_{N}+&\dots+w_{q}\lambda_{N}^{q}=\theta_{N},\\ \end{aligned}\right.\\

where λ1,,λN\lambda_{1},...,\lambda_{N} are the NN eigenvalues of the adjacency matrix 𝑨{\bm{A}}.
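The system above can be evaluated directly as a Vandermonde product (a small sketch; the function name is ours):

```python
import numpy as np

def thetas_from_weights(eigvals, w):
    # theta_i = w_0 + w_1 * lambda_i + ... + w_q * lambda_i^q (the system above).
    V = np.vander(np.asarray(eigvals), N=len(w), increasing=True)
    return V @ np.asarray(w)
```

For instance, weighting only the even powers positively (w = [1, 0, 1], i.e. θi=1+λi2\theta_{i}=1+\lambda_{i}^{2}) makes every θi\theta_{i} positive regardless of the sign of λi\lambda_{i}.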

Refer to caption
Figure 5: Percentage of positive θ\theta

So the θ\thetas are linear combinations of the λ\lambdas. As the model is trained to minimize NCE{\mathcal{L}}_{\mathrm{NCE}}, the θ\thetas are expected to increase. If we simply set the even-order weights w0,w2,,w2q/2w_{0},w_{2},...,w_{2\lfloor q/2\rfloor} to be positive and the other wiw_{i} to 0, all the θ\thetas are positive, and the model can easily find better wws.

We can thus expect the θ\thetas to be mostly positive during training, and the experiments shown in Figure 5 confirm this.

Refer to caption
Figure 6: relationship of PCD, NCD and performance on images

C.2 PCS, NCS and Downstream Performance

More experiments on various datasets, shown in Figure 7, demonstrate that our finding generalizes rather than being limited to a few datasets. They show a similar tendency: as the dropout rate increases, the downstream accuracy first increases and then decreases when the augmentation becomes too strong. These experiments also show that when the downstream accuracy increases, the positive center distance sometimes increases as well, and the better downstream performance is mainly caused by the increasing negative center distance.

Refer to caption
Figure 7: More experiments on PCS and NCS; the exact values differ slightly due to randomness, but they show a similar tendency

We also conduct experiments on images to verify our theory. We control the magnitude of augmentation by adjusting the color distortion strength, and the results are normalized by Min-Max normalization. From Figure 6, we can observe that the downstream performance is closely correlated with the negative center distance. In particular, when the color distortion strength changes from 0.2 to 0.6, the positive center distance increases while downstream performance is also increasing; but when the color distortion is greater than 0.6, the positive center distance tends to decrease. This aligns with our finding in Theorem 2.4 that as the augmentation gets stronger, the negative center distance increases while the positive center distance does not change in a specific pattern. Also, the color distortion is not strong enough to change the label information, so the downstream performance keeps increasing with stronger augmentation.

C.3 Change of δaug\delta_{aug} and Label Consistency

Refer to caption
Figure 8: relationship between δaug\delta_{aug}, KL divergence and augmentation

To verify how δaug\delta_{aug} changes with stronger augmentation, we use the drop rate of edges/features as the data augmentation and find that as the drop rate increases, δaug\delta_{aug} also tends to increase. To verify the view-invariance assumption, we first train a well-conditioned model and use its prediction as p(vi)p(v_{i}); we then change the drop rate and calculate the new p(vi)p^{\prime}(v_{i}). From Figure 8 we observe that although the KL divergence increases with the drop rate, its magnitude remains quite small, so the label is consistent under data augmentation.
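The KL divergence between the two prediction matrices can be computed with a short helper (ours; rows are per-node class-probability vectors):

```python
import numpy as np

def mean_kl(p, q, eps=1e-12):
    """Average KL(p_i || q_i) over nodes between two prediction matrices."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

A small value of `mean_kl(p, p_prime)` indicates that the predicted label distribution is nearly unchanged by the augmentation, which is the view-invariance being verified above.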

C.4 Augmentation Free Methods

In this paper, we mainly discuss how the augmentation will affect the contrastive performance, but actually, GCL methods with or without augmentation aim for the same, they both try to align intra-class nodes and separate inter-class nodes. However, during contrastive learning, label information is not accessible, so they use different methods to get intra-class nodes.

  • GCL methods with augmentation create intra-class nodes by data augmentation, so it is necessary to control the strength of augmentation to ensure label consistency. Augmentation brings more flexibility, as one can freely change the topology and features of the graph, so a good GCL method with augmentation always requires a well-designed data augmentation. This can lead to great performance, but such methods are more time-consuming and may overlook the unique properties of graphs.

  • GCL methods without augmentation instead find intra-class nodes by other means. For example, GMI (Peng et al., 2020) and iGCL (Li et al., 2023) try to align the anchor with its neighbors and similar nodes (which are more likely to hold the same label), and SUGRL (Mo et al., 2022) creates intra-class nodes by two different embedding generation methods. Label-based methods like SupCon (Khosla et al., 2020) directly align samples of the same class. These methods take advantage of inherent properties of the dataset, such as homophily and the similarity between intra-class samples, but their positive sample construction is not as flexible as augmentation.

Therefore, GCL methods with or without augmentation are inherently the same: both align positive samples; they merely create the positive samples differently. Our analysis focuses on the difference between two positive samples, so it can also be applied to those methods.

C.5 Change in Positive/Negative Pair Similarity

The InfoNCE loss NCE{\mathcal{L}}_{\mathrm{NCE}} can be written as NCE=𝔼p(vi1,vi2)𝔼p(vi)[logexp(f(vi1)Tf(vi2)){vi}exp(f(vi1)Tf(vi))]{\mathcal{L}}_{\mathrm{NCE}}=\mathbb{E}_{p(v_{i}^{1},v_{i}^{2})}\mathbb{E}_{p({v_{i}^{-}})}\left[-\log\frac{\exp(f(v_{i}^{1})^{T}f(v_{i}^{2}))}{\sum_{\{v^{-}_{i}\}}\exp(f(v_{i}^{1})^{T}f(v^{-}_{i}))}\right]. When we perform stronger augmentation, f(vi1)Tf(vi2)f(v_{i}^{1})^{T}f(v_{i}^{2}) becomes harder to maximize, so the model tries harder to minimize f(vi1)Tf(vi)f(v_{i}^{1})^{T}f(v^{-}_{i}). From Figure 9, when the augmentation gets stronger, the negative and positive pair similarities both decrease, so the class-separating performance is enhanced.
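As a minimal sketch, this loss can be computed as below (our own simplified variant with inter-view negatives only; GRACE additionally uses intra-view negatives and a temperature-specific formulation):

```python
import numpy as np

def info_nce(H1, H2, tau=0.4):
    """InfoNCE over two views: diagonal entries are positives, the rest negatives."""
    H1 = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    H2 = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    sim = H1 @ H2.T / tau                       # sim[i, j] = f(v_i^1)^T f(v_j^2) / tau
    sim = sim - sim.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

When positive pairs are well aligned the diagonal dominates each row and the loss is low; misaligned positives force the model to push down the negative similarities instead, as described above.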

Refer to caption
Figure 9: sim+\mathrm{sim+} represents the positive pair similarity f(vi1)Tf(vi2)f(v_{i}^{1})^{T}f(v_{i}^{2}), and sim\mathrm{sim-} is negative pair similarity f(vi1)Tf(vi)f(v_{i}^{1})^{T}f(v_{i}^{-}), the x-axis stands for dropout rate on edges

Appendix D Experiments

D.1 Datasets and Experimental Details

We choose six commonly used datasets for evaluation: Cora, CiteSeer, PubMed, DBLP, Amazon-Photo and Amazon-Computers. The first four are citation networks (Sen et al., 2008; Yang et al., 2016; Bojchevski & Günnemann, 2017), where nodes represent papers, edges are the citation relationships between papers, node features are bag-of-words vectors of the papers, and labels represent the fields of the papers. In Amazon-Photo and Amazon-Computers (Shchur et al., 2018), nodes represent products and edges indicate that two products are frequently bought together; the node features are also bag-of-words vectors of comments, and labels represent the category of the product.

We use 2 layers of GCNConv as the backbone of the encoder and feature/edge drop as data augmentation; the augmentation is redrawn randomly every epoch, and the InfoNCE loss is optimized by Adam. After performing contrastive learning, we use logistic regression (with the liblinear solver) for downstream classification, and in all 6 datasets we randomly choose 10% of nodes for training and the rest for testing.
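The evaluation step of this protocol can be sketched as follows (a minimal version assuming scikit-learn is available; `linear_probe` is our name for the helper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe(Z, y, train_ratio=0.1, seed=0):
    """Evaluate frozen embeddings Z: liblinear logistic regression, 10% train split."""
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, y, train_size=train_ratio, random_state=seed, stratify=y)
    clf = LogisticRegression(solver="liblinear").fit(Z_tr, y_tr)
    return clf.score(Z_te, y_te)
```

The encoder is frozen during this step, so the returned accuracy measures only how linearly separable the contrastively learned embeddings are.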

Table 2: Dataset statistics
Dataset Nodes Edges Features Classes
Cora 2,708 5,429 1,433 7
Citeseer 3,327 4,732 3,703 6
Pubmed 19,717 44,338 500 3
DBLP 17,716 105,734 1,639 4
Amazon-Photo 7,650 119,081 745 8
Amazon-Computers 13,752 245,861 767 10
Table 3: Dataset download links
Dataset Download Link
Cora https://github.com/kimiyoung/planetoid/raw/master/data
Citeseer https://github.com//kimiyoung/planetoid/raw/master/data
Pubmed https://github.com/kimiyoung/planetoid/raw/master/data
DBLP https://github.com/abojchevski/graph2gauss/raw/master/data/dblp.npz
Amazon-Photo https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_photo.npz
Amazon-Computers https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz

The publicly available implementations of the baselines can be found at the following URLs:

D.2 Hyperparameter Setting

Table 4: Hyperparameters settings
Dataset Learning rate Weight decay Num layers $\tau$ Epochs Hidden dim Activation
Cora $5^{-4}$ $10^{-6}$ 2 0.4 200 128 ReLU
Citeseer $10^{-4}$ $10^{-6}$ 2 0.9 200 256 PReLU
Pubmed $10^{-4}$ $10^{-6}$ 2 0.7 200 256 ReLU
DBLP $10^{-4}$ $10^{-6}$ 2 0.7 200 256 ReLU
Amazon-Photo $10^{-4}$ $10^{-6}$ 2 0.3 200 256 ReLU
Amazon-Computers $10^{-4}$ $10^{-6}$ 2 0.2 200 128 RReLU

The hyperparameter settings are shown in Table 4; hyperparameters specific to a single algorithm are set as in the original papers. Our methods introduce two hyperparameters, the retain rate $\xi$ and the spectrum change magnitude $\alpha$, which we select from 0.05 to 0.45 and from -0.1 to 0.01, respectively.

D.3 Changes on the Spectrum

Refer to caption
Figure 10: We perform spectrum augmentation every 10 epochs, so the x-axis is epoch/10. The y-axis of the left figure is the number of decreasing $\lambda$s minus the number of increasing $\lambda$s; for the middle one, the y-axis shows how much the $\lambda$s decrease on average; and the right one shows the average value of $\lambda$.

From Figure 10(a), we can see that during training the $\theta$s mostly increase gradually; when we perform spectrum augmentation, they no longer increase as before, and the number of increasing $\theta$s becomes close to, or even smaller than, the number of decreasing ones. We then take a step back on the decreasing ones, which makes the $\theta$s increase again in the next epoch. Therefore, what we actually do is first perform augmentation to maximize the augmentation distance, and then maximize the mutual information after spectrum augmentation. The idea is similar to that of AD-GCL, but we use the $\theta$s to indicate whether the augmentation is too strong, so we obtain a more reasonable result. Figures 10(b) and (c) show that as training proceeds, the changes on larger-magnitude eigenvalues also become more significant, making the spectrum smoother.

Note that when we apply the spectrum smoothing method, we indirectly change the edge weights, which makes the augmentation weaker or stronger, since dropping an edge with weight 1 differs from dropping an edge with weight $1+\text{noise}$. To reduce this influence, we conduct extra augmentation or recovery based on the average weight change.
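A minimal sketch of this kind of spectrum modification, assuming a symmetric adjacency matrix and a uniform eigenvalue rescaling; the function name and the single scaling factor are illustrative, whereas the actual method adapts each $\lambda$ and steps back based on the InfoNCE loss.

```python
import numpy as np

def spectrum_augment(adj, alpha=-0.05):
    """Perturb the spectrum of a symmetric adjacency matrix and rebuild it.

    Every eigenvalue is scaled by (1 + alpha), so |lambda| shrinks when
    alpha < 0; the rebuilt matrix is a weighted adjacency matrix, which is
    why the edge weights change as a side effect.
    """
    lam, U = np.linalg.eigh(adj)           # adj = U diag(lam) U^T
    lam_new = (1.0 + alpha) * lam
    return U @ np.diag(lam_new) @ U.T

# toy symmetric graph: a star with two leaves
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
aug = spectrum_augment(adj, alpha=-0.1)
```

The recovered matrix is generally dense with non-binary weights, which is the edge-weight side effect discussed above.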

D.4 Center Distance

As mentioned earlier, GCL mainly contributes to downstream tasks by increasing the negative center distance while maintaining a relatively small distance to the positive center. We propose two methods. The first increases the mutual information between the two views while keeping a high augmentation distance by masking more unimportant edges or features; this allows the model to learn more useful information, which pulls each node closer to its positive center. The second increases the augmentation distance while maintaining a relatively high mutual information, so it may not learn as much useful information; however, the larger augmentation distance forces the model to push nodes from different classes further apart. In summary, the first method brings nodes of the same class closer together, while the second separates nodes from different classes further apart, as shown in Figure 11.
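The first method's masking scheme can be sketched as follows; `importance_mask`, the retain-rate semantics, and the toy scores are illustrative assumptions rather than the exact released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_mask(importance, retain_rate, drop_prob):
    """Always keep the top `retain_rate` fraction of edges/features by
    importance; drop each remaining (unimportant) one with probability
    `drop_prob`. Returns a boolean keep-mask."""
    n = len(importance)
    n_keep = int(np.ceil(retain_rate * n))
    order = np.argsort(-importance)        # most important first
    keep = np.zeros(n, dtype=bool)
    keep[order[:n_keep]] = True            # important ones always survive
    unimportant = ~keep
    keep[unimportant] = rng.random(unimportant.sum()) > drop_prob
    return keep

# toy importance scores for 5 edges (or feature dimensions)
imp = np.array([0.9, 0.1, 0.8, 0.2, 0.05])
mask = importance_mask(imp, retain_rate=0.4, drop_prob=0.5)
```

Because the dropped mass is concentrated on unimportant entries, the two views stay mutually informative even under a large overall dropout rate.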

Refer to caption
Figure 11: Distance of nodes to their positive and negative centers. GRACE stands for pure GRACE, GRACEI for GRACE with information augmentation, and GRACES for GRACE with spectrum augmentation.
Refer to caption
Figure 12: The error bars of the algorithms

D.5 Hyperparameter Sensitivity

Refer to caption
Figure 13: Accuracy on downstream tasks with different retain rates
Refer to caption
Figure 14: Accuracy on downstream tasks with different α\alpha

Analysis of retain rate. The retain rate controls how many important features/edges we keep and how many unimportant ones we drop. We can see from Figure 13 that AD-GCL benefits from a larger retain rate, as it is designed to minimize the mutual information and thus drops many vital structures. Larger datasets like PubMed and DBLP benefit less, mainly because a graph with more edges is more likely to retain enough information than a graph with few edges. For example, after a 30% edge dropout, a graph with 1,000 edges would still keep enough information for downstream tasks, but a graph with 10 edges would probably lose some vital information.

Analysis of $\alpha$. $\alpha$ controls how much $|\lambda|$ decreases; since we take a step back whenever $|\lambda|$ decreases too much, the hyperparameter $\alpha$ does not matter much. Still, as shown in Figure 14, the method performs more stably on large graphs, since a wrong modification of a single $\lambda$ matters less than on small graphs.

D.6 Time Complexity and Error Bar

Table 5: The time consumption (seconds) of algorithms
Cora CiteSeer PubMed DBLP Amazon-P Amazon-C
GRACE 8.02 10.08 62.37 56.89 19.05 28.71
GRACE+I 10.74 13.49 68.97 62.8 22.67 29.61
GRACE+S 9.61 12.46 78.11 69.44 21.13 36.94

From Table 5, we can observe that the information augmentation method achieves better performance with only a little extra time. This is mainly because we do not calculate the importance of features/edges every epoch as GCS (Wei et al., 2023) does; we calculate it once and reuse the same importance throughout training. However, the spectrum augmentation method consumes more time on large graphs like PubMed and DBLP, mainly because we explicitly change the spectrum and recompute the adjacency matrix. This could be replaced by approximation methods, but to prevent interference from random noise and to show that Theorem 3.2 is meaningful, we still conduct a full eigendecomposition; the time complexity could be reduced by approximation methods (Liu et al., 2022).
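The one-off importance computation can be sketched as below; the `edge_importance` score (endpoint feature similarity) is purely illustrative, but the structure shows why the per-epoch overhead stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_importance(edge_index, x):
    """Toy importance score: dot product of the endpoint features."""
    src, dst = edge_index
    return np.einsum('ij,ij->i', x[src], x[dst])

# compute importance ONCE before training, then reuse it every epoch,
# instead of recomputing it per epoch as GCS does
x = rng.standard_normal((6, 4))
edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
imp = edge_importance(edge_index, x)       # one-off cost

for epoch in range(3):                     # training-loop sketch
    # unimportant edges are dropped more often; only cheap sampling per epoch
    keep = rng.random(imp.size) < 0.5 + 0.5 * (imp > np.median(imp))
    view = edge_index[:, keep]             # augmented view for this epoch
```

Amortized over hundreds of epochs, the single importance pass adds only the small overhead seen in Table 5.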

The error bars are reported in Figure 12; the experiments are repeated 10 times. We can observe that both information augmentation and spectrum augmentation achieve better results, and they perform stably.

D.7 Combination of Information&Spectrum Augmentation

Table 6: Combine the information&Spectrum Augmentation (GRACE+IS)
Cora CiteSeer PubMed DBLP Amazon-P Amazon-C
GRACE 82.52±0.75 70.44±1.49 84.97±0.17 84.01±0.34 91.17±0.15 86.36±0.32
GRACE+I 83.78±1.08 72.89±0.97 84.97±0.14 84.80±0.17 91.64±0.21 84.54±0.53
GRACE+S 83.61±0.85 72.83±0.47 85.45±0.25 84.83±0.18 91.99±0.35 87.67±0.33
GRACE+IS 84.58±0.79 72.94±0.52 85.62±0.17 84.87±0.25 92.04±0.32 87.73±0.41

We combine the information augmentation and spectrum augmentation methods and show the results in Table 6. Combining the two methods achieves the best performance. For larger and denser graphs, the information can still be well preserved even after strong augmentation, which renders information augmentation less powerful than on smaller graphs. The spectrum augmentation modifies the spectrum based on the InfoNCE loss, which is more stable on larger graphs, so it helps more significantly there.

Appendix E Related Work

Graph Contrastive Learning. Graph Contrastive Learning has shown its superiority, and many researchers are working on it. DGI (Veličković et al., 2018) contrasts local node embeddings with a global summary vector; GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021), and GraphCL (You et al., 2020) randomly drop edges and features; AD-GCL (Suresh et al., 2021) and InfoGCL (Xu et al., 2021) learn adaptive augmentations guided by different principles. From a theoretical perspective, Liu et al. (2022) correlate the InfoNCE loss with the graph spectrum and propose that augmentation should focus more on high-frequency parts. Guo et al. (2023) further discuss how contrastive learning on graphs differs from that on images. Lin et al. (2022) argue that augmentations maximizing the spectrum difference help, and Yuan et al. (2022) analyse GCL with information theory.

Contrastive Learning Theory. By linking downstream classification and contrastive learning objectives, Arora et al. (2019) propose a theoretical generalization guarantee. Ash et al. (2021) further explore how the number of negative samples influences generalization. Tian et al. (2020) and Wang et al. (2022a) discuss what kind of augmentation benefits downstream performance. Wang & Isola (2020) propose that perfect alignment and uniformity are the key to success, while Wang et al. (2022b) argue that augmentation overlap together with alignment helps gather intra-class nodes under stronger augmentation. However, Saunshi et al. (2022) show that augmentation overlap is actually quite rare even when downstream performance is satisfying. So the reason why contrastive learning helps remains open; in this paper we propose that stronger augmentation mainly helps contrastive learning by separating inter-class nodes, and, unlike previous works (Wang et al., 2022b; Wang & Isola, 2020; Huang et al., 2021), we do not treat perfect alignment as the key to success; instead, stronger augmentation that yields imperfect alignment can help.