
Rethinking Spectral Augmentation for Contrast-based Graph Self-Supervised Learning

Xiangru Jian
Department of Computer Science
University of Waterloo
[email protected]
Xinjian Zhao
School of Data Science
The Chinese University of Hong Kong, Shenzhen
[email protected]
Wei Pang
Department of Computer Science
University of Waterloo
Vector Institute
[email protected]
Chaolong Ying
School of Data Science
The Chinese University of Hong Kong, Shenzhen
[email protected]
Yimu Wang
Department of Computer Science
University of Waterloo
[email protected]
Yaoyao Xu
School of Data Science
The Chinese University of Hong Kong, Shenzhen
[email protected]
Tianshu Yu
School of Data Science
The Chinese University of Hong Kong, Shenzhen
[email protected]
Xiangru Jian, Xinjian Zhao, and Wei Pang contributed equally to this paper. Corresponding author.
Abstract

The recent surge in contrast-based graph self-supervised learning has prominently featured an intensified exploration of spectral cues. Spectral augmentation, which involves modifying a graph’s spectral properties such as eigenvalues or eigenvectors, is widely believed to enhance model performance. However, an intriguing paradox emerges, as methods grounded in seemingly conflicting assumptions regarding the spectral domain demonstrate notable enhancements in learning performance. Through extensive empirical studies, we find that simple edge perturbations - random edge dropping for node-level and random edge adding for graph-level self-supervised learning - consistently yield comparable or superior performance while being significantly more computationally efficient. This suggests that the computational overhead of sophisticated spectral augmentations may not justify their practical benefits. Our theoretical analysis of the InfoNCE loss bounds for shallow GNNs further supports this observation. The proposed insights represent a significant leap forward in the field, potentially refining the understanding and implementation of graph self-supervised learning.

1 Introduction

In recent years, graph learning has emerged as a powerhouse for handling complex data relationships in multiple fields, offering vast potential and value, particularly in domains such as data mining [12], computer vision [41], network analysis [6], and bioinformatics [15]. However, limited labels make graph learning challenging to apply in real-world scenarios. Inspired by the great success of Self-Supervised Learning (SSL) in other domains [8, 5], Graph Self-Supervised Learning (Graph SSL) has made rapid progress and has shown promise by achieving state-of-the-art performance on many tasks [40], among which Contrast-based Graph SSL (CG-SSL) is the most dominant [23]. This type of method is grounded in the concept of mutual information (MI) maximization: the primary goal is to maximize the estimated MI between augmented instances of the same object, such as nodes, subgraphs, or entire graphs. Among recent developments in CG-SSL, approaches inspired by graph spectral methods have garnered significant attention. A prevalent conviction is that spectral information, including the eigenvalues and eigenvectors of the graph's Laplacian, plays a crucial role in enhancing the efficacy of CG-SSL [21, 17, 19, 42, 4].

In general, methods in CG-SSL can be categorized into two types based on whether augmentation is performed on the input graph to generate different views [4], i.e., augmentation-based and augmentation-free methods. Of the two, augmentation-based methods are more prevalent and widely studied [13, 23, 43, 21, 19, 42]. Spectral augmentation in particular has received significant attention, as it modifies a graph's spectral properties and is believed to enhance model performance, aligning with the proposed importance of spectral information in CG-SSL. However, there seems to be no consensus on the true effectiveness of spectral information in the previous works proposing and studying spectral augmentation. SpCo [21] introduces the general graph augmentation (GAME) rule, which suggests that the difference in the high-frequency parts between augmented graphs should be larger than that of the low-frequency parts. SPAN [19] contends that effective topology augmentation should prioritize perturbing sensitive edges that have a substantial impact on the graph spectrum; accordingly, it designs a principled augmentation that directly maximizes spectral change under a fixed perturbation budget, without targeting any specific region of the spectrum. GASSER [42] selectively perturbs graph structures based on spectral cues to better maintain the invariance required by contrastive learning frameworks; specifically, it aims to preserve task-relevant frequency components and carefully perturb the task-irrelevant ones. While all three methods are augmentation-based and operate within common CG-SSL frameworks such as GRACE [49] and MVGRL [13], a contradiction emerges among these works on spectral augmentation: SPAN advocates maximizing the distance between the spectra of the augmented graphs regardless of spectral domain, whereas SpCo and GASSER argue for preserving specific spectral components and domains during augmentation. The consistent performance gains derived from opposing design principles naturally raise our concern:

   •  Are spectral augmentations necessary in contrast-based graph SSL?

Given the question, this study aims to critically evaluate the effectiveness and significance of spectral augmentation in contrast-based graph SSL frameworks (CG-SSL). With evidence-supported claims and findings in the following sections, we show that despite their computational complexity, sophisticated spectral augmentations do not demonstrate clear advantages over simple edge perturbations. Our extensive experiments reveal that straightforward edge perturbations consistently achieve superior performance while being significantly more computationally efficient. Our theoretical analysis on the InfoNCE loss bounds for shallow GNNs provides additional insights into understanding this phenomenon and supports our claims. We elaborate on our findings through a series of studies carried out in the following efforts:

  1. 1.

    In Sec. 4, we demonstrate that shallow networks consistently achieve better performance in CG-SSL, analyze their inherent limitations in capturing global spectral information, and provide theoretical bounds on the InfoNCE loss that help explain the limited benefits of sophisticated spectral augmentations compared to simple edge perturbation.

  2. 2.

    In Sec. 5, we claim that simple edge perturbation techniques, i.e., randomly adding edges to or dropping edges from the graph, not only compete with but often outperform spectral augmentations, without any significant help from spectral cues. To support this claim:

    (a) In Sec. 6, the overall model performance (test accuracy) with four state-of-the-art frameworks on both node- and graph-level classification tasks supports the superiority of simple edge perturbation.
    (b) Studies in Sec. 7.1 show that the average spectra of graphs augmented by edge perturbation with optimal parameters are nearly indistinguishable across datasets, no matter how different the spectra of the original graphs are. This indicates that GNN encoders can hardly learn spectral information from these augmented graphs, i.e., edge perturbation cannot benefit from spectral information.
    (c) In Sec. 7.2, we analyze the effectiveness of the state-of-the-art spectral augmentation baseline (i.e., SPAN) by further perturbing edges to alter the spectral characteristics of its augmented graphs and examining the impact on model performance. The results show no performance degradation, indicating that the spectral information contained in the augmentation is not significant for model performance.
    (d) In Appendix E.3, a statistical analysis shows that spectral properties are statistically not a key factor in model performance, arguing that spectral information is not the main reason edge perturbation works well.

2 Related work

Contrast-based Graph Self-Supervised Learning (CG-SSL). CG-SSL alleviates the limitations of supervised learning, which heavily depends on labeled data and often suffers from limited generalization [22]. This makes it a promising approach for real-world applications where labeled data is scarce.

CG-SSL applies a variety of augmentations to the training graph to obtain augmented views. These augmented views, which are derived from the same original graph, are treated as positive sample pairs or sets. The key objective of CG-SSL is to maximize the mutual information between these views to learn robust and invariant representations. However, directly computing the mutual information of graph representations is challenging. Hence, in practice, CG-SSL frameworks aim to maximize the lower bound of mutual information using different estimators such as InfoNCE [11], Jensen-Shannon [26], and Donsker-Varadhan [1]. For instance, frameworks like GRACE [49], GCC [29], and GCA [50] utilize the InfoNCE estimator as their objective function. On the other hand, MVGRL [13] and InfoGraph [34] adopt the Jensen-Shannon estimator.
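As a concrete illustration of the InfoNCE objective referenced above, the following is a minimal sketch rather than the exact implementation used by GRACE or GCC: the positive pair is the same node in the two augmented views, every other node serves as a negative, and the temperature `tau` is a hypothetical hyperparameter.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Minimal InfoNCE estimator: z1[i] and z2[i] embed the same node in two
    augmented views; all other rows of z2 act as negatives for z1[i]."""
    z1 = F.normalize(z1, dim=1)          # cosine similarity via dot products
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau              # (n, n) similarity matrix
    labels = torch.arange(z1.size(0))    # positives sit on the diagonal
    return F.cross_entropy(sim, labels)  # -log softmax of each positive entry

# usage: loss = info_nce(encoder(view_1), encoder(view_2))
```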

Some CG-SSL methods explore alternative principles. G-BT [2] extends the redundancy-reduction principle by decorrelating representations between two augmented views to prevent feature collapse. BGRL [35] adopts a momentum-driven Siamese architecture, using node feature masking and edge modification as augmentations to maximize mutual information between online and target network representations.

Graph Augmentations in CG-SSL. Beyond the choice of objective functions, another crucial aspect of augmentation-based methods in CG-SSL is the selection of augmentation techniques. Early work by [49] and [43] introduced several domain-agnostic heuristic graph augmentations for CG-SSL, such as edge perturbation, attribute masking, and subgraph sampling. These straightforward and effective methods have been widely adopted in subsequent CG-SSL frameworks due to their demonstrated success [35, 44]. However, these domain-agnostic graph augmentations often lack interpretability, making it difficult to understand their exact impact on the graph structure and learning outcomes.

To address this issue, MVGRL [13] introduces graph diffusion as an augmentation strategy, where the original graph provides local structural information and the diffused graph offers global context.

Moreover, three spectral augmentation methods, SpCo [21], GASSER [42], and SPAN [19], stand out by offering design principles based on spectral graph theory, focusing on how to enhance CG-SSL performance through spectral manipulations.

However, our explorations show that these methods are unable to consistently outperform heuristic graph augmentations such as edge perturbation (DropEdge or AddEdge) under fair comparisons, and thus their design principles still require further validation.

3 Preliminary study

Contrast-based graph self-supervised learning framework. CG-SSL captures invariant features of a graph by generating multiple views (typically two) through augmentations and then maximizing the mutual information between these views [40]. This approach is ultimately used to improve performance on various downstream tasks. Following previous work [39, 22, 40], we first denote the generic form of the augmentation $\mathcal{T}$ and the objective function $\mathcal{L}_{cl}$ of graph contrastive learning. Given a graph $\mathcal{G}=(\mathbf{A},\mathbf{X})$ with adjacency matrix $\mathbf{A}$ and feature matrix $\mathbf{X}$, the augmentation is defined as a transformation function $\mathcal{T}$. In this paper, we are mainly concerned with topological augmentation, in which the feature matrix $\mathbf{X}$ remains intact:

$\widetilde{\mathbf{A}},\,\widetilde{\mathbf{X}} = \mathcal{T}(\mathbf{A},\mathbf{X}) = \mathcal{T}(\mathbf{A}),\,\mathbf{X}$ (1)

In practice, two augmented views of the graph are generated, denoted as $\mathcal{G}^{(1)}=\mathcal{G}(\mathcal{T}_{1}(\mathbf{A},\mathbf{X}))$ and $\mathcal{G}^{(2)}=\mathcal{G}(\mathcal{T}_{2}(\mathbf{A},\mathbf{X}))$. The objective of CG-SSL is to learn representations by minimizing the contrastive loss $\mathcal{L}_{cl}$ between the augmented views:

$\theta^{*},\phi^{*} = \underset{\theta,\phi}{\arg\min}\,\mathcal{L}_{cl}\left(p_{\phi}\left(f_{\theta}\left(\mathcal{G}^{(1)}\right),f_{\theta}\left(\mathcal{G}^{(2)}\right)\right)\right),$ (2)

where $f_{\theta}$ represents the graph encoder parameterized by $\theta$, and $p_{\phi}$ is a projection head parameterized by $\phi$. The goal is to find the optimal parameters $\theta^{*}$ and $\phi^{*}$ that minimize the contrastive loss.
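To make the generic formulation in Eq. (2) concrete, the sketch below runs one contrastive training step. The one-layer encoder, the random edge-dropping transformation, the linear projection head, and the cosine-similarity loss are all placeholders of our own (any MI estimator such as InfoNCE or Jensen-Shannon would be plugged in as $\mathcal{L}_{cl}$ in practice), so this is an illustrative, assumption-laden toy rather than any specific framework's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerGCN(nn.Module):
    """Toy one-layer encoder f_theta: H = ReLU(A_hat X W). A dense normalized
    adjacency is assumed purely for brevity; real code uses sparse ops."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, a_hat, x):
        return torch.relu(a_hat @ self.lin(x))

def random_drop(a, p=0.2):
    """Stand-in for T_1 / T_2: keep each entry of A with probability 1 - p."""
    return a * (torch.rand_like(a) > p).float()

# one contrastive step instantiating Eq. (2); sizes and the loss are placeholders
n, d = 100, 16
a = (torch.rand(n, n) < 0.05).float()
a = ((a + a.t()) > 0).float()                          # symmetrize the toy adjacency
x = torch.randn(n, d)

encoder = OneLayerGCN(d, 32)                           # f_theta
projector = nn.Linear(32, 32)                          # p_phi
opt = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()), lr=1e-3)

z1 = projector(encoder(random_drop(a), x))             # view G^(1)
z2 = projector(encoder(random_drop(a), x))             # view G^(2)
loss = -F.cosine_similarity(z1, z2, dim=1).mean()      # placeholder L_cl (InfoNCE/JSD/BT in practice)
opt.zero_grad(); loss.backward(); opt.step()
```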

In this paper, we utilize four prominent CG-SSL frameworks to study the effect of spectral augmentation: MVGRL, GRACE, BGRL, and G-BT. MVGRL introduces graph diffusion as its augmentation, while the other three frameworks use edge perturbation. Each framework employs a different strategy for its contrastive loss function: MVGRL and GRACE use the Jensen-Shannon and InfoNCE estimators as objective functions, respectively, whereas BGRL and G-BT adopt the BYOL loss [9] and the Barlow Twins loss [45], which are designed to maximize the agreement between the augmented views without relying on negative samples. A more detailed description of the loss functions can be found in Appendix C.

Graph spectrum and spectral augmentation. We follow the standard definition of the graph spectrum in this study; details can be found in Appendix B. Among various augmentation strategies proposed to enhance the robustness and generalization of graph neural networks, spectral augmentation has been considered a promising avenue [19, 21, 3, 42]. Spectral augmentation typically involves implicit modifications to the eigenvalues of the graph Laplacian, aiming to enhance model performance by encouraging invariance to certain spectral properties. Among these methods, SPAN achieves state-of-the-art performance in both node and graph classification. In short, SPAN elaborates two augmentation functions, $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$, where $\mathcal{T}_{1}$ maximizes the spectral norm in one view and $\mathcal{T}_{2}$ minimizes it in the other. These two augmentations are then implemented in the four CG-SSL frameworks mentioned above (strict definition in Appendix B). This paradigm aims to let the GNN encoder focus on robust spectral components and ignore the sensitive edges that change the spectrum drastically when perturbed.
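Since the rest of the paper repeatedly refers to the spectrum of the symmetrically normalized Laplacian (formally defined in Appendix B), a minimal dense-matrix sketch of how such a spectrum can be computed is given below; it is only practical for small graphs and is not the optimization procedure SPAN itself uses.

```python
import numpy as np

def normalized_laplacian_spectrum(adj: np.ndarray) -> np.ndarray:
    """Eigenvalues of L_sym = I - D^{-1/2} A D^{-1/2}; they all lie in [0, 2]."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    l_sym = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(l_sym))

# toy usage on a random symmetric 0/1 adjacency matrix
rng = np.random.default_rng(0)
a = np.triu((rng.random((50, 50)) < 0.1).astype(float), 1)
a = a + a.T
print(normalized_laplacian_spectrum(a)[:5])
```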

4 Limitations of spectral augmentations

Limitations of shallow GNN encoders in capturing spectral information. Multiple previous studies indicate that shallow, rather than deep, GNN encoders can be effective in graph self-supervised learning, which may be a result of the overfitting commonly observed in standard GNN tasks. We have also carried out extensive empirical studies across a range of CG-SSL frameworks and augmentations to support this idea in contrast-based graph SSL. Since GCN is the most commonly applied GNN encoder in CG-SSL [43, 44, 10, 20], we conduct an empirical study on the relationship between the depth of the GCN encoder and learning performance; results are presented in Fig. 1. From these results, we conclude that shallow GCN encoders with 1 or 2 layers usually perform best. Note that this tendency is less clear on graph-level tasks, which can be partially explained by the beneficial oversmoothing phenomenon present in this context [33]. This suggests that while deep encoders may have theoretically better expressive power than shallower ones, the limited benefits of deeper GNN architectures in current CG-SSL practice imply that more layers may not bring significant improvements and can even hinder the quality of the learned graph representations.

Figure 1: Accuracy of CG-SSL vs. number of GCN layers on node and graph classification on four datasets. (a) G-BT on node classification. (b) MVGRL on node classification. (c) G-BT on graph classification. (d) MVGRL on graph classification. We choose two representative datasets for each task, i.e. Cora and CiteSeer for the node-level and PROTEINS and IMDB-BINARY for the graph-level classification. The evaluation protocol, along with dataset details and other experimental settings, are provided in Section 6.1.

By design, most GNN encoders primarily aggregate local neighborhood information through their layered structure, where each layer extends the receptive field by one hop. The depth of a GNN critically determines its ability to integrate information from various parts of the graph. With only a limited number of layers, a GNN’s receptive field is restricted to immediate neighborhoods (e.g., 1-hop or 2-hop distances). This limitation severely constrains the network’s ability to assimilate and leverage broader graph topologies or global features that are essential for encoding the spectral properties of the graph, given the definition of the graph spectrum.

Limited implications for spectral augmentation in CG-SSL. Given the inherent limitations of shallow GNNs in capturing spectral information, the utility of spectral augmentation techniques in graph self-supervised learning settings warrants scrutiny. While spectral augmentation techniques modify the graph’s spectral components (e.g., eigenvalues and eigenvectors) to enrich the training process, their benefits may be limited if the primary encoder—a shallow GNN—cannot effectively process these spectral properties. To formally validate this intuition, we establish the following theoretical analysis.

Theoretical Analysis of InfoNCE Bounds. To better understand the effectiveness of augmentations in shallow GNNs, we derive the following theoretical bounds on InfoNCE loss:

Theorem 1 (InfoNCE Loss Bounds).

Given a graph $\mathcal{G}$ with minimum degree $d_{\min}$ and maximum degree $d_{\max}$, and its augmentation $\mathcal{G}^{\prime}$ with local topological perturbation strength $\delta$, for a $k$-layer GNN with ReLU activation and weight matrices satisfying $\|\mathbf{W}^{(l)}\|_{2}\leq L_{W}$, and assuming that the embeddings are normalized ($\|\mathbf{z}_{v}\|=\|\mathbf{z}^{\prime}_{v}\|=1$), the InfoNCE loss satisfies, with high probability:

$-\log\left(\frac{e^{1/\tau}}{e^{1/\tau}+(n-1)e^{-\epsilon^{\prime}/\tau}}\right)\leq\mathcal{L}_{\text{InfoNCE}}(\mathcal{G},\mathcal{G}^{\prime})\leq-\log\left(\frac{e^{(1-\epsilon^{2}/2)/\tau}}{e^{(1-\epsilon^{2}/2)/\tau}+(n-1)e^{\epsilon^{\prime}/\tau}}\right),$ (3)

where $\epsilon$ is as defined in Lemma 5 and $\epsilon^{\prime}$ is as defined in Lemma 6. Detailed descriptions of the notation can be found in Table 6. The proof of Theorem 1 and all related lemmas can be found in Appendix D.

This theoretical result reveals that, for a given perturbation strength, the InfoNCE loss naturally stays within a narrow interval, regardless of augmentation complexity. This finding helps explain why sophisticated spectral augmentations may not significantly outperform simple ones in shallow architectures, while the potential benefits of coupling spectral augmentations with deeper GNN architectures remain an open question.

Numerical Estimation and Interpretation.

To illustrate the derived bounds, we provide a numerical estimation of the upper and lower bounds based on realistic parameters for 1-layer GNNs (the setting that yields the best performance on several benchmarks presented in Sec. 6). As detailed in Appendix D.5, the parameters were chosen using typical graph augmentation settings and realistic assumptions about $\epsilon$ and $\epsilon^{\prime}$ derived from Lemma 5 and Lemma 6. The resulting bounds on the InfoNCE loss are: lower bound 4.7989, upper bound 5.4497. The gap between the two bounds is $5.4497-4.7989=0.6508$, indicating that the InfoNCE loss remains tightly constrained under these settings. This small interval suggests that shallow GNNs cannot fully utilize complex spectral augmentations, as their expressive capacity limits the potential variation in mutual information captured from augmented views. Our analysis reveals a critical insight: the limited efficacy of spectral augmentation stems from the inability of shallow GNNs to effectively capture and leverage the spectral properties of a graph. Instead, the learning outcomes are more directly influenced by simpler factors, such as the strength of edge perturbations. These findings reinforce the practicality of straightforward augmentation methods like edge dropping and adding, which perform comparably or better in this constrained theoretical setting.
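For transparency, the bounds in Eq. (3) can be evaluated with a few lines of code like the sketch below. The parameter values passed in the usage line are placeholders; the reported 4.7989 / 5.4497 arise only under the exact settings of Appendix D.5.

```python
import math

def infonce_bounds(eps: float, eps_prime: float, tau: float, n: int):
    """Lower and upper bounds on the InfoNCE loss from Theorem 1 (Eq. 3)."""
    lower = -math.log(math.exp(1 / tau) /
                      (math.exp(1 / tau) + (n - 1) * math.exp(-eps_prime / tau)))
    pos = math.exp((1 - eps ** 2 / 2) / tau)
    upper = -math.log(pos / (pos + (n - 1) * math.exp(eps_prime / tau)))
    return lower, upper

# hypothetical parameter values; the ones actually used are listed in Appendix D.5
lo, hi = infonce_bounds(eps=0.5, eps_prime=0.3, tau=0.5, n=512)
print(f"lower = {lo:.4f}, upper = {hi:.4f}, gap = {hi - lo:.4f}")
```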

5 Edge perturbation is all you need

So far, our findings indicate that spectral augmentation is not particularly effective in contrast-based graph self-supervised learning. This suggests that spectral augmentation essentially amounts to random topology perturbation, given the inconsistencies in previous studies [19, 21, 42] and the theoretical insight that a shallow encoder can hardly capture spectral properties. In fact, most spectral augmentations essentially perform edge perturbations on the graph in some targeted direction. Since we have preliminarily concluded that it is difficult for those augmentations to benefit from the spectral properties of graphs, it is natural to hypothesize that the edge perturbation itself is what matters in the learning process.

Consequently, we turn back to Edge Perturbation (EP), a more straightforward and proven method for augmenting graph data. The two primary methods of edge perturbation are DropEdge and AddEdge. We claim that edge perturbation outperforms spectral augmentations and show empirically that neither actually benefits much from spectral information or properties. We also demonstrate that edge perturbation is far more efficient in both time and space for practical applications, where spectral operations are often infeasible. Overall, the following sections provide evidence that simple edge perturbation is not only sufficient but close to optimal in CG-SSL compared to spectral augmentations.

Edge perturbation involves modifying the topology of the graph by either removing or adding edges at random. We detail the two main types of edge perturbation techniques used in our frameworks: edge dropping and edge adding.

DropEdge. Edge dropping is the process of randomly removing a subset of edges from the original graph to create an augmented view. Adopting the definition from [31], let $\mathcal{G}=(\mathbf{A},\mathbf{X})$ be the original graph with adjacency matrix $\mathbf{A}$. We introduce a mask matrix $\mathbf{M}$ of the same dimensions as $\mathbf{A}$, where each entry $M_{ij}$ follows a Bernoulli distribution with parameter $1-p$ ($p$ is the drop rate). The edge-dropped graph $\mathcal{G}^{\prime}$ is then obtained by element-wise multiplication of $\mathbf{A}$ with $\mathbf{M}$ (where $\odot$ denotes the Hadamard product):

$\mathbf{A}^{\prime}=\mathbf{A}\odot\mathbf{M}$ (4)

AddEdge. Edge adding involves randomly adding a subset of new edges to the original graph to create an augmented view. Let $\mathbf{N}$ be an adding matrix of the same dimensions as $\mathbf{A}$, where each entry $N_{ij}$ follows a Bernoulli distribution with parameter $q$ ($q$ is the add rate), and $N_{ij}=0$ for all existing edges in $\mathbf{A}$. The edge-added graph $\mathcal{G}^{\prime\prime}$ is obtained by adding $\mathbf{N}$ to $\mathbf{A}$:

$\mathbf{A}^{\prime\prime}=\mathbf{A}+\mathbf{N}$ (5)

These two operations ensure that the augmented views $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ have modified adjacency matrices $\mathbf{A}^{\prime}$ and $\mathbf{A}^{\prime\prime}$, respectively, which are used to generate contrastive views while preserving the feature matrix $\mathbf{X}$.
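A minimal NumPy sketch of the two operators in Eqs. (4)-(5) follows, written on dense adjacency matrices purely for clarity (practical implementations work on edge lists); the drop rate p and add rate q are the only tunable parameters, and the symmetrization step is our own assumption for undirected graphs.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_edge(adj: np.ndarray, p: float) -> np.ndarray:
    """A' = A ⊙ M with M_ij ~ Bernoulli(1 - p), symmetrized for undirected graphs."""
    mask = rng.random(adj.shape) < (1 - p)
    mask = np.triu(mask, 1)
    mask = mask | mask.T
    return adj * mask

def add_edge(adj: np.ndarray, q: float) -> np.ndarray:
    """A'' = A + N with N_ij ~ Bernoulli(q), restricted to currently absent edges."""
    cand = rng.random(adj.shape) < q
    cand = np.triu(cand, 1)
    cand = cand | cand.T
    new = np.logical_and(cand, adj == 0)
    np.fill_diagonal(new, False)            # no self-loops
    return np.clip(adj + new, 0, 1)

# usage: two contrastive views with the feature matrix X left intact
a = np.triu((rng.random((30, 30)) < 0.1).astype(float), 1)
a = a + a.T
view1, view2 = drop_edge(a, p=0.3), add_edge(a, q=0.1)
```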

5.1 Advantage of edge perturbation over spectral augmentations

Edge perturbation offers several key advantages over spectral augmentation, making it a more effective and practical choice for CG-SSL. Compared to spectrum-related augmentations, it has three major advantages.

Theoretically intuitive. Edge perturbation is inherently simpler and more intuitive. It directly modifies the graph’s structure by adding or removing edges, which aligns well with the shallow GNN encoders’ strength in capturing local neighborhood information. Given that shallow GNNs have a limited receptive field, they are better suited to leveraging the local structural changes introduced by edge perturbation rather than the global changes implied by spectral augmentation.

Significantly better efficiency. Edge perturbation methods such as edge dropping (DropEdge) and edge adding (AddEdge) are computationally efficient. Unlike spectral augmentation, which requires costly eigenvalue and eigenvector computations, edge perturbation can be implemented with basic graph operations. This efficiency translates to faster training and inference times, making it more suitable for large-scale graph datasets and real-time applications. As shown in Table 1, the time and space complexity of spectrum-related calculations are several orders of magnitude higher than those of simple edge perturbation operations. This makes spectrum-related calculations impractical for large datasets typically encountered in real-world applications.

Table 1: Time and space complexity of different methods (Empirical Time is on PubMed dataset)
Method Time Complexity Space Complexity Empirical Time (s/epoch)
Spectrum calculation $O(n^{3})$ $O(n^{2})$ 26.435
DropEdge $O(m)$ $O(m)$ 0.140
AddEdge $O(m)$ $O(m)$ 0.159
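The contrast in Table 1 can be reproduced in spirit with the sketch below; the graph size, density, and resulting timings are illustrative stand-ins rather than the PubMed measurements reported above.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 2000                                            # illustrative size only
a = np.triu((rng.random((n, n)) < 0.002).astype(float), 1)
a = a + a.T

t0 = time.perf_counter()
_ = np.linalg.eigvalsh(np.diag(a.sum(1)) - a)       # full spectrum: O(n^3) time, O(n^2) memory
t_spec = time.perf_counter() - t0

t0 = time.perf_counter()
_ = a * (rng.random(a.shape) < 0.8)                 # DropEdge-style masking (O(m) on edge lists)
t_drop = time.perf_counter() - t0

print(f"spectrum: {t_spec:.3f}s   DropEdge: {t_drop:.3f}s")
```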
Optimal learning performance.

Most importantly and directly, our comprehensive empirical studies indicate that edge perturbation methods lead to significant improvements in model performance, as presented and analyzed in Sec. 6. From the results there, the conclusion can be drawn that the performance of the proposed augmentations is not only better than those of spectral augmentations but also matches or even surpasses the performance of other strong benchmarks.

These advantages position edge perturbation as a robust and efficient method for graph augmentation in self-supervised learning. In the following section, we will present our experimental analysis, demonstrating the accuracy gains achieved through edge perturbation methods.

6 Experiments on SSL performance

6.1 Experimental Settings

Task and Datasets. We conducted extensive experiments for node-level classification on seven datasets: Cora, CiteSeer, PubMed [16], Photo, Computers [32], Coauthor-CS, and Coauthor-Phy. These datasets include various types of graphs, such as citation networks, co-purchase networks, and co-authorship networks. Note that we do not include very large-scale datasets like OGBN [14] due to the high complexity of spectral augmentations: while both DropEdge and AddEdge have linear complexity and can easily run on such datasets, no spectral augmentation can scale to them. Additionally, we carried out graph-level classification on five datasets from the TUDataset collection [24], which include biochemical molecules and social networks. More details of these datasets can be found in Appendix A.

Baselines. We conducted experiments under four CG-SSL frameworks: MVGRL, GRACE, G-BT, and BGRL (introduced in Sec. 3), using DropEdge, AddEdge, and SPAN [19] as augmentation strategies. To the best of our knowledge, there are only three closely related studies on spectral augmentation for CG-SSL: SPAN, SpCo [21], and GASSER [42]. Among them, GASSER has no open-sourced code, so we cannot reproduce its results; instead, we directly adopt the best performance reported in that study wherever a comparison is possible. SpCo is only applicable to node-level tasks, and its implementation is not robust enough to generalize to all node-level datasets and CG-SSL frameworks; we therefore include results for all settings in which it is feasible, namely its original setting and its combination with GRACE. Given these constraints, we select SPAN as the major baseline because it is robust and general across all datasets and experimental settings, and it allows modular plug-and-play integration of edge perturbation methods, enabling a direct evaluation of the effectiveness of spectral augmentations against much simpler alternatives.

Besides the major baselines mentioned above, other related ones are added to clearly and comprehensively benchmark our work. For MVGRL, we also compared its original PPR augmentation. For the node classification task, we use GCA [50], GMI [28], DGI [36], CCA-SSG [46] and SpCo [21] as baselines. For the graph classification task, we use RGCL [18] and GraphCL [43] as baselines. Detailed experimental configurations are in Appendix A.

Evaluation Protocol. We adopt the evaluation and split scheme from previous works [37, 47, 19]. Each GNN encoder is trained on the entire graph with self-supervised learning. After training, we freeze the encoder and extract embeddings for all nodes or graphs. Finally, we train a simple linear classifier using the labels from the training/validation set and test it on the testing set. The classification accuracy on the testing set reflects the quality of the learned representations. For node classification, nodes are randomly divided into 10%/10%/80% splits for training, validation, and testing; for graph classification, graphs are randomly divided into 80%/10%/10% splits for training, validation, and testing.
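Operationally, this protocol boils down to the sketch below, where scikit-learn's logistic regression stands in for the "simple linear classifier" and the random arrays stand in for frozen SSL embeddings, labels, and split indices; the actual experiments use the splits described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(embeddings, labels, train_idx, test_idx):
    """Freeze the encoder, fit a linear classifier on training embeddings,
    and report accuracy on the held-out test embeddings."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[train_idx], labels[train_idx])
    return clf.score(embeddings[test_idx], labels[test_idx])

# toy usage with random data standing in for frozen SSL embeddings and labels
z = np.random.randn(300, 64)
y = np.random.randint(0, 3, 300)
idx = np.random.permutation(300)
print(linear_probe(z, y, idx[:30], idx[60:]))
```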

6.2 Experimental results

We present the prediction accuracy of the node classification and graph classification tasks in Table 2 and Table 3, respectively. Our comprehensive analysis reveals distinct patterns in the effectiveness of different augmentation strategies across these two task types. For node classification, DropEdge consistently achieves the best performance across multiple datasets and CG-SSL frameworks, demonstrating superior robustness and consistency; while AddEdge also achieves competitive accuracy, DropEdge stands out for node-level tasks. In graph classification, AddEdge frequently achieves the best performance across multiple datasets and CG-SSL frameworks, showing superior and more consistent results. The effectiveness of AddEdge in graph classification may be attributed to 'beneficial oversmoothing' as proposed by [33]: in graph-level tasks, the convergence of node features to a common representation aligned with the global output can be advantageous, and by potentially increasing graph density and the proportion of positively curved edges [25], AddEdge might facilitate this beneficial effect. Notably, the results of SPAN, GASSER, and SpCo generally underperform both DropEdge and AddEdge, while these methods also encounter scalability issues on larger datasets and suffer from high training-time overhead.

Table 2: Node classification. Results of baselines marked with '†' are adopted directly from previous works. MVGRL+PPR is the original setting of MVGRL. The best results in each cell are highlighted in grey. The best results overall are highlighted with bold and underline. Metric is accuracy (%).
Model Cora CiteSeer PubMed Photo Computers Coauthor-CS Coauthor-Phy
GCA 83.67 ± 0.44 71.48 ± 0.26 78.87 ± 0.49 92.53 ± 0.16 88.94 ± 0.15 93.10 ± 0.01 95.68 ± 0.05
GMI 83.02 ± 0.33 72.45 ± 0.12 79.94 ± 0.25 90.68 ± 0.17 82.21 ± 0.31 91.08 ± 0.56
DGI 82.34 ± 0.64 71.85 ± 0.74 76.82 ± 0.61 91.61 ± 0.22 83.95 ± 0.47 92.15 ± 0.63 94.51 ± 0.52
CCA-SSG 84.20 ± 0.40 73.10 ± 0.30 81.60 ± 0.40 93.14 ± 0.14 88.74 ± 0.28 93.31 ± 0.22 95.38 ± 0.06
SpCo 83.78 ± 0.70 71.82 ± 1.26 80.86 ± 0.43
GASSER 85.27 ± 0.10 75.41 ± 0.84 83.00 ± 0.61 93.17 ± 0.31 88.67 ± 0.15
MVGRL + PPR 83.53 ± 1.19 71.56 ± 1.89 84.13 ± 0.26 88.47 ± 1.02 89.84 ± 0.12 90.57 ± 0.61 OOM
MVGRL + DropEdge 84.31 ± 1.95 74.85 ± 0.73 85.62 ± 0.45 89.28 ± 0.95 90.43 ± 0.33 93.20 ± 0.81 95.70 ± 0.28
MVGRL + AddEdge 83.21 ± 1.65 73.65 ± 1.60 84.86 ± 1.19 87.15 ± 1.36 87.59 ± 0.53 92.91 ± 0.65 95.33 ± 0.23
MVGRL + SPAN 84.57 ± 0.22 73.65 ± 1.29 85.21 ± 0.81 92.33 ± 0.99 88.75 ± 0.20 92.25 ± 0.76 OOM
MVGRL + GASSER 80.36 ± 0.05 74.48 ± 0.73 80.80 ± 0.19
G-BT + DropEdge 86.51 ± 2.04 72.95 ± 2.46 87.10 ± 1.21 93.55 ± 0.60 88.66 ± 0.46 93.31 ± 0.05 96.06 ± 0.24
G-BT + AddEdge 82.10 ± 1.48 66.36 ± 4.25 85.98 ± 0.81 93.68 ± 0.79 87.81 ± 0.79 91.98 ± 0.66 95.51 ± 0.02
G-BT + SPAN 84.06 ± 2.85 67.46 ± 3.18 85.97 ± 0.41 91.85 ± 0.22 88.73 ± 0.62 92.63 ± 0.07 OOM
GRACE + DropEdge 84.19 ± 2.07 75.44 ± 0.32 87.84 ± 0.37 92.62 ± 0.73 86.67 ± 0.61 93.15 ± 0.23 OOM
GRACE + AddEdge 85.78 ± 0.62 71.65 ± 1.63 85.25 ± 0.47 89.93 ± 0.74 76.74 ± 0.57 92.46 ± 0.25 OOM
GRACE + SPAN 82.84 ± 0.91 67.76 ± 0.21 85.11 ± 0.71 93.72 ± 0.21 88.71 ± 0.06 91.72 ± 1.75 OOM
GRACE + GASSER 84.10 ± 0.26 74.47 ± 0.64 83.97 ± 0.52
GRACE + SpCo 81.61 ± 0.75 70.83 ± 1.47 84.97 ± 1.13
BGRL + DropEdge 83.21 ± 3.29 71.46 ± 0.56 86.28 ± 0.13 92.90 ± 0.69 88.68 ± 0.65 91.58 ± 0.18 95.29 ± 0.19
BGRL + AddEdge 81.49 ± 1.21 69.66 ± 1.34 84.54 ± 0.22 91.85 ± 0.75 86.75 ± 1.15 91.78 ± 0.77 95.29 ± 0.09
BGRL + SPAN 83.33 ± 0.45 66.26 ± 0.92 85.97 ± 0.41 91.72 ± 1.75 88.61 ± 0.59 92.29 ± 0.59 OOM
Table 3: Graph classification. Results of baselines marked with '†' are adopted directly from previous works. MVGRL+PPR is the original setting of MVGRL. The best results in each cell are highlighted in grey. The best results overall are highlighted with bold and underline. Metric is accuracy (%).
Model MUTAG PROTEINS NCI1 IMDB-BINARY IMDB-MULTI
GraphCL 86.80 ± 1.34 74.39 ± 0.45 77.87 ± 0.41 71.14 ± 0.44 48.58 ± 0.67
RGCL 87.66 ± 1.01 75.03 ± 0.43 78.14 ± 1.08 71.85 ± 0.84 49.31 ± 0.42
MVGRL + PPR 90.00 ± 5.40 78.92 ± 1.83 78.78 ± 1.52 71.40 ± 4.17 52.13 ± 1.42
MVGRL + SPAN 93.33 ± 2.22 79.81 ± 2.45 77.56 ± 1.77 75.00 ± 1.09 51.20 ± 1.62
MVGRL + DropEdge 93.33 ± 2.22 78.92 ± 1.33 77.81 ± 1.50 76.40 ± 0.48 51.46 ± 3.02
MVGRL + AddEdge 94.44 ± 3.51 81.25 ± 3.43 77.27 ± 0.71 74.00 ± 2.82 51.73 ± 2.43
G-BT + SPAN 90.00 ± 6.47 80.89 ± 3.22 78.29 ± 1.12 65.60 ± 1.35 45.60 ± 2.13
G-BT + DropEdge 92.59 ± 2.61 77.97 ± 0.42 78.18 ± 0.91 73.33 ± 1.24 49.11 ± 1.25
G-BT + AddEdge 92.59 ± 2.61 80.64 ± 1.68 75.91 ± 0.59 73.33 ± 1.24 48.88 ± 1.13
GRACE + SPAN 90.00 ± 4.15 79.10 ± 2.30 78.49 ± 0.79 70.80 ± 3.96 47.73 ± 1.71
GRACE + DropEdge 88.88 ± 3.51 78.21 ± 1.92 76.93 ± 1.14 71.00 ± 3.75 47.46 ± 3.02
GRACE + AddEdge 92.22 ± 4.44 80.17 ± 2.21 79.97 ± 2.35 71.67 ± 2.36 49.86 ± 4.09
BGRL + SPAN 90.00 ± 4.15 79.28 ± 2.73 78.05 ± 1.62 72.40 ± 2.57 47.46 ± 4.35
BGRL + DropEdge 88.88 ± 4.96 76.60 ± 2.21 76.15 ± 0.43 71.60 ± 3.31 51.47 ± 3.02
BGRL + AddEdge 91.11 ± 5.66 79.46 ± 2.18 76.98 ± 1.40 72.80 ± 2.48 47.77 ± 4.18

6.3 Ablation Study

To validate our findings, we conducted a series of ablation experiments on two exemplar datasets, Cora and MUTAG, representing node- and graph-level tasks, respectively. These ablation studies are crucial to rule out potential confounding variables, such as model architectures and hyperparameters, ensuring that our conclusions about the performance of CG-SSL are robust and comprehensive.

Number of Layers of GCN Encoder. To assess the impact of model depth, we conducted both node-level and graph-level experiments using varying numbers of GCN encoder layers. This analysis rules out the possibility that model depth, rather than the augmentation strategy, drives our claim. As expected, the results, detailed in Appendix E.1, show that deeper encoders generally lead to worse performance, suggesting that excessive model complexity may introduce noise or overfitting and diminish the benefits of spectral information. Therefore, our conclusion still holds.

Type of GNN Encoder. While we initially selected GCN to align with the common protocols in previous studies for a fair comparison, we also explored other GNN architectures to ensure our findings are not specific to GCN alone. To further validate our results, we conducted additional experiments using GAT [37] for both node- and graph-level tasks, as well as GPS [30] for the graph-level task. As reported in Appendix E.2, the performance trends observed with GAT and GPS are consistent with those obtained using GCN. This consistency across different encoder types further supports our conclusion that simple edge perturbation strategies are sufficient, and that spectral augmentation does not significantly enhance performance, regardless of the type of GNN encoder applied.

7 The insignificance of spectral cues

Given the superior empirical performance of edge perturbations reported in Sec. 6, one may still ask whether it results from some spectral cues, as the analyses above are not direct evidence of the insignificance of spectral information. To clarify this, we answer three questions: (1) Can GNN encoders learn spectral information from augmented graphs produced by edge perturbation? (2) Is the spectrum-based design in spectral augmentation necessary? (3) Is spectral information statistically a significant factor in the performance of edge perturbation? We conduct a series of experimental studies to answer these questions in Sec. 7.1, Sec. 7.2, and Appendix E.3, respectively.

7.1 Degeneration of the spectrum after Edge Perturbation (EP)

Here we investigate whether the GNN encoders can learn spectral information from the augmented graph views produced by EP. To this end, we collect the spectra of all augmented graphs produced during the contrastive learning process of the best framework with the optimal parameters in this study, i.e., G-BT + EP with the best drop rate $p$ or add rate $q$, and compute the average spectrum for each representative dataset for both node- and graph-level tasks. We find that although the average spectra of the original graphs are strikingly different, those of the augmented graphs are quite similar for node- and graph-level tasks, respectively. This indicates a certain degree of degeneration of the spectra, as they are no longer easy to separate after EP. Therefore, GNN encoders can hardly learn spectral information that distinguishes the different original graphs from these augmented views. Note that, although we have fixed a particular framework here, this result depends essentially only on the augmentation method. We elaborate on both the node-level and graph-level results in this section.

(Figure 2 panels: (a) average spectra of original graphs; (b) average spectra of augmented graphs; (c) the spectrum of MUTAG; (d) the spectrum of PROTEINS.)
Figure 2: The spectrum distributions of graphs on different graph classification datasets. MUTAG and PROTEINS are chosen as they are representative of the graph classification datasets. OG means original graph and AUG means augmented graph. The augmentation method is AddEdge with the best parameter on the G-BT method.

Node-Level Analysis. Here, we visualize the distributions of the average spectrum of graphs at the node level using histograms. The spectral distribution of each graph is represented by the sorted vector of its eigenvalues; by the average spectrum, we mean the average over the eigenvalue vectors of the augmented graphs. We plot histograms of the different spectra, normalized to show probability density. Note that the eigenvalues are constrained to the range [0, 2], as we adopt the commonly used symmetric normalization. We analyze the spectral distributions of three node classification datasets: Cora, CiteSeer, and Computers, comparing the average spectral properties of both original and augmented graphs. The augmentation method used is DropEdge, applied with the optimal parameters identified for the G-BT method. The visualization results are presented in Fig. 3. Comparing the spectrum distributions of the original graphs in Fig. 3(a), we can easily distinguish the spectra of the three datasets. This contrasts with the highly overlapping average spectra of the augmented graphs across all datasets, indicating the degeneration mentioned above. To support this claim, we also compare the spectra of original and augmented graphs on all three datasets in Fig. 3(c), 3(d), and 3(e), respectively, which shows the obvious changes after edge perturbation.
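Operationally, the node-level comparison reduces to the routine sketched below: compute the sorted normalized-Laplacian eigenvalues of each augmented view, average the vectors, and plot the histogram. The function names, the toy random graph, and the DropEdge-style masking are our own illustrative assumptions, not the released experimental code.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def sorted_spectrum(adj):
    """Sorted eigenvalues of the symmetrically normalized Laplacian."""
    deg = adj.sum(1)
    d = np.zeros_like(deg)
    d[deg > 0] = deg[deg > 0] ** -0.5
    return np.sort(np.linalg.eigvalsh(np.eye(len(adj)) - d[:, None] * adj * d[None, :]))

def average_spectrum(views):
    """Mean over the sorted eigenvalue vectors of all augmented views
    (all views of one node-level dataset share the node count, so shapes match)."""
    return np.mean([sorted_spectrum(v) for v in views], axis=0)

def sym_drop(adj, keep=0.7):
    """DropEdge-style symmetric masking used to generate toy augmented views."""
    m = np.triu(rng.random(adj.shape) < keep, 1)
    return adj * (m | m.T)

# toy data standing in for the views collected during G-BT + DropEdge training
a = np.triu((rng.random((100, 100)) < 0.05).astype(float), 1)
a = a + a.T
plt.hist(average_spectrum([sym_drop(a) for _ in range(10)]),
         bins=50, range=(0, 2), density=True)
```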

Graph-Level analysis. For graph-level analysis, we largely follow the settings of the node-level analysis above. The only difference is that we have multiple original graphs with varying numbers of nodes, leading to eigenvalue vectors of inconsistent dimensions. Therefore, to provide a more detailed comparison of spectral properties at the graph level, we employ Kernel Density Estimation (KDE) [27] to interpolate and smooth the eigenvalue distributions. We compare two groups of graph spectra: each group's spectra are processed to compute their KDEs, and the mean and standard deviation of these KDEs are calculated.
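A sketch of this KDE-based smoothing might look as follows, using scipy's gaussian_kde evaluated on a fixed grid over [0, 2]; the toy random graphs only stand in for a real graph-level dataset and its AddEdge-augmented views.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def spectrum(adj):
    """Eigenvalues of the symmetrically normalized Laplacian of one graph."""
    deg = adj.sum(1)
    d = np.zeros_like(deg)
    d[deg > 0] = deg[deg > 0] ** -0.5
    return np.linalg.eigvalsh(np.eye(len(adj)) - d[:, None] * adj * d[None, :])

def mean_std_kde(graph_adjs, grid=np.linspace(0, 2, 200)):
    """One KDE per graph's eigenvalues, then mean/std of the densities across graphs,
    which sidesteps the varying node counts of graph-level datasets."""
    dens = np.stack([gaussian_kde(spectrum(a))(grid) for a in graph_adjs])
    return grid, dens.mean(0), dens.std(0)

# toy usage: in the paper this is applied to original vs. AddEdge-augmented graphs
graphs = []
for _ in range(5):
    n = int(rng.integers(12, 20))
    a = np.triu((rng.random((n, n)) < 0.3).astype(float), 1)
    graphs.append(a + a.T)
grid, mu, sd = mean_std_kde(graphs)
```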

We analyze the spectral distributions of two graph classification datasets: MUTAG and PROTEINS, comparing the average spectral properties of both original and augmented graphs. The augmentation method used is AddEdge, as it is the better of the two EP methods for graph-level tasks, applied with the optimal add rate identified for the G-BT method.

As in the node-level analysis, Fig. 2(a) and 2(b) show an obvious difference between the average spectra of the original graphs but a significant overlap between those of the augmented graphs, especially when considering the overlap of the bands formed by the standard deviation of the KDEs. Again, this contrast is nontrivial given the striking mismatch between the average spectra of the original and augmented graphs in both datasets, as presented in Fig. 2(c) and 2(d).

(Figure 3 panels: (a) spectrum of original graphs; (b) spectrum of augmented graphs; (c) comparison on Cora; (d) comparison on CiteSeer; (e) comparison on Computers.)
Figure 3: The spectrum distributions of graphs on different node classification datasets. Cora, CiteSeer, and Computers are chosen as they are well representative of all the node classification datasets. OG means original graph and AUG means average augmented graphs. The augmentation method is DropEdge with the best parameter on G-BT method.

7.2 Spectral Perturbation

To further decouple spectral properties from model performance, we introduce the Spectral Perturbation Augmentor (SPA) for a finer-grained anatomy. SPA performs random edge perturbation with an empirically negligible ratio $r_{SPA}$ to transform the input graph $\mathcal{G}$ into a new graph $\mathcal{G}_{SPA}$, such that $\mathcal{G}$ and $\mathcal{G}_{SPA}$ are topologically close to each other while being divergent in the spectral space. The spectral divergence $d_{SPA}$ between $\mathcal{G}$ and $\mathcal{G}_{SPA}$ is measured by the $L_{2}$-distance of the respective spectra. With properly chosen hyperparameters $r_{SPA}$ and $d_{SPA}$, we view the augmented graph $\mathcal{G}_{SPA}$ as a doppelganger of $\mathcal{G}$ that preserves most of the graph proximity while eliminating only the spectral information.
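Since SPA is defined only at this level of generality, the following is one possible minimal realization under our own assumptions: repeatedly sample a small symmetric set of edge flips at ratio r_SPA and accept the first candidate whose spectrum moves at least d_SPA (in L2 distance) away from the input's, so topology stays close while the spectrum diverges. The threshold values and retry count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectrum(adj):
    """Sorted eigenvalues of the symmetrically normalized Laplacian."""
    deg = adj.sum(1)
    d = np.zeros_like(deg)
    d[deg > 0] = deg[deg > 0] ** -0.5
    return np.sort(np.linalg.eigvalsh(np.eye(len(adj)) - d[:, None] * adj * d[None, :]))

def spa(adj, r_spa=0.02, d_spa=0.5, max_tries=100):
    """Flip a small fraction r_spa of node pairs; accept the first candidate whose
    spectral L2 distance to the input exceeds d_spa (hypothetical thresholds)."""
    base = spectrum(adj)
    for _ in range(max_tries):
        flips = np.triu(rng.random(adj.shape) < r_spa, 1)
        flips = flips | flips.T
        cand = np.where(flips, 1 - adj, adj)          # toggle the selected edges
        np.fill_diagonal(cand, 0)
        if np.linalg.norm(spectrum(cand) - base) >= d_spa:
            return cand
    return adj   # give up: return the graph unchanged

# usage sketch: G_SPA = spa(adjacency_of_a_SPAN_augmented_view)
```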

Spectral perturbation on spectral augmentation baselines. SPAN, being a state-of-the-art spectral augmentation algorithm, demonstrated the correlation between graph spectra and model performance through designated perturbation on spectral priors. However, the effectiveness of simple edge perturbation motivated us to further investigate whether this relationship is causal.

Specifically, for each pair of SPAN-augmented graphs $\mathcal{G}^{1},\mathcal{G}^{2}$, we further augment them into $\mathcal{G}^{1}_{SPA},\mathcal{G}^{2}_{SPA}$ with our proposed SPA augmentor. The SPA-augmented training is performed under the same setup as SPAN, with the input graphs replaced by the SPA-augmented graphs $\mathcal{G}_{SPA}$. Results in Fig. 4 show that the effectiveness of graph augmentation is preserved, and in some cases improved, even when the spectral information is destroyed.

SPAN, along with other spectral augmentation algorithms, can be formulated as an optimization over a parameterized 2-step generative process:

$s_{SPAN}\sim p_{\theta}\left(\bm{S}_{SPAN}\mid\bm{\mathcal{G}}_{0}\right),\qquad\mathcal{G}_{SPAN}\sim p_{\phi}\left(\bm{\mathcal{G}}_{SPAN}\mid\bm{S}_{SPAN}\right)$ (6)

Given the property that $\mathcal{G}_{SPA}$ is topologically close to $\mathcal{G}_{SPAN}$, and that the performance function $\mathrm{P}=f(\mathcal{G})$ satisfies $\lim_{\mathcal{G}\rightarrow\mathcal{G}_{SPAN}}\mathrm{P}(\mathcal{G})=\mathrm{P}(\mathcal{G}_{SPAN})$, which indicates continuity around $\mathcal{G}_{SPAN}$, we make the reasonable assertion that $\mathcal{G}_{SPA}$ comes from the same distribution as $\mathcal{G}_{SPAN}$. However, with their spectra enforced to be distant, $\mathcal{G}_{SPA}$ is almost impossible to sample from the same spectral augmentation generative process:

$d_{SPA}\rightarrow\infty\implies p_{\theta}\left(s_{SPA}\mid\bm{\mathcal{G}}_{0}\right)\rightarrow 0\implies p_{\theta,\phi}\left(\mathcal{G}_{SPA}\mid\bm{\mathcal{G}}_{0}\right)\rightarrow 0$ (7)

Although the constrained generative process in Eq. 6 does indicate some degree of causality between the spectral distribution $\bm{S}$ and the spectrally augmented graph distribution $\bm{\mathcal{G}}_{SPAN}$, our experiment challenges a more essential and fundamental aspect of such reasoning: this causality exists only upon pre-defined generative processes and does not intrinsically exist in the graph distributions. Even worse, such a constrained generative process is incapable of modeling the full distribution of $\bm{\mathcal{G}}_{SPAN}$ itself. In our experimental setup, all $\mathcal{G}_{SPA}$ serve as strong counterexamples.

(Figure 4 panels: (a) node classification; (b) graph classification.)
Figure 4: Comparison of SPAN performance before and after applying SPA. Even after the spectrum is severely disrupted, the performance of SPAN remains comparable to that of the original version.

8 Conclusion

In this study, we investigate the effectiveness of spectral augmentation in contrast-based graph self-supervised learning (CG-SSL) frameworks to answer the question: are spectral augmentations necessary in CG-SSL? Our findings indicate that spectral augmentation does not significantly enhance learning efficacy. Instead, simpler edge perturbation techniques, such as random edge dropping for node-level tasks and random edge adding for graph-level tasks, not only compete with but often outperform spectral augmentations. Specifically, we demonstrate that the benefits of spectral augmentation diminish with shallower networks, and that edge perturbations yield superior performance in both node- and graph-level classification tasks. Moreover, GNN encoders struggle to learn spectral information from augmented graphs, and perturbing edges to alter spectral characteristics does not degrade model performance. Furthermore, our theoretical analysis (Theorem 1) shows that the InfoNCE loss stays within narrow bounds under a given perturbation strength, highlighting the relatively limited direct contribution of spectral augmentations compared to simpler edge perturbations, especially in shallow GNNs. These results challenge the current emphasis on spectral augmentation and advocate for more straightforward and effective edge perturbation techniques in CG-SSL, potentially refining the understanding and implementation of graph self-supervised learning.

References

  • [1] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International conference on machine learning, pages 531–540. PMLR, 2018.
  • [2] Piotr Bielak, Tomasz Kajdanowicz, and Nitesh V Chawla. Graph barlow twins: A self-supervised representation learning framework for graphs. Knowledge-Based Systems, 256:109631, 2022.
  • [3] Deyu Bo, Yuan Fang, Yang Liu, and Chuan Shi. Graph contrastive learning with stable and scalable spectral encoding. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [4] Jingyu Chen, Runlin Lei, and Zhewei Wei. PolyGCL: GRAPH CONTRASTIVE LEARNING via learnable spectral polynomial filters. In The Twelfth International Conference on Learning Representations, 2024.
  • [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [6] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count substructures? Advances in neural information processing systems, 33:10383–10395, 2020.
  • [7] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
  • [10] Xiaojun Guo, Yifei Wang, Zeming Wei, and Yisen Wang. Architecture matters: Uncovering implicit mechanisms in graph contrastive learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [11] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
  • [12] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • [13] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International conference on machine learning, pages 4116–4126. PMLR, 2020.
  • [14] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs, 2021.
  • [15] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International conference on machine learning, pages 2323–2332. PMLR, 2018.
  • [16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [17] Taewook Ko, Yoonhyuk Choi, and Chong-Kwon Kim. Universal graph contrastive learning with a novel laplacian perturbation. In Uncertainty in Artificial Intelligence, pages 1098–1108. PMLR, 2023.
  • [18] Sihang Li, Xiang Wang, An Zhang, Yingxin Wu, Xiangnan He, and Tat-Seng Chua. Let invariant rationale discovery inspire graph contrastive learning. In International conference on machine learning, pages 13052–13065. PMLR, 2022.
  • [19] Lu Lin, Jinghui Chen, and Hongning Wang. Spectral augmentation for self-supervised learning on graphs. In The Eleventh International Conference on Learning Representations, 2023.
  • [20] Minhua Lin, Teng Xiao, Enyan Dai, Xiang Zhang, and Suhang Wang. Certifiably robust graph contrastive learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [21] Nian Liu, Xiao Wang, Deyu Bo, Chuan Shi, and Jian Pei. Revisiting graph contrastive learning from the perspective of graph spectrum. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [22] Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and S Yu Philip. Graph self-supervised learning: A survey. IEEE transactions on knowledge and data engineering, 35(6):5879–5900, 2022.
  • [23] Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip S. Yu. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(6):5879–5900, 2023.
  • [24] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020.
  • [25] Khang Nguyen, Nong Minh Hieu, Vinh Duc Nguyen, Nhat Ho, Stanley Osher, and Tan Minh Nguyen. Revisiting over-smoothing and over-squashing using ollivier-ricci curvature. In International Conference on Machine Learning, pages 25956–25979. PMLR, 2023.
  • [26] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29, 2016.
  • [27] Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
  • [28] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pages 259–270, 2020.
  • [29] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1150–1160, 2020.
  • [30] Ladislav Rampášek, Mikhail Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [31] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2020.
  • [32] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • [33] Joshua Southern, Francesco Di Giovanni, Michael Bronstein, and Johannes F Lutzeyer. Understanding virtual nodes: Oversmoothing, oversquashing, and node heterogeneity. arXiv preprint arXiv:2405.13526, 2024.
  • [34] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.
  • [35] Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Rémi Munos, Petar Veličković, and Michal Valko. Bootstrapped representation learning on graphs. In ICLR 2021 Workshop on Geometrical and Topological Representation Learning, 2021.
  • [36] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. ICLR (Poster), 2(3):4, 2019.
  • [37] Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019.
  • [38] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007.
  • [39] Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z Li. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering, 35(4):4216–4235, 2021.
  • [40] Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. Self-supervised learning of graph neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence, 45(2):2412–2429, 2022.
  • [41] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5410–5419, 2017.
  • [42] Kaiqi Yang, Haoyu Han, Wei Jin, and Hui Liu. Augment with care: Enhancing graph contrastive learning with selective spectrum perturbation, 2023.
  • [43] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
  • [44] Yue Yu, Xiao Wang, Mengmei Zhang, Nian Liu, and Chuan Shi. Provable training for graph contrastive learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [45] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pages 12310–12320. PMLR, 2021.
  • [46] Hengrui Zhang, Qitian Wu, Junchi Yan, David Wipf, and Philip S Yu. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34:76–89, 2021.
  • [47] Yifei Zhang, Hao Zhu, Zixing Song, Piotr Koniusz, and Irwin King. Spectral feature augmentation for graph contrastive learning and beyond. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11289–11297, 2023.
  • [48] Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An empirical study of graph contrastive learning. NeurIPS, 2021.
  • [49] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
  • [50] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080, 2021.

Appendix A Dataset and training configuration

Datasets. The node classification datasets used in this paper include the Cora, CiteSeer, and PubMed citation networks [16], as well as the Photo and Computers co-purchase networks [32]. Additionally, we use the Coauthor-CS and Coauthor-Phy co-authorship networks. The statistics of the node-level datasets are presented in Table 4. The graph classification datasets include: the MUTAG dataset, which contains 188 graphs of mutagenic compounds with seven discrete node label types; the NCI1 dataset, which contains compounds tested for their ability to inhibit human tumor cell growth; the PROTEINS dataset, where nodes correspond to secondary structure elements and are connected if they are adjacent in 3D space; and the IMDB-BINARY and IMDB-MULTI movie collaboration datasets, where graphs depict interactions among actors and actresses, with edges denoting their collaborations in films. These movie graphs are labeled according to their genres. The statistics of the graph-level datasets are presented in Table 5. All datasets can be accessed through the PyG library (https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html). All experiments are conducted on 8 NVIDIA A100 GPUs.

Table 4: Statistics of node classification datasets
Dataset #Nodes #Edges #Features #Classes
Cora 2,708 5,429 1,433 7
CiteSeer 3,327 4,732 3,703 6
PubMed 19,717 44,338 500 3
Computers 13,752 245,861 767 10
Photo 7,650 119,081 745 8
Coauthor-CS 18,333 81,894 6,805 15
Coauthor-Phy 34,493 247,962 8,415 5
Table 5: Statistics of graph classification datasets
Dataset #Avg. Nodes #Avg. Edges # Graphs #Classes
MUTAG 17.93 19.71 188 2
PROTEINS 39.06 72.82 1,113 2
NCI1 29.87 32.30 4,110 2
IMDB-BINARY 19.8 96.53 1,000 2
IMDB-MULTI 13.0 65.94 1,500 5

Training configuration. For each CG-SSL framework, we implement it based on [48] (https://github.com/PyGCL/PyGCL). We use the following hyperparameters: the learning rate is set to 5\times 10^{-4}, the node hidden size is set to 512, and the number of GCN encoder layers is chosen from \{1,2\}. For all node classification datasets, training epochs are chosen from \{50,100,150,200,400,1000\}, and for all graph classification datasets, training epochs are chosen from \{20,40,\ldots,200\}. To achieve performance closer to the global optimum, we use randomized search to determine the optimal edge perturbation probability and the SPAN perturbation ratio. For Cora and CiteSeer, the search is conducted one hundred times, and for all other datasets, it is conducted twenty times. For all graph classification datasets, the batch size is set to 128.
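To make the augmentation and search procedure concrete, the following minimal Python sketch illustrates random edge dropping/adding on an edge list and a randomized search over the perturbation probability. Function names such as drop_edges, add_edges, and evaluate_accuracy are illustrative placeholders and not part of any released code.

```python
# Minimal sketch (not the authors' code) of edge perturbation and the randomized
# search over the perturbation probability described above. `evaluate_accuracy`
# is a placeholder that would train a CG-SSL model with the given probability
# and return downstream accuracy.
import numpy as np


def drop_edges(edge_index: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Randomly drop each edge with probability p. edge_index: shape (2, num_edges)."""
    keep = rng.random(edge_index.shape[1]) >= p
    return edge_index[:, keep]


def add_edges(edge_index: np.ndarray, q: float, num_nodes: int,
              rng: np.random.Generator) -> np.ndarray:
    """Add roughly q * num_edges edges between uniformly sampled node pairs
    (possible duplicates/self-loops are ignored for simplicity)."""
    num_new = int(q * edge_index.shape[1])
    new_edges = rng.integers(0, num_nodes, size=(2, num_new))
    return np.concatenate([edge_index, new_edges], axis=1)


def random_search(evaluate_accuracy, num_trials: int = 100, seed: int = 0):
    """Randomized search over the perturbation probability."""
    rng = np.random.default_rng(seed)
    best_p, best_acc = None, -np.inf
    for _ in range(num_trials):
        p = float(rng.uniform(0.05, 0.95))
        acc = evaluate_accuracy(p)  # train with this p, return validation accuracy
        if acc > best_acc:
            best_p, best_acc = p, acc
    return best_p, best_acc
```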

Appendix B Preliminaries of Graph Spectrum and SPAN

Given a graph 𝒢=(𝐀,𝐗)\mathcal{G}=(\mathbf{A},\mathbf{X}) with adjacency matrix 𝐀\mathbf{A} and feature matrix 𝐗\mathbf{X}, we introduce some fundamental concepts related to the graph spectrum.

Laplacian Matrix Spectrum The Laplacian matrix 𝐋\mathbf{L} of a graph is defined as:

𝐋=𝐃𝐀\mathbf{L}=\mathbf{D}-\mathbf{A}

where 𝐃\mathbf{D} is the degree matrix, a diagonal matrix where each diagonal element DiiD_{ii} represents the degree of vertex ii. The eigenvalues of the Laplacian matrix, known as the Laplacian spectrum, are crucial in understanding the graph’s structural properties, such as its connectivity and the number of spanning trees [7].

Normalized Laplacian Spectrum The normalized Laplacian matrix 𝐋norm\mathbf{L}_{\text{norm}} is given by:

𝐋norm=𝐃1/2𝐋𝐃1/2\mathbf{L}_{\text{norm}}=\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}

The eigenvalues of the normalized Laplacian matrix, referred to as the normalized Laplacian spectrum, are often used in spectral clustering [38] and other applications where normalization is necessary to account for varying vertex degrees.
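For concreteness, the following short NumPy sketch (illustrative only) computes both spectra from a dense adjacency matrix.

```python
# Sketch computing the Laplacian and normalized Laplacian spectra of a graph
# from its dense adjacency matrix.
import numpy as np


def laplacian_spectra(A: np.ndarray):
    """Return the eigenvalues of L = D - A and L_norm = D^{-1/2} L D^{-1/2}."""
    d = A.sum(axis=1)                       # node degrees
    D = np.diag(d)
    L = D - A                               # combinatorial Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(d, 1e-12, None)))
    L_norm = d_inv_sqrt @ L @ d_inv_sqrt    # normalized Laplacian
    return np.linalg.eigvalsh(L), np.linalg.eigvalsh(L_norm)


# Example: a 4-cycle; normalized Laplacian eigenvalues lie in [0, 2].
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
eigs_L, eigs_Lnorm = laplacian_spectra(A)
```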

SPAN The core idea of SPAN is to maximize the consistency of the representations of two views that have a large spectral distance, thereby filtering out edges that are sensitive to the spectrum, such as edges between clusters. By focusing on structures that are more stable with respect to the spectrum, the objective of SPAN can be formulated as:

max𝓣1,𝓣2𝒮eig(𝐋1)eig(𝐋2)22\max_{\bm{\mathcal{T}}_{1},\bm{\mathcal{T}}_{2}\in\mathcal{S}}\left\|\operatorname{eig}\left(\mathbf{L}_{1}\right)-\operatorname{eig}\left(\mathbf{L}_{2}\right)\right\|_{2}^{2} (8)

where the transformations 𝒯1\mathcal{T}_{1} and 𝒯2\mathcal{T}_{2} convert 𝐀\mathbf{A} to 𝐀1\mathbf{A}_{1} and 𝐀2\mathbf{A}_{2}, respectively, producing the normalized Laplacian matrices 𝐋1\mathbf{L}_{1} and 𝐋2\mathbf{L}_{2}. Here, 𝒮\mathcal{S} represents the set of all possible transformations, and the graph spectrum can be calculated by eig(𝐋)\operatorname{eig}\left(\mathbf{L}\right).
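As an illustration of this objective, the sketch below scores a given pair of augmented adjacency matrices by the squared L2 distance between their normalized Laplacian spectra; it only evaluates the objective in Eq. (8) and does not perform SPAN's search over transformations.

```python
# Sketch of scoring a pair of candidate augmentations by the spectral distance
# in Eq. (8). SPAN itself searches over transformations to maximize this value.
import numpy as np


def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(d, 1e-12, None))
    L = np.diag(d) - A
    return d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]


def spectral_distance(A1: np.ndarray, A2: np.ndarray) -> float:
    """Squared L2 distance between the sorted normalized-Laplacian spectra."""
    e1 = np.linalg.eigvalsh(normalized_laplacian(A1))   # sorted ascending
    e2 = np.linalg.eigvalsh(normalized_laplacian(A2))
    return float(np.sum((e1 - e2) ** 2))
```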

Appendix C Objective function of GCL framework

Here we briefly introduce the objective functions of the four CG-SSL frameworks used in this paper; for a more detailed discussion of objective functions, including those of other graph contrastive learning and graph self-supervised learning frameworks, we refer readers to the survey papers [40, 39, 22]. We use the following notations:

  • pϕp_{\phi}: Projection head parameterized by ϕ\phi.

  • 𝐡i\mathbf{h}_{i}, 𝐡j\mathbf{h}_{j}: Representations of the graph nodes.

  • 𝐡n\mathbf{h}_{n}^{\prime}: Representations of negative sample nodes.

  • 𝒫\mathcal{P}: Distribution of positive sample pairs.

  • 𝒫~N\widetilde{\mathcal{P}}^{N}: Distribution of negative sample pairs.

  • \mathcal{B}: Set of nodes in a batch.

  • 𝐇(𝟏)\mathbf{H}^{(\mathbf{1})}, 𝐇(𝟐)\mathbf{H}^{(\mathbf{2})}: Node representation matrices of two views.

GRACE uses the InfoNCE loss to maximize the similarity between positive pairs and minimize the similarity between negative pairs. InfoNCE encourages representations of positive pairs (generated from the same node via data augmentation) to be similar while pushing apart the representations of negative pairs (from different nodes). The loss function \mathcal{L}_{\text{NCE}} is defined as:

NCE (pϕ(𝐡i,𝐡j))=𝔼𝒫×𝒫~N[logepϕ(𝐡i,𝐡j)epϕ(𝐡i,𝐡j)+nNepϕ(𝐡i,𝐡n)]\mathcal{L}_{\text{NCE }}\left(p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right)=-\mathbb{E}_{\mathcal{P}\times\widetilde{\mathcal{P}}^{N}}\left[\log\frac{e^{p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}}{e^{p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)}+\sum_{n\in N}e^{p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{n}^{\prime}\right)}}\right] (9)
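A minimal NumPy sketch of this objective is given below; for brevity it uses only cross-view negatives, whereas GRACE additionally includes intra-view negatives.

```python
# Minimal sketch of the InfoNCE objective of Eq. (9): row i of each view is the
# same node (positive pair); all other rows of the second view act as negatives.
import numpy as np


def info_nce(Z1: np.ndarray, Z2: np.ndarray, tau: float = 0.5) -> float:
    """Z1, Z2: (n, d) embeddings of the two views."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    sim = Z1 @ Z2.T / tau                              # pairwise cosine sims / tau
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # average over positive pairs
```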

MVGRL employs the Jensen-Shannon Estimator (JSE) for contrastive learning, which focuses on the mutual information between positive and negative pairs. JSE maximizes the mutual information between positive pairs and minimizes it for negative pairs, thus improving the alignment and uniformity of the representations. The loss function \mathcal{L}_{\text{JSE}} is defined as:

JSE (pϕ(𝐡i,𝐡j))=𝔼𝒫×𝒫~[log(1pϕ(𝐡i,𝐡j))]𝔼𝒫[log(pϕ(𝐡i,𝐡j))]\mathcal{L}_{\text{JSE }}\left(p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right)=\mathbb{E}_{\mathcal{P}\times\tilde{\mathcal{P}}}\left[\log\left(1-p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}^{\prime}\right)\right)\right]-\mathbb{E}_{\mathcal{P}}\left[\log\left(p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right)\right] (10)

BGRL utilizes a loss similar to BYOL, which does not require negative samples. It uses two networks, an online network and a target network, to predict one view from the other:

BYOL (pϕ(𝐡i,𝐡j))=𝔼𝒫×𝒫[22[pϕ(𝐡i)]T𝐡jpϕ(𝐡i)𝐡j]\mathcal{L}_{\text{BYOL }}\left(p_{\phi}\left(\mathbf{h}_{i},\mathbf{h}_{j}\right)\right)=\mathbb{E}_{\mathcal{P}\times\mathcal{P}}\left[2-2\cdot\frac{\left[p_{\phi}\left(\mathbf{h}_{i}\right)\right]^{T}\mathbf{h}_{j}}{\left\|p_{\phi}\left(\mathbf{h}_{i}\right)\right\|\left\|\mathbf{h}_{j}\right\|}\right] (11)

G-BT applies the Barlow Twins’ loss to reduce redundancy in the learned representations, thereby ensuring better generalization:

BT (𝐇(𝟏),𝐇(𝟐))=\displaystyle\mathcal{L}_{\text{BT }}\left(\mathbf{H}^{(\mathbf{1})},\mathbf{H}^{(\mathbf{2})}\right)= 𝔼𝒫||[a(1i𝐇ia(1)𝐇ia(2)𝐇ia(1)𝐇ia(2))2\displaystyle\mathbb{E}_{\mathcal{B}\sim\mathcal{P}|\mathcal{B}|}\left[\sum_{a}\left(1-\frac{\sum_{i\in\mathcal{B}}\mathbf{H}_{ia}^{(1)}\mathbf{H}_{ia}^{(2)}}{\left\|\mathbf{H}_{ia}^{(1)}\right\|\left\|\mathbf{H}_{ia}^{(2)}\right\|}\right)^{2}\right. (12)
+λaba(i𝐇ia(1)𝐇ib(2)𝐇ia(1)𝐇ib(2))2].\displaystyle\left.+\lambda\sum_{a}\sum_{b\neq a}\left(\frac{\sum_{i\in\mathcal{B}}\mathbf{H}_{ia}^{(1)}\mathbf{H}_{ib}^{(2)}}{\left\|\mathbf{H}_{ia}^{(1)}\right\|\left\|\mathbf{H}_{ib}^{(2)}\right\|}\right)^{2}\right].
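The sketch below implements the redundancy-reduction idea in its standard Barlow Twins form (standardize each embedding dimension over the batch, then penalize the cross-correlation matrix); it is an illustrative approximation of Eq. (12) rather than its exact per-dimension cosine formulation.

```python
# Sketch of a redundancy-reduction loss in the standard Barlow Twins form:
# the diagonal of the cross-correlation matrix is pulled to 1 and the
# off-diagonal entries are pushed to 0.
import numpy as np


def barlow_twins_loss(H1: np.ndarray, H2: np.ndarray, lam: float = 5e-3) -> float:
    """H1, H2: (batch, dim) embeddings of the two views."""
    n = H1.shape[0]
    H1 = (H1 - H1.mean(0)) / (H1.std(0) + 1e-12)   # standardize each dimension
    H2 = (H2 - H2.mean(0)) / (H2.std(0) + 1e-12)
    C = H1.T @ H2 / n                              # cross-correlation matrix (dim, dim)
    on_diag = np.sum((np.diag(C) - 1.0) ** 2)
    off_diag = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)
    return float(on_diag + lam * off_diag)
```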

Appendix D Theoretical analysis

D.1 Notations

Table 6: Notations and Definitions
Notation Definition
𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) Original undirected graph, where 𝒱\mathcal{V} is the set of nodes and \mathcal{E} is the set of edges.
𝒢\mathcal{G}^{\prime} Perturbed graph obtained from 𝒢\mathcal{G} via local topological perturbations.
n=|𝒱|n=|\mathcal{V}| Number of nodes in the graph.
v𝒱v\in\mathcal{V} A node in the graph.
𝒢vk\mathcal{G}_{v}^{k} kk-hop subgraph around node vv in 𝒢\mathcal{G}.
(𝒢vk)\mathcal{E}(\mathcal{G}_{v}^{k}) Set of edges in the subgraph 𝒢vk\mathcal{G}_{v}^{k}.
|v|=|(𝒢vk)||\mathcal{E}_{v}|=|\mathcal{E}(\mathcal{G}_{v}^{k})| Number of edges in the subgraph 𝒢vk\mathcal{G}_{v}^{k}.
nvn_{v} Number of nodes in the subgraph 𝒢vk\mathcal{G}_{v}^{k}.
dvd_{v}, dvd^{\prime}_{v} Degrees of node vv in 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime}, respectively.
dmind_{\min}, dmaxd_{\max} Minimum and maximum degrees in the kk-hop subgraphs.
𝐀\mathbf{A}, 𝐀v\mathbf{A}_{v} Adjacency matrix of 𝒢\mathcal{G} and the adjacency matrix of 𝒢vk\mathcal{G}_{v}^{k}, respectively.
𝐀\mathbf{A}^{\prime}, 𝐀v\mathbf{A^{\prime}}_{v} Adjacency matrix of 𝒢\mathcal{G^{\prime}} and the adjacency matrix of 𝒢vk\mathcal{G^{\prime}}_{v}^{k}, respectively.
𝐃\mathbf{D}, 𝐃v\mathbf{D}_{v} Degree matrix of 𝒢\mathcal{G} and the degree matrix of 𝒢vk\mathcal{G}_{v}^{k}, respectively.
𝐃\mathbf{D}^{\prime}, 𝐃v\mathbf{D^{\prime}}_{v} Degree matrix of 𝒢\mathcal{G^{\prime}} and the degree matrix of 𝒢vk\mathcal{G^{\prime}}_{v}^{k}, respectively.
𝐀~\tilde{\mathbf{A}}^{\prime} and 𝐀~v\tilde{\mathbf{A}^{\prime}}_{v} Normalized adjacency matrices of 𝒢\mathcal{G^{\prime}} and 𝒢vk\mathcal{G^{\prime}}_{v}^{k}, respectively.
𝐗n×d0\mathbf{X}\in\mathbb{R}^{n\times d_{0}} Node feature matrix, where d0d_{0} is the input feature dimension.
kk Number of layers in the GNN and the size of the kk-hop neighborhood.
𝐇(l)n×dl\mathbf{H}^{(l)}\in\mathbb{R}^{n\times d_{l}} Hidden representations at layer ll in the GNN.
𝐖(l)dl1×dl\mathbf{W}^{(l)}\in\mathbb{R}^{d_{l-1}\times d_{l}} Weight matrix at layer ll in the GNN, with 𝐖(l)2LW\left\|\mathbf{W}^{(l)}\right\|_{2}\leq L_{W}.
LWL_{W} Upper bound on the spectral norm of the weight matrices.
𝐡vdk\mathbf{h}_{v}\in\mathbb{R}^{d_{k}} Embedding of node vv after kk GNN layers in 𝒢\mathcal{G}.
𝐡vdk\mathbf{h}^{\prime}_{v}\in\mathbb{R}^{d_{k}} Embedding of node vv after kk GNN layers in 𝒢\mathcal{G}^{\prime}.
𝐏dk×d\mathbf{P}\in\mathbb{R}^{d_{k}\times d} Projection matrix applied to node embeddings to obtain final representations.
𝐳v=𝐏𝐡vd\mathbf{z}_{v}=\mathbf{P}\mathbf{h}_{v}\in\mathbb{R}^{d} Final embedding of node vv in 𝒢\mathcal{G} after projection.
𝐳v=𝐏𝐡vd\mathbf{z}^{\prime}_{v}=\mathbf{P}\mathbf{h}^{\prime}_{v}\in\mathbb{R}^{d} Final embedding of node vv in 𝒢\mathcal{G}^{\prime} after projection.
dd Embedding dimension of the final node representations.
τ\tau Temperature parameter in the InfoNCE loss.
sim(𝐮,𝐯)\text{sim}(\mathbf{u},\mathbf{v}) Cosine similarity between vectors 𝐮\mathbf{u} and 𝐯\mathbf{v}, defined as sim(𝐮,𝐯)=𝐮𝐯𝐮𝐯\text{sim}(\mathbf{u},\mathbf{v})=\frac{\mathbf{u}^{\top}\mathbf{v}}{\left\|\mathbf{u}\right\|\left\|\mathbf{v}\right\|}.

D.2 Definitions and Preliminaries

Definition 1 (Local Topological Perturbation).

For a kk-layer GNN, the local topological perturbation strength δ\delta is defined as the maximum fraction of edge changes in any node’s kk-hop neighborhood:

δ=maxv𝒱|(𝒢vk)(𝒢v)k||(𝒢vk)|,\delta=\max_{v\in\mathcal{V}}\frac{|\mathcal{E}(\mathcal{G}_{v}^{k})\triangle\mathcal{E}(\mathcal{G}^{\prime}_{v}{}^{k})|}{|\mathcal{E}(\mathcal{G}_{v}^{k})|}, (1)

where \triangle denotes the symmetric difference of edge sets, and 𝒢\mathcal{G}^{\prime} is the perturbed graph.
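A direct (if inefficient) way to estimate δ is sketched below with NetworkX; local_perturbation_strength is an illustrative helper, and the two graphs are assumed to be undirected with the same node set.

```python
# Sketch of estimating the local perturbation strength delta of Definition 1:
# for every node, compare the edge sets of its k-hop subgraphs in the original
# and perturbed graphs and take the maximum relative symmetric difference.
import networkx as nx


def local_perturbation_strength(G: nx.Graph, G_pert: nx.Graph, k: int) -> float:
    delta = 0.0
    for v in G.nodes():
        ball = nx.ego_graph(G, v, radius=k)          # k-hop subgraph in G
        ball_p = nx.ego_graph(G_pert, v, radius=k)   # k-hop subgraph in G'
        E = {frozenset(e) for e in ball.edges()}
        E_p = {frozenset(e) for e in ball_p.edges()}
        if E:                                        # skip neighborhoods with no edges
            delta = max(delta, len(E ^ E_p) / len(E))
    return delta
```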

Definition 2 (InfoNCE Loss).

For a pair of graphs (𝒢,𝒢)(\mathcal{G},\mathcal{G}^{\prime}), the InfoNCE loss is defined as:

InfoNCE(𝒢,𝒢)=1nv𝒱logexp(sim(𝐳v,𝐳v)/τ)u𝒱exp(sim(𝐳v,𝐳u)/τ)\mathcal{L}_{\text{InfoNCE}}(\mathcal{G},\mathcal{G}^{\prime})=-\frac{1}{n}\sum_{v\in\mathcal{V}}\log\frac{\exp\left(\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)/\tau\right)}{\sum_{u\in\mathcal{V}}\exp\left(\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\right)/\tau\right)} (2)

where 𝐳v\mathbf{z}_{v} and 𝐳v\mathbf{z}^{\prime}_{v} are embeddings of node vv in 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime} respectively, sim(,)\text{sim}(\cdot,\cdot) is cosine similarity, τ\tau is a temperature parameter, and n=|𝒱|n=|\mathcal{V}|.

D.3 Lemmas

Lemma 1 (Adjacency Matrix Perturbation).

Given perturbation strength δ\delta, the change in adjacency matrices of the kk-hop subgraph around any node vv satisfies:

𝐀v𝐀vF2δ|v|,\left\|\mathbf{A}_{v}-\mathbf{A}^{\prime}_{v}\right\|_{F}\leq\sqrt{2\delta|\mathcal{E}_{v}|}, (3)

where \mathcal{G}_{v}^{k} denotes the k-hop subgraph around node v in the original graph \mathcal{G}, with adjacency matrix \mathbf{A}_{v} and degree matrix \mathbf{D}_{v}, and |\mathcal{E}_{v}| is the number of edges in \mathcal{G}_{v}^{k}. Analogous notation is used for the perturbed subgraph.

Proof.

Each edge change affects two symmetric entries in the adjacency matrix 𝐀v𝐀v\mathbf{A}_{v}-\mathbf{A}^{\prime}_{v}, each with magnitude 11 (since edges are undirected). Let mm be the number of edge changes within 𝒢vk\mathcal{G}_{v}^{k}. Then the Frobenius norm of the difference is:

𝐀v𝐀vF2=i,j|Av,ijAv,ij|2=2m.\left\|\mathbf{A}_{v}-\mathbf{A}^{\prime}_{v}\right\|_{F}^{2}=\sum_{i,j}\left|A_{v,ij}-A^{\prime}_{v,ij}\right|^{2}=2m. (4)

Since the number of edge changes mδ|v|m\leq\delta|\mathcal{E}_{v}|, we have:

𝐀v𝐀vF2δ|v|.\left\|\mathbf{A}_{v}-\mathbf{A}^{\prime}_{v}\right\|_{F}\leq\sqrt{2\delta|\mathcal{E}_{v}|}. (5)

Lemma 2 (Degree Matrix Change).

For any node vv in 𝒢vk\mathcal{G}_{v}^{k}:

|dvdv|δdv.\left|d_{v}-d^{\prime}_{v}\right|\leq\delta d_{v}. (6)

Moreover, for the degree matrices:

𝐃v1/2𝐃v1/2Fδnv2dmin(1δ)3/2,\left\|\mathbf{D}_{v}^{-1/2}-{\mathbf{D}_{v}^{\prime}}^{-1/2}\right\|_{F}\leq\frac{\delta\sqrt{n_{v}}}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}, (7)

and

𝐃v1/2𝐃v1/22δ2dmin(1δ)3/2,\left\|\mathbf{D}_{v}^{-1/2}-{\mathbf{D}_{v}^{\prime}}^{-1/2}\right\|_{2}\leq\frac{\delta}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}, (8)

where nvn_{v} is the number of nodes in the kk-hop subgraph, and dmind_{\min} is the minimum degree in the subgraph.

Proof.

The degree of a node vv changes by at most δdv\delta d_{v} due to the perturbation:

|dvdv|δdv.\left|d_{v}-d^{\prime}_{v}\right|\leq\delta d_{v}. (9)

Consider the function f(x)=x1/2f(x)=x^{-1/2}, which is convex for x>0x>0. Using the mean value theorem, for some ξv\xi_{v} between dvd_{v} and dvd^{\prime}_{v}:

dv1/2dv1/2=f(ξv)(dvdv)=12ξv3/2(dvdv).d_{v}^{-1/2}-{d^{\prime}_{v}}^{-1/2}=f^{\prime}(\xi_{v})(d_{v}-d^{\prime}_{v})=-\frac{1}{2}\xi_{v}^{-3/2}(d_{v}-d^{\prime}_{v}). (10)

Since d^{\prime}_{v}\geq(1-\delta)d_{v}, we have \xi_{v}\geq(1-\delta)d_{v}. Thus,

\left|d_{v}^{-1/2}-{d^{\prime}_{v}}^{-1/2}\right|\leq\frac{\delta d_{v}}{2\left((1-\delta)d_{v}\right)^{3/2}}=\frac{\delta}{2(1-\delta)^{3/2}\sqrt{d_{v}}}. (11)

Since d_{v}\geq d_{\min}, we have:

\left|d_{v}^{-1/2}-{d^{\prime}_{v}}^{-1/2}\right|\leq\frac{\delta}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}. (12)

The Frobenius norm is computed as:

𝐃v1/2𝐃v1/2F2=v|dv1/2dv1/2|2nv(δ2dmin(1δ)3/2)2.\left\|\mathbf{D}_{v}^{-1/2}-{\mathbf{D}_{v}^{\prime}}^{-1/2}\right\|_{F}^{2}=\sum_{v}\left|d_{v}^{-1/2}-{d^{\prime}_{v}}^{-1/2}\right|^{2}\leq n_{v}\left(\frac{\delta}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}\right)^{2}. (13)

Therefore,

𝐃v1/2𝐃v1/2Fδnv2dmin(1δ)3/2.\left\|\mathbf{D}_{v}^{-1/2}-{\mathbf{D}_{v}^{\prime}}^{-1/2}\right\|_{F}\leq\frac{\delta\sqrt{n_{v}}}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}. (14)

Similarly, the spectral norm bound is:

𝐃v1/2𝐃v1/22δ2dmin(1δ)3/2.\left\|\mathbf{D}_{v}^{-1/2}-{\mathbf{D}_{v}^{\prime}}^{-1/2}\right\|_{2}\leq\frac{\delta}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}. (15)

Lemma 3 (Bounded Change in Normalized Adjacency Matrix).

Given a graph 𝒢\mathcal{G} with minimum degree dmind_{\min}, maximum degree dmaxd_{\max}, and nvn_{v} nodes in the kk-hop subgraph, and its perturbation 𝒢\mathcal{G}^{\prime} with local topological perturbation strength δ\delta, the change in the normalized adjacency matrix for any kk-hop subgraph is bounded by:

𝐀~v𝐀~vFnvdmaxdmin(δ+δ(1δ)3/2).\left\|\tilde{\mathbf{A}}_{v}-\tilde{\mathbf{A}}^{\prime}_{v}\right\|_{F}\leq\frac{\sqrt{n_{v}d_{\max}}}{d_{\min}}\left(\sqrt{\delta}+\frac{\delta}{(1-\delta)^{3/2}}\right). (16)
Proof.

We start by noting that the normalized adjacency matrix is given by 𝐀~v=𝐃v1/2𝐀v𝐃v1/2\tilde{\mathbf{A}}_{v}=\mathbf{D}_{v}^{-1/2}\mathbf{A}_{v}\mathbf{D}_{v}^{-1/2}. The difference between the normalized adjacency matrices is:

𝐀~v𝐀~v=𝐃v1/2𝐀v𝐃v1/2𝐃v1/2𝐀v𝐃v1/2.\tilde{\mathbf{A}}_{v}-\tilde{\mathbf{A}}^{\prime}_{v}=\mathbf{D}_{v}^{-1/2}\mathbf{A}_{v}\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{\prime-1/2}. (17)

Add and subtract 𝐃v1/2𝐀v𝐃v1/2\mathbf{D}_{v}^{-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{-1/2}:

𝐀~v𝐀~v=𝐃v1/2(𝐀v𝐀v)𝐃v1/2+(𝐃v1/2𝐀v𝐃v1/2𝐃v1/2𝐀v𝐃v1/2).\tilde{\mathbf{A}}_{v}-\tilde{\mathbf{A}}^{\prime}_{v}=\mathbf{D}_{v}^{-1/2}(\mathbf{A}_{v}-\mathbf{A}_{v}^{\prime})\mathbf{D}_{v}^{-1/2}+(\mathbf{D}_{v}^{-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{\prime-1/2}). (18)

Let 𝐄=𝐃v1/2𝐀v𝐃v1/2𝐃v1/2𝐀v𝐃v1/2\mathbf{E}=\mathbf{D}_{v}^{-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{\prime-1/2}. Then,

𝐀~v𝐀~v=𝐃v1/2(𝐀v𝐀v)𝐃v1/2+𝐄.\tilde{\mathbf{A}}_{v}-\tilde{\mathbf{A}}^{\prime}_{v}=\mathbf{D}_{v}^{-1/2}(\mathbf{A}_{v}-\mathbf{A}_{v}^{\prime})\mathbf{D}_{v}^{-1/2}+\mathbf{E}. (19)

First, we bound the first term:

𝐃v1/2(𝐀v𝐀v)𝐃v1/2F𝐃v1/222𝐀v𝐀vF1dmin2δ|v|.\left\|\mathbf{D}_{v}^{-1/2}(\mathbf{A}_{v}-\mathbf{A}_{v}^{\prime})\mathbf{D}_{v}^{-1/2}\right\|_{F}\leq\left\|\mathbf{D}_{v}^{-1/2}\right\|_{2}^{2}\left\|\mathbf{A}_{v}-\mathbf{A}_{v}^{\prime}\right\|_{F}\leq\frac{1}{d_{\min}}\sqrt{2\delta|\mathcal{E}_{v}|}. (20)

Since |v|12nvdmax|\mathcal{E}_{v}|\leq\frac{1}{2}n_{v}d_{\max}, we have:

2δ|v|2δ(12nvdmax)=δnvdmax.\sqrt{2\delta|\mathcal{E}_{v}|}\leq\sqrt{2\delta\left(\frac{1}{2}n_{v}d_{\max}\right)}=\sqrt{\delta n_{v}d_{\max}}. (21)

Thus,

𝐃v1/2(𝐀v𝐀v)𝐃v1/2Fδnvdmaxdmin.\left\|\mathbf{D}_{v}^{-1/2}(\mathbf{A}_{v}-\mathbf{A}_{v}^{\prime})\mathbf{D}_{v}^{-1/2}\right\|_{F}\leq\frac{\sqrt{\delta n_{v}d_{\max}}}{d_{\min}}. (22)

Next, we bound 𝐄F\left\|\mathbf{E}\right\|_{F}. Note that:

𝐄=(𝐃v1/2𝐃v1/2)𝐀v𝐃v1/2+𝐃v1/2𝐀v(𝐃v1/2𝐃v1/2).\mathbf{E}=(\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2})\mathbf{A}_{v}^{\prime}\mathbf{D}_{v}^{-1/2}+\mathbf{D}_{v}^{\prime-1/2}\mathbf{A}_{v}^{\prime}(\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}). (23)

Therefore,

𝐄F2𝐃v1/2𝐃v1/22𝐀vF𝐃v1/22.\left\|\mathbf{E}\right\|_{F}\leq 2\left\|\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}\right\|_{2}\left\|\mathbf{A}_{v}^{\prime}\right\|_{F}\left\|\mathbf{D}_{v}^{-1/2}\right\|_{2}. (24)

Since 𝐀vF2|v|nvdmax\left\|\mathbf{A}_{v}^{\prime}\right\|_{F}\leq\sqrt{2|\mathcal{E}_{v}|}\leq\sqrt{n_{v}d_{\max}}, 𝐃v1/221dmin\left\|\mathbf{D}_{v}^{-1/2}\right\|_{2}\leq\frac{1}{\sqrt{d_{\min}}}, and using the bound from Lemma 2 for 𝐃v1/2𝐃v1/22\left\|\mathbf{D}_{v}^{-1/2}-\mathbf{D}_{v}^{\prime-1/2}\right\|_{2}, we have:

𝐄F2×δ2dmin(1δ)3/2×nvdmax×1dmin=δnvdmaxdmin(1δ)3/2.\left\|\mathbf{E}\right\|_{F}\leq 2\times\frac{\delta}{2\sqrt{d_{\min}}(1-\delta)^{3/2}}\times\sqrt{n_{v}d_{\max}}\times\frac{1}{\sqrt{d_{\min}}}=\frac{\delta\sqrt{n_{v}d_{\max}}}{d_{\min}(1-\delta)^{3/2}}. (25)

Combining both terms:

𝐀~v𝐀~vFδnvdmaxdmin+δnvdmaxdmin(1δ)3/2=nvdmaxdmin(δ+δ(1δ)3/2).\left\|\tilde{\mathbf{A}}_{v}-\tilde{\mathbf{A}}^{\prime}_{v}\right\|_{F}\leq\frac{\sqrt{\delta n_{v}d_{\max}}}{d_{\min}}+\frac{\delta\sqrt{n_{v}d_{\max}}}{d_{\min}(1-\delta)^{3/2}}=\frac{\sqrt{n_{v}d_{\max}}}{d_{\min}}\left(\sqrt{\delta}+\frac{\delta}{(1-\delta)^{3/2}}\right). (26)
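As a rough numerical sanity check (not part of the proof), the sketch below compares the actual Frobenius change of the normalized adjacency matrix with the bound of Lemma 3 on a small random graph, treating the whole graph as a single k-hop neighborhood and using the global fraction of changed edges as a proxy for δ.

```python
# Illustrative comparison of the actual change in the normalized adjacency
# matrix against the A * B expression of Lemma 3 on a random graph.
import numpy as np


def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    d = A.sum(1)
    inv_sqrt = 1.0 / np.sqrt(np.clip(d, 1e-12, None))
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]


rng = np.random.default_rng(0)
n = 60
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T                                          # random undirected graph
A_pert = A.copy()
for i, j in rng.integers(0, n, size=(10, 2)):        # flip a few node pairs
    if i != j:
        A_pert[i, j] = A_pert[j, i] = 1.0 - A_pert[i, j]

delta = np.abs(A - A_pert).sum() / A.sum()           # fraction of changed edges
d_min, d_max = A.sum(1).min(), A.sum(1).max()
actual = np.linalg.norm(normalized_adjacency(A) - normalized_adjacency(A_pert), "fro")
bound = np.sqrt(n * d_max) / d_min * (np.sqrt(delta) + delta / (1 - delta) ** 1.5)
print(f"actual change: {actual:.3f}, Lemma 3 bound: {bound:.3f}")
```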

Lemma 4 (GNN Output Difference Bound).

For a kk-layer Graph Neural Network (GNN) fθf_{\theta} with ReLU activation functions and weight matrices satisfying 𝐖(l)2LW\left\|\mathbf{W}^{(l)}\right\|_{2}\leq L_{W} for all layers ll, given two graphs 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime} with local topological perturbation strength δ\delta, the difference in GNN outputs for any node vv is bounded by:

𝐡v𝐡vk(ALW)kB𝐗2,\left\|\mathbf{h}_{v}-\mathbf{h}^{\prime}_{v}\right\|\leq k\left(AL_{W}\right)^{k}B\left\|\mathbf{X}\right\|_{2}, (27)

where:

  • 𝐡v\mathbf{h}_{v} and 𝐡v\mathbf{h}^{\prime}_{v} are the embeddings of node vv in 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime}, respectively, after kk GNN layers.

  • 𝐗\mathbf{X} is the node feature matrix.

  • A=nvdmaxdminA=\dfrac{\sqrt{n_{v}d_{\max}}}{d_{\min}}.

  • B=δ+δ(1δ)3/2B=\sqrt{\delta}+\dfrac{\delta}{(1-\delta)^{3/2}}.

  • nvn_{v} is the number of nodes in the kk-hop subgraph around node vv.

  • dmind_{\min} and dmaxd_{\max} are the minimum and maximum degrees in the subgraph.

Proof.

We will prove the lemma by induction on the number of layers ll.

Base Case (l=0l=0).

At layer l=0l=0, before any GNN layers are applied, the embeddings are simply the input features:

𝐇(0)=𝐗,𝐇(0)=𝐗.\mathbf{H}^{(0)}=\mathbf{X},\quad\mathbf{H}^{\prime(0)}=\mathbf{X}. (28)

Thus,

𝐇(0)𝐇(0)F=0.\left\|\mathbf{H}^{(0)}-\mathbf{H}^{\prime(0)}\right\|_{F}=0. (29)

This establishes the base case.

Inductive Step.

Assume that for some l0l\geq 0, the following bound holds:

𝐇(l)𝐇(l)Fl(ALW)lB𝐗2.\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{F}\leq l\left(AL_{W}\right)^{l}B\left\|\mathbf{X}\right\|_{2}. (30)

Our goal is to show that the bound holds for layer l+1l+1:

𝐇(l+1)𝐇(l+1)F(l+1)(ALW)l+1B𝐗2.\left\|\mathbf{H}^{(l+1)}-\mathbf{H}^{\prime(l+1)}\right\|_{F}\leq(l+1)\left(AL_{W}\right)^{l+1}B\left\|\mathbf{X}\right\|_{2}. (31)

The outputs at layer (l+1)(l+1) are:

𝐇(l+1)=ReLU(𝐀~𝐇(l)𝐖(l)),𝐇(l+1)=ReLU(𝐀~𝐇(l)𝐖(l)),\mathbf{H}^{(l+1)}=\text{ReLU}\left(\tilde{\mathbf{A}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right),\quad\mathbf{H}^{\prime(l+1)}=\text{ReLU}\left(\tilde{\mathbf{A}}^{\prime}\mathbf{H}^{\prime(l)}\mathbf{W}^{(l)}\right), (32)

where:

  • 𝐀~\tilde{\mathbf{A}} and 𝐀~\tilde{\mathbf{A}}^{\prime} are the normalized adjacency matrices of the kk-hop subgraphs around node vv in 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime}, respectively.

  • 𝐖(l)\mathbf{W}^{(l)} is the weight matrix of layer ll, with 𝐖(l)2LW\left\|\mathbf{W}^{(l)}\right\|_{2}\leq L_{W}.

Since ReLU is 1-Lipschitz, we have:

𝐇(l+1)𝐇(l+1)F𝐀~𝐇(l)𝐖(l)𝐀~𝐇(l)𝐖(l)F.\left\|\mathbf{H}^{(l+1)}-\mathbf{H}^{\prime(l+1)}\right\|_{F}\leq\left\|\tilde{\mathbf{A}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}-\tilde{\mathbf{A}}^{\prime}\mathbf{H}^{\prime(l)}\mathbf{W}^{(l)}\right\|_{F}. (33)

We can expand the difference as:

𝐀~𝐇(l)𝐖(l)𝐀~𝐇(l)𝐖(l)=𝐀~(𝐇(l)𝐇(l))𝐖(l)T1+(𝐀~𝐀~)𝐇(l)𝐖(l)T2.\tilde{\mathbf{A}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}-\tilde{\mathbf{A}}^{\prime}\mathbf{H}^{\prime(l)}\mathbf{W}^{(l)}=\underbrace{\tilde{\mathbf{A}}\left(\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right)\mathbf{W}^{(l)}}_{T_{1}}+\underbrace{\left(\tilde{\mathbf{A}}-\tilde{\mathbf{A}}^{\prime}\right)\mathbf{H}^{\prime(l)}\mathbf{W}^{(l)}}_{T_{2}}. (34)
Bounding T1T_{1}.

Using the submultiplicative property of norms:

T1F𝐀~F𝐇(l)𝐇(l)2𝐖(l)2.\left\|T_{1}\right\|_{F}\leq\left\|\tilde{\mathbf{A}}\right\|_{F}\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{2}\left\|\mathbf{W}^{(l)}\right\|_{2}. (35)

From properties of 𝐀~\tilde{\mathbf{A}} and 𝐖(l)\mathbf{W}^{(l)}:

𝐀~FA,(as shown below),𝐖(l)2LW.\left\|\tilde{\mathbf{A}}\right\|_{F}\leq A,\quad\text{(as shown below)},\quad\left\|\mathbf{W}^{(l)}\right\|_{2}\leq L_{W}. (36)

Also, since 𝐇(l)𝐇(l)2𝐇(l)𝐇(l)F\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{2}\leq\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{F}, we have:

T1FALW𝐇(l)𝐇(l)F.\left\|T_{1}\right\|_{F}\leq AL_{W}\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{F}. (37)
Bounding 𝐀~F\left\|\tilde{\mathbf{A}}\right\|_{F}.

The entries of 𝐀~\tilde{\mathbf{A}} are:

A~ij=Aijdidj,\tilde{A}_{ij}=\frac{A_{ij}}{\sqrt{d_{i}d_{j}}}, (38)

where Aij{0,1}A_{ij}\in\{0,1\}, and di,djdmind_{i},d_{j}\geq d_{\min}. Therefore,

|A~ij|1dmin.|\tilde{A}_{ij}|\leq\frac{1}{d_{\min}}. (39)

The number of non-zero entries in 𝐀~\tilde{\mathbf{A}} is at most nvdmaxn_{v}d_{\max}. Therefore,

𝐀~Fnvdmaxdmin=A.\left\|\tilde{\mathbf{A}}\right\|_{F}\leq\frac{\sqrt{n_{v}d_{\max}}}{d_{\min}}=A. (40)
Bounding T2T_{2}.

Similarly, we have:

T2F𝐀~𝐀~F𝐇(l)2𝐖(l)2.\left\|T_{2}\right\|_{F}\leq\left\|\tilde{\mathbf{A}}-\tilde{\mathbf{A}}^{\prime}\right\|_{F}\left\|\mathbf{H}^{\prime(l)}\right\|_{2}\left\|\mathbf{W}^{(l)}\right\|_{2}. (41)

From the perturbation analysis:

𝐀~𝐀~FAB.\left\|\tilde{\mathbf{A}}-\tilde{\mathbf{A}}^{\prime}\right\|_{F}\leq AB. (42)

To bound 𝐇(l)2\left\|\mathbf{H}^{\prime(l)}\right\|_{2}, we note that:

𝐇(l)2𝐇(l)F.\left\|\mathbf{H}^{\prime(l)}\right\|_{2}\leq\left\|\mathbf{H}^{\prime(l)}\right\|_{F}. (43)

We can bound 𝐇(l)F\left\|\mathbf{H}^{\prime(l)}\right\|_{F} recursively.

Bounding 𝐇(l)F\left\|\mathbf{H}^{\prime(l)}\right\|_{F}.

At each layer, the output is given by:

𝐇(l)=ReLU(𝐀~𝐇(l1)𝐖(l1)).\mathbf{H}^{\prime(l)}=\text{ReLU}\left(\tilde{\mathbf{A}}^{\prime}\mathbf{H}^{\prime(l-1)}\mathbf{W}^{(l-1)}\right). (44)

Since ReLU is 1-Lipschitz and 𝐀~FA\left\|\tilde{\mathbf{A}}^{\prime}\right\|_{F}\leq A, we have:

𝐇(l)F𝐀~𝐇(l1)𝐖(l1)FALW𝐇(l1)2.\left\|\mathbf{H}^{\prime(l)}\right\|_{F}\leq\left\|\tilde{\mathbf{A}}^{\prime}\mathbf{H}^{\prime(l-1)}\mathbf{W}^{(l-1)}\right\|_{F}\leq AL_{W}\left\|\mathbf{H}^{\prime(l-1)}\right\|_{2}. (45)

Recursively applying this bound from l=0l=0 to ll, and noting that 𝐇(0)2=𝐗2\left\|\mathbf{H}^{\prime(0)}\right\|_{2}=\left\|\mathbf{X}\right\|_{2}, we obtain:

𝐇(l)F(ALW)l𝐗2.\left\|\mathbf{H}^{\prime(l)}\right\|_{F}\leq\left(AL_{W}\right)^{l}\left\|\mathbf{X}\right\|_{2}. (46)

Therefore,

𝐇(l)2(ALW)l𝐗2.\left\|\mathbf{H}^{\prime(l)}\right\|_{2}\leq\left(AL_{W}\right)^{l}\left\|\mathbf{X}\right\|_{2}. (47)

Now we have:

T2FAB(ALW)l𝐗2LW=Al+1BLWl+1𝐗2.\left\|T_{2}\right\|_{F}\leq AB\left(AL_{W}\right)^{l}\left\|\mathbf{X}\right\|_{2}L_{W}=A^{l+1}BL_{W}^{l+1}\left\|\mathbf{X}\right\|_{2}. (48)
Total Bound for 𝐇(l+1)𝐇(l+1)F\left\|\mathbf{H}^{(l+1)}-\mathbf{H}^{\prime(l+1)}\right\|_{F}.

Combining T1T_{1} and T2T_{2}:

𝐇(l+1)𝐇(l+1)FALW𝐇(l)𝐇(l)F+Al+1BLWl+1𝐗2.\left\|\mathbf{H}^{(l+1)}-\mathbf{H}^{\prime(l+1)}\right\|_{F}\leq AL_{W}\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{F}+A^{l+1}BL_{W}^{l+1}\left\|\mathbf{X}\right\|_{2}. (49)
Recursive Relation.

Let Cl=𝐇(l)𝐇(l)FC_{l}=\left\|\mathbf{H}^{(l)}-\mathbf{H}^{\prime(l)}\right\|_{F}. The recursive relation is:

Cl+1ALWCl+Al+1BLWl+1𝐗2.C_{l+1}\leq AL_{W}C_{l}+A^{l+1}BL_{W}^{l+1}\left\|\mathbf{X}\right\|_{2}. (50)

We will prove by induction that:

Cll(ALW)lB𝐗2.C_{l}\leq l\left(AL_{W}\right)^{l}B\left\|\mathbf{X}\right\|_{2}. (51)
Base Case.

For l=0l=0, C0=0C_{0}=0, which satisfies the bound.

Inductive Step.

Assume the bound holds for ll:

Cll(ALW)lB𝐗2.C_{l}\leq l\left(AL_{W}\right)^{l}B\left\|\mathbf{X}\right\|_{2}. (52)

Then for l+1l+1:

Cl+1\displaystyle C_{l+1} ALWCl+Al+1BLWl+1𝐗2\displaystyle\leq AL_{W}C_{l}+A^{l+1}BL_{W}^{l+1}\left\|\mathbf{X}\right\|_{2} (53)
ALW(l(ALW)lB𝐗2)+Al+1BLWl+1𝐗2\displaystyle\leq AL_{W}\left(l\left(AL_{W}\right)^{l}B\left\|\mathbf{X}\right\|_{2}\right)+A^{l+1}BL_{W}^{l+1}\left\|\mathbf{X}\right\|_{2}
=lAl+1LWl+1B𝐗2+Al+1LWl+1B𝐗2\displaystyle=lA^{l+1}L_{W}^{l+1}B\left\|\mathbf{X}\right\|_{2}+A^{l+1}L_{W}^{l+1}B\left\|\mathbf{X}\right\|_{2}
=(l+1)Al+1LWl+1B𝐗2\displaystyle=(l+1)A^{l+1}L_{W}^{l+1}B\left\|\mathbf{X}\right\|_{2}
=(l+1)(ALW)l+1B𝐗2.\displaystyle=(l+1)\left(AL_{W}\right)^{l+1}B\left\|\mathbf{X}\right\|_{2}.

This confirms that the bound holds for l+1l+1.

For l=kl=k, we have:

𝐇(k)𝐇(k)Fk(ALW)kB𝐗2.\left\|\mathbf{H}^{(k)}-\mathbf{H}^{\prime(k)}\right\|_{F}\leq k\left(AL_{W}\right)^{k}B\left\|\mathbf{X}\right\|_{2}. (54)

Since 𝐡v𝐡v𝐇(k)𝐇(k)F\left\|\mathbf{h}_{v}-\mathbf{h}^{\prime}_{v}\right\|\leq\left\|\mathbf{H}^{(k)}-\mathbf{H}^{\prime(k)}\right\|_{F}, we obtain:

𝐡v𝐡vk(ALW)kB𝐗2.\left\|\mathbf{h}_{v}-\mathbf{h}^{\prime}_{v}\right\|\leq k\left(AL_{W}\right)^{k}B\left\|\mathbf{X}\right\|_{2}. (55)

This completes the proof. ∎

Lemma 5 (Minimum Cosine Similarity for Positive Pairs).

For embeddings 𝐳v\mathbf{z}_{v} and 𝐳v\mathbf{z}^{\prime}_{v} produced by a linear projection of GNN outputs, with 𝐳v=𝐳v=1\left\|\mathbf{z}_{v}\right\|=\left\|\mathbf{z}^{\prime}_{v}\right\|=1, the cosine similarity satisfies:

sim(𝐳v,𝐳v)1ϵ22,\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)\geq 1-\frac{\epsilon^{2}}{2}, (56)

where

ϵ=k(nvdmaxdmin)k(δ+δ(1δ)3/2)LWk𝐗2𝐏2cz,\epsilon=\frac{k(\dfrac{\sqrt{n_{v}d_{\max}}}{d_{\min}})^{k}\left(\sqrt{\delta}+\dfrac{\delta}{(1-\delta)^{3/2}}\right)L_{W}^{k}\left\|\mathbf{X}\right\|_{2}\left\|\mathbf{P}\right\|_{2}}{c_{z}}, (57)

and czc_{z} is the lower bound on 𝐳v\left\|\mathbf{z}_{v}\right\| (which equals 11 in this case).

Proof.

The embeddings are computed as 𝐳v=𝐏𝐡v\mathbf{z}_{v}=\mathbf{P}\mathbf{h}_{v} and 𝐳v=𝐏𝐡v\mathbf{z}^{\prime}_{v}=\mathbf{P}\mathbf{h}^{\prime}_{v}. Then,

𝐳v𝐳v𝐏2𝐡v𝐡v.\left\|\mathbf{z}_{v}-\mathbf{z}^{\prime}_{v}\right\|\leq\left\|\mathbf{P}\right\|_{2}\left\|\mathbf{h}_{v}-\mathbf{h}^{\prime}_{v}\right\|. (58)

Using the bound from Lemma 4, we have:

\left\|\mathbf{z}_{v}-\mathbf{z}^{\prime}_{v}\right\|\leq k\left(\frac{\sqrt{n_{v}d_{\max}}}{d_{\min}}\right)^{k}\left(\sqrt{\delta}+\frac{\delta}{(1-\delta)^{3/2}}\right)L_{W}^{k}\left\|\mathbf{X}\right\|_{2}\left\|\mathbf{P}\right\|_{2}=\epsilon, (59)

since c_{z}=1 here.

Since 𝐳v=𝐳v=1\left\|\mathbf{z}_{v}\right\|=\left\|\mathbf{z}^{\prime}_{v}\right\|=1, the cosine similarity satisfies:

sim(𝐳v,𝐳v)=112𝐳v𝐳v21ϵ22.\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)=1-\frac{1}{2}\left\|\mathbf{z}_{v}-\mathbf{z}^{\prime}_{v}\right\|^{2}\geq 1-\frac{\epsilon^{2}}{2}. (60)

Lemma 6 (Refined Negative Pair Similarity Bound).

Assuming that embeddings of different nodes are approximately independent and randomly oriented in high-dimensional space, and that the embedding dimension dd satisfies dlognd\gg\log n, we have, with high probability:

|sim(𝐳v,𝐳u)|ϵ,\left|\text{sim}(\mathbf{z}_{v},\mathbf{z}^{\prime}_{u})\right|\leq\epsilon^{\prime}, (61)

where

ϵ=2lognd.\epsilon^{\prime}=\sqrt{\dfrac{2\log n}{d}}. (62)
Proof.

Since 𝐳v\mathbf{z}_{v} and 𝐳u\mathbf{z}^{\prime}_{u} are unit vectors in d\mathbb{R}^{d} and approximately independent for uvu\neq v, the inner product 𝐳v,𝐳u\langle\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\rangle follows a distribution with mean zero and variance 1d\dfrac{1}{d}. By applying concentration inequalities such as Hoeffding’s inequality or the Gaussian tail bound, for any ϵ>0\epsilon^{\prime}>0:

P(|𝐳v,𝐳u|ϵ)2exp(d(ϵ)22).P\left(\left|\langle\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\rangle\right|\geq\epsilon^{\prime}\right)\leq 2\exp\left(-\dfrac{d(\epsilon^{\prime})^{2}}{2}\right). (63)

Selecting ϵ=2lognd\epsilon^{\prime}=\sqrt{\dfrac{2\log n}{d}}, we get:

P(|𝐳v,𝐳u|2lognd)2n.P\left(\left|\langle\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\rangle\right|\geq\sqrt{\dfrac{2\log n}{d}}\right)\leq\frac{2}{n}. (64)

Using the union bound over all n(n1)n(n-1) pairs, the probability that any pair violates this bound is small when dlognd\gg\log n. ∎
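The following illustrative simulation shows the scale of these similarities for random unit vectors; note that the bound holds for any fixed pair with probability at least 1 − 2/n, so a small fraction of the n(n−1) pairs may still exceed ε′.

```python
# Empirical illustration (not part of the proof): cosine similarities of random
# unit vectors in R^d concentrate at the scale 1/sqrt(d), and only a small
# fraction of pairs exceed eps' = sqrt(2 log n / d).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4096
Z = rng.standard_normal((n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)        # n random unit vectors
sims = Z @ Z.T
off_diag = sims[~np.eye(n, dtype=bool)]              # exclude self-similarity
eps_prime = np.sqrt(2 * np.log(n) / d)               # ~0.058 for n=1000, d=4096
print(f"typical |sim| ~ {np.abs(off_diag).mean():.4f} (1/sqrt(d) = {1/np.sqrt(d):.4f})")
print(f"fraction of pairs with |sim| > eps': {np.mean(np.abs(off_diag) > eps_prime):.4f}")
```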

D.4 Main Theorem

Theorem 1 (InfoNCE Loss Bounds).

Given a graph 𝒢\mathcal{G} with minimum degree dmind_{\min} and maximum degree dmaxd_{\max}, and its augmentation 𝒢\mathcal{G}^{\prime} with local topological perturbation strength δ\delta, for a kk-layer GNN with ReLU activation and weight matrices satisfying 𝐖(l)2LW\left\|\mathbf{W}^{(l)}\right\|_{2}\leq L_{W}, and assuming that the embeddings are normalized (𝐳v=𝐳v=1\left\|\mathbf{z}_{v}\right\|=\left\|\mathbf{z}^{\prime}_{v}\right\|=1), the InfoNCE loss satisfies with high probability:

log(e1/τe1/τ+(n1)eϵ/τ)InfoNCE(𝒢,𝒢)log(e(1ϵ22)/τe(1ϵ22)/τ+(n1)eϵ/τ),-\log\left(\frac{e^{1/\tau}}{e^{1/\tau}+(n-1)e^{-\epsilon^{\prime}/\tau}}\right)\leq\mathcal{L}_{\text{InfoNCE}}(\mathcal{G},\mathcal{G}^{\prime})\leq-\log\left(\frac{e^{\left(1-\dfrac{\epsilon^{2}}{2}\right)/\tau}}{e^{\left(1-\dfrac{\epsilon^{2}}{2}\right)/\tau}+(n-1)e^{\epsilon^{\prime}/\tau}}\right), (65)

where ϵ\epsilon is as defined in Lemma 5 and ϵ\epsilon^{\prime} is as defined in Lemma 6.

Proof.

For the positive pairs (same node in 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime}), from Lemma 5:

sim(𝐳v,𝐳v)1ϵ22.\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)\geq 1-\frac{\epsilon^{2}}{2}. (66)

For the negative pairs (different nodes), from Lemma 6, with high probability:

|sim(𝐳v,𝐳u)|ϵ,uv.\left|\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\right)\right|\leq\epsilon^{\prime},\quad\forall u\neq v. (67)

The InfoNCE loss for node vv is:

v=logexp(sim(𝐳v,𝐳v)/τ)exp(sim(𝐳v,𝐳v)/τ)+uvexp(sim(𝐳v,𝐳u)/τ).\mathcal{L}_{v}=-\log\frac{\exp\left(\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)/\tau\right)}{\exp\left(\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{v}\right)/\tau\right)+\sum_{u\neq v}\exp\left(\text{sim}\left(\mathbf{z}_{v},\mathbf{z}^{\prime}_{u}\right)/\tau\right)}. (68)

For the upper bound on v\mathcal{L}_{v}, we use the minimal positive similarity and maximal negative similarity:

vloge(1ϵ22)/τe(1ϵ22)/τ+(n1)eϵ/τ.\mathcal{L}_{v}\leq-\log\frac{e^{\left(1-\dfrac{\epsilon^{2}}{2}\right)/\tau}}{e^{\left(1-\dfrac{\epsilon^{2}}{2}\right)/\tau}+(n-1)e^{\epsilon^{\prime}/\tau}}. (69)

For the lower bound on v\mathcal{L}_{v}, we use the maximal positive similarity and minimal negative similarity:

vloge1/τe1/τ+(n1)eϵ/τ.\mathcal{L}_{v}\geq-\log\frac{e^{1/\tau}}{e^{1/\tau}+(n-1)e^{-\epsilon^{\prime}/\tau}}. (70)

Since this holds for all nodes vv, averaging over all nodes, we obtain the bounds for InfoNCE(𝒢,𝒢)\mathcal{L}_{\text{InfoNCE}}(\mathcal{G},\mathcal{G}^{\prime}). ∎

D.5 Numerical Estimation

To assess how tight the bound is while keeping d_{\min} moderate (e.g., d_{\min}=10), we perform a numerical estimation.

Suppose:

  • Number of nodes: n=1000n=1000.

  • Embedding dimension: d=4096d=4096.

  • Minimum degree: dmin=10d_{\min}=10.

  • Maximum degree: dmax=30d_{\max}=30.

  • Layer count: k=1k=1.

  • Weight matrix norm: LW=0.5L_{W}=0.5.

  • Input feature norm: 𝐗2=1\|\mathbf{X}\|_{2}=1.

  • Projection matrix norm: 𝐏2=1\|\mathbf{P}\|_{2}=1.

  • Temperature: τ=0.5\tau=0.5.

  • Local perturbation strength: δ=0.1\delta=0.1.

Compute ϵ\epsilon:

Assuming nv30n_{v}\approx 30,

nvdmax\displaystyle\sqrt{n_{v}d_{\max}} =30×30=900=30,\displaystyle=\sqrt{30\times 30}=\sqrt{900}=30,
nvdmaxdmin\displaystyle\frac{\sqrt{n_{v}d_{\max}}}{d_{\min}} =3010=3,\displaystyle=\frac{30}{10}=3,
δ\displaystyle\sqrt{\delta} =0.10.316228,\displaystyle=\sqrt{0.1}\approx 0.316228,
δ(1δ)3/2\displaystyle\frac{\delta}{(1-\delta)^{3/2}} 0.1(10.1)1.50.10.8538140.117121,\displaystyle\approx\frac{0.1}{(1-0.1)^{1.5}}\approx\frac{0.1}{0.853814}\approx 0.117121,
δ+δ(1δ)3/2\displaystyle\sqrt{\delta}+\frac{\delta}{(1-\delta)^{3/2}} 0.316228+0.117121=0.433349,\displaystyle\approx 0.316228+0.117121=0.433349,
ϵ\displaystyle\epsilon =1×3×0.433349×0.5×1×10.650.\displaystyle=1\times 3\times 0.433349\times 0.5\times 1\times 1\approx 0.650.

Compute ϵ\epsilon^{\prime}:

ϵ=2lognd=13.815540960.0033740.05805.\epsilon^{\prime}=\sqrt{\dfrac{2\log n}{d}}=\sqrt{\dfrac{13.8155}{4096}}\approx\sqrt{0.003374}\approx 0.05805.

Compute exponents:

For the upper bound:

1ϵ22τ=1.5775,ϵτ=0.058050.5=0.1161.\frac{1-\dfrac{\epsilon^{2}}{2}}{\tau}=1.5775,\quad\frac{\epsilon^{\prime}}{\tau}=\frac{0.05805}{0.5}=0.1161.

For the lower bound:

1τ=2,ϵτ=0.1161.\frac{1}{\tau}=2,\quad\frac{-\epsilon^{\prime}}{\tau}=-0.1161.

Compute numerator and denominator for the upper bound:

Numerator 4.8426,\displaystyle\approx 4.8426,
Denominator 1126.8881.\displaystyle\approx 1126.8881.

Compute numerator and denominator for the lower bound:

Numerator 7.3891,\displaystyle\approx 7.3891,
Denominator 896.5990.\displaystyle\approx 896.5990.

Compute the InfoNCE loss bounds:

\mathcal{L}_{\text{upper}} =-\log\left(\frac{4.8426}{1126.8881}\right)=-\log(0.004297)\approx 5.4497,
\mathcal{L}_{\text{lower}} =-\log\left(\frac{7.3891}{896.5990}\right)=-\log(0.008241)\approx 4.7989.
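The short script below, assuming the constants listed above, reproduces these numbers.

```python
# Plug the assumed constants into eps, eps', and the InfoNCE bounds of Theorem 1.
import numpy as np

n, d = 1000, 4096
d_min, d_max, n_v = 10, 30, 30
k, L_W, X_norm, P_norm = 1, 0.5, 1.0, 1.0
tau, delta = 0.5, 0.1

A = np.sqrt(n_v * d_max) / d_min
B = np.sqrt(delta) + delta / (1 - delta) ** 1.5
eps = k * A**k * B * L_W**k * X_norm * P_norm            # ~0.650
eps_p = np.sqrt(2 * np.log(n) / d)                       # ~0.058

upper = -np.log(np.exp((1 - eps**2 / 2) / tau)
                / (np.exp((1 - eps**2 / 2) / tau) + (n - 1) * np.exp(eps_p / tau)))
lower = -np.log(np.exp(1 / tau)
                / (np.exp(1 / tau) + (n - 1) * np.exp(-eps_p / tau)))
print(round(lower, 4), round(upper, 4), round(upper - lower, 4))  # ~4.80, ~5.45, ~0.65
```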
Interpretation.

The numerical gap between the upper and lower bounds, calculated as 5.44974.7989=0.65085.4497-4.7989=0.6508, is notably narrow. This tight interval highlights a key observation: shallow GNNs face intrinsic challenges in effectively exploiting spectral enhancement techniques. This is due to their restricted capacity to represent and process the spectral characteristics of a graph, irrespective of the complexity of the spectral modifications applied. The findings suggest that tuning fundamental augmentation parameters, such as perturbation strength, may exert a more pronounced influence on learning outcomes than intricate spectral alterations. While the theoretical rationale behind spectral augmentations is well-motivated, their practical utility might only be realized when paired with deeper GNNs capable of leveraging augmented spectral information across multiple layers of message propagation.

Appendix E More experiments

E.1 Effect of numbers of GCN Layers

We explore the impact of GCN depth on accuracy by testing GCNs with 4, 6, and 8 layers, using our edge perturbation methods alongside SPAN baselines. Experiments were conducted with the GRACE and G-BT frameworks on the Cora dataset for node classification and the MUTAG dataset for graph classification. Each configuration was run three times, with the mean accuracy and standard deviation reported.

Overall, deeper GCNs (6 and 8 layers) tend to perform worse across both tasks, reinforcing the observation that deeper architectures, despite their theoretical expressive power, may negatively impact the quality of learned representations. The results are summarized in Tables 7 and 8.

Table 7: Impact of GCN depth on node classification task on the Cora dataset. The best result of each column is in grey. Metric is accuracy (%).
Model 4 6 8
GBT+DropEdge 83.53±\pm 1.48 82.06±\pm 3.45 80.88±\pm 1.38
GBT +AddEdge 81.99±\pm 0.79 79.04±\pm 1.59 79.41±\pm 1.98
GBT+SPAN 80.39±\pm 2.17 81.25±\pm 1.67 79.41±\pm 1.87
GRACE+DropEdge 82.35±\pm 1.08 82.47±\pm 1.35 81.74±\pm 2.42
GRACE +AddEdge 79.17 ±\pm1.35 78.80±\pm 0.96 81.00±\pm 0.17
GRACE+SPAN 80.15±\pm 0.30 80.15±\pm 0.79 75.98±\pm 1.54
Table 8: Impact of GCN depth on graph classification task on the MUTAG dataset. The best result of each column is in grey. Metric is accuracy (%).
Model 4 6 8
GBT+DropEdge 90.74 ±\pm 2.61 88.88 ±\pm 4.53 88.88 ±\pm 7.85
GBT +AddEdge 94.44 ±\pm 0.00 94.44 ±\pm 4.53 94.44 ±\pm 4.53
GBT+SPAN 94.44 ±\pm 4.53 92.59 ±\pm 2.61 90.74 ±\pm 2.61
GRACE+DropEdge 94.44 ±\pm 0.00 90.74 ±\pm 2.61 90.74 ±\pm 2.61
GRACE +AddEdge 92.59 ±\pm 5.23 94.44 ±\pm 4.53 94.44 ±\pm 0.00
GRACE+SPAN 90.74 ±\pm 2.61 90.74 ±\pm 5.23 88.88 ±\pm 7.85

E.2 Effect of GNN encoder

To further validate the generality of our approach, we conducted additional experiments using different GNN encoders. For the node classification task, we evaluated the Cora dataset with GAT as the encoder, while for the graph classification task, we performed experiments on the MUTAG dataset using both GAT and GPS as encoders.

The results, presented in Tables 9 and 10, are shown alongside the results obtained with GCN encoders. These findings demonstrate that our simple edge perturbation method consistently outperforms the baselines, regardless of the choice of the encoder. This confirms that our conclusions hold across different encoder architectures, underscoring the robustness and effectiveness of the proposed approach.

Table 9: Accuracy of node classification with different GNN encoders on Cora dataset. The best result of each column is in grey. Metric is accuracy (%).
Model GCN GAT
MVGRL+SPAN 84.57 ±\pm 0.22 82.90 ±\pm 0.86
MVGRL+DropEdge 84.31 ±\pm 1.95 83.21 ±\pm 1.41
MVGRL +AddEdge 83.21 ±\pm 1.65 83.33 ±\pm 0.17
GBT+SPAN 82.84 ±\pm 0.90 83.47 ±\pm 0.39
GBT + DropEdge 84.19 ±\pm 2.07 84.06 ±\pm 1.05
GBT + AddEdge 85.78 ±\pm 0.62 81.49 ±\pm 0.45
GRACE + SPAN 82.84 ±\pm 0.91 82.74 ±\pm 0.47
GRACE + DropEdge 84.19 ±\pm 2.07 82.84 ±\pm 2.58
GRACE + AddEdge 85.78 ±\pm 0.62 82.84 ±\pm 1.21
BGRL + SPAN 83.33 ±\pm 0.45 82.59 ±\pm 0.79
BGRL + DropEdge 83.21 ±\pm 3.29 80.88 ±\pm 1.08
BGRL + AddEdge 81.49 ±\pm 1.21 82.23 ±\pm 2.00
Table 10: Accuracy of graph classification with different GNN encoders on MUTAG dataset. The best result of each column is in grey. Metric is accuracy (%).
Model GCN GAT GPS
MVGRL+SPAN 93.33 ±\pm 2.22 96.29 ±\pm 2.61 94.44 ±\pm 0.00
MVGRL+DropEdge 93.33 ±\pm 2.22 92.22 ±\pm 3.68 96.26 ±\pm 5.23
MVGRL +AddEdge 94.44 ±\pm 3.51 94.44 ±\pm 6.57 95.00 ±\pm 5.24
GBT+SPAN 90.00 ±\pm 6.47 94.44 ±\pm 4.53 90.74 ±\pm 5.23
GBT + DropEdge 92.59 ±\pm 2.61 94.44 ±\pm 4.53 94.44 ±\pm 4.53
GBT + AddEdge 92.59 ±\pm 2.61 92.59 ±\pm 2.61 94.44 ±\pm 4.53
GRACE + SPAN 90.00 ±\pm 4.15 96.29 ±\pm 2.61 92.59 ±\pm 2.61
GRACE + DropEdge 88.88 ±\pm 3.51 94.44 ±\pm 0.00 94.44 ±\pm 4.53
GRACE + AddEdge 92.22 ±\pm 4.22 96.29 ±\pm 2.61 94.44 ±\pm 0.00
BGRL + SPAN 90.00 ±\pm 4.15 94.44 ±\pm 4.53 94.44 ±\pm 0.00
BGRL + DropEdge 88.88 ±\pm 4.96 90.74 ±\pm 4.54 92.59 ±\pm 5.23
BGRL + AddEdge 91.11 ±\pm 5.66 96.29 ±\pm 2.61 96.29 ±\pm 2.61

E.3 Relationship between spectral cues and performance of EP

Based on the findings in Sec. 7.1, it is likely that spectral information is not sufficiently discriminative to drive good representation learning on graphs. To more directly answer the question of whether spectral cues play an important role in the learning performance of EP, we conduct a statistical analysis of the influence of various factors on learning performance. The results are consistent with our claim that spectral cues contribute little to the strong accuracy observed in Sec. 6.

E.3.1 Statistical analyses on key factors on performance of EP

From a statistical angle, several factors can potentially influence learning performance, such as the parameters of EP (i.e., the drop rate p in DropEdge or the add rate q in AddEdge) as well as potential spectral cues in the augmented graphs. Therefore, to rule out the possibility that spectral cues and information are significant, we compare the impact of the EP parameters in the augmentations against:

  1. 1.

    The average L_{2}-distance between the spectrum of the original graph (OG) and that of each augmented graph (AUG) produced by the EP augmentation, denoted as OG-AUG.

  2. 2.

    The average L_{2}-distance between the spectra of the pair of augmented graphs used in the same learning epoch in a two-view contrastive learning framework such as G-BT, denoted as AUG-AUG.

Two statistical analyses are carried out to show that the EP parameters are a more critical determinant and a more direct cause of model efficacy than these spectral distances. Each analysis was chosen for its ability to dissect and compare the impact of edge perturbation parameters versus spectral changes.

Due to the high cost of computing the spectrum of all AUGs in each epoch and the stability of the spectrum in the node-level setting (the original graph is fixed in the experiment), we perform this experiment on the contrastive framework and augmentation method with the best performance in the study, i.e., G-BT with DropEdge on node-level classification. We also choose a small dataset, Cora, for the analysis; note that the smaller the graph, the more likely the spectral distance is to have a significant influence on the graph topology.
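The sketch below shows how the OG-AUG and AUG-AUG statistics can be computed for a given drop rate p; the augment argument is a placeholder for the EP augmentation (e.g., DropEdge applied to the adjacency matrix).

```python
# Sketch of the two spectral statistics compared against the drop rate p:
# the average spectrum distance between the original graph and its augmentations
# (OG-AUG), and between the two augmentations of the same epoch (AUG-AUG).
import numpy as np


def spectrum(A: np.ndarray) -> np.ndarray:
    """Eigenvalues of the normalized Laplacian I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(1)
    inv_sqrt = 1.0 / np.sqrt(np.clip(d, 1e-12, None))
    L_norm = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.linalg.eigvalsh(L_norm)


def spectral_stats(A_og: np.ndarray, augment, p: float, num_epochs: int = 20, seed: int = 0):
    """augment(A, p, rng) -> perturbed adjacency; returns (OG-AUG, AUG-AUG) averages."""
    rng = np.random.default_rng(seed)
    s_og = spectrum(A_og)
    og_aug, aug_aug = [], []
    for _ in range(num_epochs):
        s1 = spectrum(augment(A_og, p, rng))
        s2 = spectrum(augment(A_og, p, rng))
        og_aug.append(np.linalg.norm(s_og - s1))
        aug_aug.append(np.linalg.norm(s1 - s2))
    return float(np.mean(og_aug)), float(np.mean(aug_aug))
```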

Analysis 1: Polynomial Regression.

Polynomial regression was utilized to directly model the relationship between the test accuracy of the model and either the drop rate p or the average spectral distances introduced by EP. This method captures the linear or non-linear influence that each factor may exert on the learning outcome, thereby providing insight into how different parameters affect model performance.
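A minimal version of this analysis, assuming arrays of per-run regressor values x (drop rate or spectral distance) and test accuracies acc, is sketched below with statsmodels OLS.

```python
# Sketch of the polynomial regression analysis: regress test accuracy on a
# single regressor with linear or quadratic terms and report R-squared,
# adjusted R-squared, the F-statistic, and its p-value. `x` and `acc` are
# placeholder 1-D arrays of experiment results.
import numpy as np
import statsmodels.api as sm


def poly_regression(x: np.ndarray, acc: np.ndarray, order: int = 2):
    X = np.column_stack([x**i for i in range(1, order + 1)])
    X = sm.add_constant(X)                 # intercept term
    fit = sm.OLS(acc, X).fit()
    return fit.rsquared, fit.rsquared_adj, fit.fvalue, fit.f_pvalue
```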

Table 11: Polynomial regression of node-level accuracy on the drop rate p in DropEdge, the average spectral distance between OG and AUG (OG-AUG), and the average spectral distance between AUG pairs (AUG-AUG). The method is G-BT and the dataset is Cora. The best results are in grey.
Order of the regression Regressor R-squared \uparrow Adj. R-squared \uparrow F-statistic \uparrow P-value \downarrow
1 (i.e. linear) Drop rate p 0.628 0.621 81.12 6.94e-12
OG-AUG 0.388 0.375 30.45 1.35e-06
AUG-AUG 0.338 0.325 24.55 9.39e-06
2 (i.e. quadratic) Drop rate p 0.844 0.837 126.9 1.14e-19
OG-AUG 0.721 0.709 60.78 9.23e-14
AUG-AUG 0.597 0.580 34.88 5.16e-10

The polynomial regression analysis in Table 11 highlights that the drop rate p is the primary factor influencing model performance, showing strong and significant linear and non-linear relationships with test accuracy. In contrast, both the OG-AUG and AUG-AUG spectral distances have relatively minor impacts on performance, indicating that they are not significant determinants of the model’s efficacy.

Analysis 2: Instrumental Variable Regression.

To study the causal relationship, we perform an Instrumental Variable Regression (IVR) to rigorously evaluate the influence of spectral information and edge perturbation parameters on the performance of CG-SSL models. Specifically, we employ a Two-Stage Least Squares (IV2SLS) method to address potential endogeneity issues and obtain unbiased estimates of the causal effects.

In IV2SLS analysis, we define the variables as follows:

  • Y (Dependent Variable): The outcome we aim to explain or predict, which in this case is the performance of the SSL model.

  • X (Explanatory Variable): The variable that we believe directly influences Y. It is the primary factor whose effect on Y we want to measure.

  • Z (Instrumental Variable): A variable that is correlated with X but not with the error term in the Y equation. It helps to isolate the variation in X that is exogenous, providing a means to obtain unbiased estimates of X’s effect on Y.

In this specific experiment, we conduct four separate regressions to compare the causal effects of these factors:

  1. 1.

    (X = AUG-AUG, Z = Parameter): Examines the relationship where the spectral distance between augmented graphs (AUG-AUG) is the explanatory variable (X) and edge perturbation parameters are the instrument (Z).

  2. 2.

    (X = Parameter, Z = AUG-AUG): Examines the relationship where the edge perturbation parameters are the explanatory variable (X) and the spectral distance between augmented graphs (AUG-AUG) is the instrument (Z).

  3. 3.

    (X = OG-AUG, Z = Parameter): Examines the relationship where the spectral distance between the original and augmented graphs (OG-AUG) is the explanatory variable (X) and edge perturbation parameters are the instrument (Z).

  4. 4.

    (X = Parameter, Z = OG-AUG): Examines the relationship where the edge perturbation parameters are the explanatory variable (X) and the spectral distance between the original and augmented graphs (OG-AUG) is the instrument (Z).
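For illustration, the sketch below implements the two-stage procedure directly with NumPy on placeholder arrays y (accuracy), x (explanatory variable), and z (instrument); in practice a dedicated IV library would be used instead of this hand-rolled version.

```python
# Sketch of two-stage least squares (2SLS): first regress the explanatory
# variable X on the instrument Z, then regress the outcome Y on the fitted
# values of X. Returns the 2SLS slope and the second-stage R-squared.
import numpy as np


def two_stage_least_squares(y: np.ndarray, x: np.ndarray, z: np.ndarray):
    Z = np.column_stack([np.ones_like(z), z])
    # Stage 1: project X onto the instrument.
    beta1, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ beta1
    # Stage 2: regress Y on the fitted X.
    Xh = np.column_stack([np.ones_like(x_hat), x_hat])
    beta2, *_ = np.linalg.lstsq(Xh, y, rcond=None)
    y_hat = Xh @ beta2
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta2[1], 1.0 - ss_res / ss_tot
```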

Table 12: IV2SLS regression results for the node-level task. The parameter p refers to the drop rate in DropEdge. The experiments come in pairs for each pair of variables, and the better result in each pair is marked in grey.
Variable settings R-squared \uparrow F-statistic \uparrow Prob (F-statistic) \downarrow
(X = AUG-AUG, Z = p) 0.341 45.77 1.68e-08
(X = p, Z = AUG-AUG) 0.611 47.85 9.85e-09
(X = OG-AUG, Z = p) 0.250 40.22 7.51e-08
(X = p, Z = OG-AUG) 0.606 41.27 5.62e-08

The IV2SLS regression results for the node-level task in Table 12 indicate that the edge perturbation parameter is a more significant determinant of model performance than the spectral distances. Specifically, when the spectral distance between augmented graphs (AUG-AUG) is the explanatory variable (X) and the drop rate p is the instrument (Z), the model explains 34.1% of the variance in performance (R-squared = 0.341). Conversely, when the roles are reversed (X = p, Z = AUG-AUG), the model explains 61.1% of the variance (R-squared = 0.611), indicating a stronger influence of the edge perturbation parameter p. A similar conclusion can be drawn when comparing OG-AUG and p.

Summary of Regression Analyses

The analyses distinctly show that the direct edge perturbation parameters have a consistently stronger and more significant impact on model performance than the two types of spectral distances, which serve as a reflection of spectral information. The results support the argument that, while spectral information might contribute to model performance, its significance is very limited, and the parameters of the EP methods themselves are the more critical determinants.