Stability and Generalization of $\ell_p$-Regularized Stochastic Learning for GCN
Abstract
Graph convolutional networks (GCN) are viewed as one of the most popular representations among the variants of graph neural networks over graph data and have shown powerful performance in empirical experiments. $\ell_2$-based graph smoothing enforces the global smoothness of GCN, while (soft) $\ell_1$-based sparse graph learning tends to promote signal sparsity at the cost of discontinuity. This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general $\ell_p$-regularized ($1 < p \le 2$) stochastic learning framework proposed within. While stability-based generalization analyses have been given in prior work for objective functions with a second derivative, our $\ell_p$-regularized learning scheme does not satisfy such a smoothness condition. To tackle this issue, we propose a novel SGD proximal algorithm for GCNs with an inexact operator. For a single-layer GCN, we establish an explicit theoretical understanding of GCN with the $\ell_p$-regularized stochastic learning by analyzing the stability of our SGD proximal algorithm. We conduct multiple empirical experiments to validate our theoretical findings.
1 Introduction
Graph Neural Networks (GNNs) have emerged as a family of powerful model designs for improving the performance of neural network models on graph-structured data. GNNs have delivered remarkable empirical performance across a diverse set of domains, such as social networks, knowledge graphs, and biological networks Duvenaud et al. (2015); Battaglia et al. (2016); Defferrard et al. (2016); Jin et al. (2018); Barrett et al. (2018); Yun et al. (2019); Zhang et al. (2020). In fact, GNNs can be viewed as natural extensions of conventional machine learning to any data whose available structure is given by pairwise relationships.
The architecture designs of various GNNs have been motivated mainly by the spectral domain Defferrard et al. (2016); Kipf and Welling (2016a) and the spatial domain Hamilton et al. (2017); Gilmer et al. (2017). Popular variants of graph neural networks include the Graph Convolutional Network (GCN) Bruna et al. (2013), GraphSAGE Hamilton et al. (2017), the Graph Attention Network Veličković et al. (2017), and the Graph Isomorphism Network Xu et al. (2018), among others.
Specifically, inheriting the excellent performance of traditional convolutional neural networks in processing images and time series, a standard GCN Kipf and Welling (2016a) also consists of a cascade of layers, but operates directly on a graph and induces embedding vectors of nodes based on properties of their neighborhoods. Formally, GCN is defined as the problem of learning filter parameters in the graph Fourier transform. GCNs have shown superior performance on real datasets from various domains, such as node labeling on social networks Kipf and Welling (2016b), link prediction in knowledge graphs Schlichtkrull et al. (2018), and molecular graph classification in quantum chemistry Gilmer et al. (2017).
Notably, a recent study Ma et al. (2021) has proven that GCN, and even general message passing models, intrinsically perform $\ell_2$-based graph signal smoothing, which enforces smoothness globally, so that the level of smoothness is often shared across the whole graph. As opposed to $\ell_2$-based graph smoothing, $\ell_1$-based methods tend to penalize large values less and thus better preserve the discontinuity of non-smooth signals Nie et al. (2011); Wang et al. (2015); Liu et al. (2021). Essentially, $\ell_1$-based methods are equivalent to soft-thresholding operations for iterative estimators and guarantee statistical properties (e.g., model selection consistency). Owing to these advantages, trend filtering Tibshirani (2014) and graph trend filtering Wang et al. (2015); Verma and Zhang (2019) indicate that $\ell_1$-based graph smoothing can adapt to an inhomogeneous level of smoothness of signals and yield estimators that are piecewise polynomial functions, such as piecewise constant, linear, and quadratic functions, depending on the order of the graph difference operator.
To enhance the local smoothness adaptivity of GCNs, a family of elastic-type GCNs with a combination of $\ell_1$- and $\ell_2$-based penalties is proposed by Liu et al. (2021), which demonstrates that elastic GCNs obtain better adaptivity on benchmark datasets and are significantly more robust to graph adversarial attacks.
Under the regularized learning framework, this paper further studies a class of $\ell_p$-regularized learning approaches with $p \in (1, 2]$, in order to trade off the local smoothness of GCNs against the sparsity between nodes. To be precise, it has been shown that the extreme case of $\ell_p$-regularized learning with $p$ close to $1$ tends to generate soft sparsity of solutions Koltchinskii (2009). In analogy to elastic-type GCNs, general $\ell_p$-regularization can be interpreted as an interpolation between $\ell_1$-regularization and $\ell_2$-regularization. However, in contrast with elastic-type GCNs, $\ell_p$-regularized GCNs involve only one regularization parameter $\lambda$, but lead to some additional technical difficulties in optimization and theoretical analysis.
The generalization performance of an algorithm has always been a central issue in learning theory, and the generalization guarantees of GNNs in particular have attracted a considerable amount of attention in recent years Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020).
It is worth noting that, although $\ell_1$-regularization possesses a number of attractive properties such as “automatic” variable selection, the objective with $p = 1$ is not a strictly convex smooth function and has been proved not to be uniformly stable Xu et al. (2011). On the other hand, it is known that for $p \in (1, 2]$, the penalty is a strictly convex function over bounded domains and thus enjoys robustness to some extent. As a bridge between sparse $\ell_1$-regularization and dense $\ell_2$-regularization, $\ell_p$-based learning allows us to explicitly observe a trade-off between sparsity and smoothness of the estimated learners.
Although this idea is natural within the framework of regularized learning, it still faces many technical challenges when applied to GCNs. First, to address the problem of uniform stability of an SGD algorithm, which induces generalization performance Bousquet and Elisseeff (2002), existing related analyses of SGD Verma and Zhang (2019); Hardt et al. (2016) require that the first derivative of the objective is Lipschitz continuous, which is not applicable to the $\ell_p$-based GCN. Second, to derive an interpretable generalization bound, it is important to know how such a result depends on the structure of the graph filter and the regularization parameter $\lambda$, as well as the network size and the sample size.
Overall, our principal contributions can be summarized as follows:
- We introduce $\ell_p$-regularized ($1 < p \le 2$) learning approaches for a one-layer GCN to provide an explicit theoretical characterization of the trade-off between local smoothness and sparsity of the SGD algorithm for GCN. Crucially, we analyze how the graph structure and the regularization parameter $\lambda$ affect the generalization capacity of our SGD method.
- We propose a novel regularized stochastic algorithm for GCN, i.e., Inexact Proximal SGD, by integrating the standard SGD projection with a proximal operator. For our proposed method, we derive interpretable generalization bounds in terms of the graph structure, the regularization parameter $\lambda$, the network width, and the sample size.
- To our knowledge, we are the first to analyze the generalization performance of $\ell_p$-regularized SGD in GNNs, which is quite different from existing stability-based generalization analyses for twice-differentiable objectives Verma and Zhang (2019); Hardt et al. (2016). We also have to overcome additional challenges, posed by the message passing nature of GCN, in understanding the behavior of the stochastic gradient for GCN. We conduct several numerical experiments to illustrate the superiority of our method over traditional smoothness-based GCN, and we also observe sparse solutions in our experiments as $p$ gets sufficiently close to $1$.
1.1 Additional Related Work
In this subsection, we briefly review two kinds of related works: Generalization Analysis for GNNs and Regularized Schemes on GNNs.
Generalization Analysis for GNNs.
Some previous work has attempted to address the generalization guarantees of GNNs, including Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020). However, most of these works established uniform convergence results using classical Rademacher complexity Garg et al. (2020); Oono and Suzuki (2020) and PAC-Bayesian bounds Liao et al. (2020). Compared to these abstract capacity notions, the stability concept, used recently for GNNs in Verma and Zhang (2019), is more intuitive and is defined directly over a specific algorithm. Although stability for the generalization of GNNs has been considered recently by Verma and Zhang (2019), a major difference from their work is that we focus on the trade-off between soft sparsity and generalization of the $\ell_p$-based GCN with varying $p$. Moreover, we propose a new SGD algorithm for GCN, under which we provide a novel theoretical proof for the stability bound of our SGD.
Regularized Schemes on GNNs.
Regularization methods are frequently used in machine learning; in particular, the prior literature Wibisono et al. (2009) has studied the influence of $\ell_p$-regularized learning on generalization ability. Note that this previous work only considered the impact of general regularization estimates on stability without a specific algorithm. In addition, their research is based on regular data and does not involve any graph structure. As far as we know, this paper is the first to consider $\ell_p$-regularized learning in a graph model.
2 Preliminaries and Methodology
In this section, we first describe basic notation on graphs and the standard version of GCN. Then we introduce structural empirical risk minimization for a single-layer GCN model under an i.i.d. sampling process, from which our $\ell_p$-regularized approach is naturally formulated to estimate the graph filter parameters.
A graph is represented as $G = (V, E)$, where $V = \{v_1, \ldots, v_N\}$ is a set of $N$ nodes and $E$ is a set of edges. The adjacency matrix of the graph is denoted by $A \in \mathbb{R}^{N \times N}$, whose entries satisfy $A_{ij} = 1$ if $(v_i, v_j) \in E$, and $A_{ij} = 0$ otherwise. Each node's own feature vector is denoted by $x_i \in \mathbb{R}^{d}$, $i = 1, \ldots, N$, where $d$ is the dimension of the node feature. Let $X \in \mathbb{R}^{N \times d}$ denote the node feature matrix with each row being $x_i^{T}$. The $k$-hop neighborhood of a node $v_i$ is defined as the set of nodes whose shortest-path distance to $v_i$ is at most $k$, and we denote by $\mathcal{N}(i)$ the set that includes the node $v_i$ and all nodes belonging to its $k$-hop neighborhood. The main task of graph models is to combine the feature information and the edge information to perform learning tasks.
For an undirected graph, its Laplacian matrix is defined as $L = D - A$, where $D$ is the diagonal degree matrix whose diagonal entries are $D_{ii} = \sum_{j} A_{ij}$ for $i = 1, \ldots, N$. The positive semi-definite matrix $L$ has an eigen-decomposition written as $L = U \Lambda U^{T}$, where the columns of $U$ are the eigenvectors of $L$ and the entries of the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ are the non-negative eigenvalues of $L$.
For a fixed function $g$, we define a graph filter $g(L)$ as a function of the graph Laplacian $L$. Following the eigen-decomposition of $L$, we get $g(L) = U g(\Lambda) U^{T}$, where the eigenvalues of $g(L)$ are given by $g(\lambda_1), \ldots, g(\lambda_N)$. We define $\lambda^{\max}_{G} = \max_{i} |g(\lambda_i)|$ as the largest absolute eigenvalue of the graph filter $g(L)$.
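As a concrete illustration, the following minimal sketch (assumed helper code, not from the paper) builds a graph filter $g(L)$ from an adjacency matrix and computes the largest absolute eigenvalue $\lambda^{\max}_{G}$ that appears in the stability bounds below; the specific choice $g(\lambda) = 1 - \lambda$ is only an example.

```python
import numpy as np

def graph_filter(A, g=lambda lam: 1.0 - lam):
    """Return g(L) for the unnormalized Laplacian L = D - A, applying the
    scalar function g to the eigenvalues of L."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    lam, U = np.linalg.eigh(L)          # L is symmetric positive semi-definite
    return U @ np.diag(g(lam)) @ U.T

def max_abs_filter_eigenvalue(A, g=lambda lam: 1.0 - lam):
    """Largest absolute eigenvalue of g(L), i.e. max_i |g(lambda_i)|."""
    D = np.diag(A.sum(axis=1))
    lam = np.linalg.eigvalsh(D - A)
    return float(np.max(np.abs(g(lam))))

# Example: a 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(max_abs_filter_eigenvalue(A))
```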
In this paper, we are concerned with node-level semi-supervised learning problems over the graph $G$. Let $\mathcal{X} \subseteq \mathbb{R}^{d}$ be the input space in which the node features are well defined, and accordingly let $\mathcal{Y}$ be the output space. In the semi-supervised setting, one assumes that only a portion of the training samples are labeled, while large amounts of unlabeled data are collected easily. Precisely, we only collect a training set $S = \{z_i = (x_i, y_i)\}_{i=1}^{m}$ with labels, where $m \le N$. For statistical inference, one often assumes that these pairwise samples are independently drawn from a joint distribution $\mathcal{D}$ defined on $\mathcal{X} \times \mathcal{Y}$. In this case, our studied model belongs to node-focused tasks on graphs, as opposed to graph-focused tasks where the whole graph is viewed as a single sample.
The simplest graph neural network, known as the vanilla GCN, was proposed in Kipf and Welling (2016a), where the output of each layer of a multilayer network is multiplied by the graph filter before applying a nonlinear activation function. In matrix form, a conventional multi-layer GCN is represented by the layer-wise propagation rule
$$H^{(k)} = \sigma\big(g(L)\, H^{(k-1)} W^{(k)}\big), \qquad (2.1)$$
where $H^{(k)}$ is the node feature representation output by the $k$-th GCN layer, and in particular $H^{(0)} = X$. Here $W^{(k)}$ represents the estimated weight matrix of the $k$-th GCN layer, and $\sigma(\cdot)$ is a point-wise nonlinear activation function. In the context of GCN performing semi-supervised learning, the sampling procedure of nodes from the graph is conducted in two stages. We assume node data are sampled in an i.i.d. manner by first choosing a sample $(x_i, y_i)$ (or $x_i$ alone for an unlabeled node) at node $v_i$, and then extracting its neighbors from $G$ to form an ego-graph.
To interpret the learning mechanism of GCN clearly, this paper focuses on a node-level task over the graph with a single-layer GCN model. In this case, putting all graph nodes together, the output function can be written in matrix form as follows:
$$f_{W}(X) = \sigma\big(g(L)\, X\, W\big), \qquad (2.2)$$
where $W \in \mathbb{R}^{d}$ is the filter parameter vector and $\sigma$ is the activation function. Some commonly used graph filters include a linear function of $L$ Xu et al. (2018) or a Chebyshev polynomial of $L$ Defferrard et al. (2016).
In the context of ego-graphs, each node contains the complete information needed for computing the output of a single-layer GCN model. Given node $v_i$ with feature $x_i$, let $\mathcal{N}(i)$ denote the set of neighbor indices within at most $k$-hop distance, which is completely determined by the graph filter $g(L)$. Thus we can rewrite the predictor (2.2) for a single node prediction as
$$f_{W}(x_i) = \sigma\Big(\sum_{j \in \mathcal{N}(i)} [g(L)]_{ij}\, x_j^{T} W\Big), \qquad (2.3)$$
where $[g(L)]_{ij}$ is regarded as a weighted edge between node $v_i$ and its neighbor $v_j$, and in particular $[g(L)]_{ij} \neq 0$ holds if and only if $j \in \mathcal{N}(i)$.
Let $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{+}$ be a convex loss function measuring the difference between a predictor and the true label. A variety of supervised learning problems in machine learning can then be formulated as minimization of the expected risk,
$$\mathcal{E}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f(x), y)\big], \qquad (2.4)$$
where $f$ ranges over a hypothesis space $\mathcal{F}$ in which an optimal learning rule is sought. Standard regression problems correspond to the square loss $\ell(f(x), y) = (f(x) - y)^2$, and the logistic loss is widely used for classification. The optimal decision function, denoted by $f^{*}$, is any minimizer of (2.4) when $\mathcal{F}$ is taken to be the space of all measurable functions. However, $f^{*}$ cannot be computed directly, because the distribution $\mathcal{D}$ is often unknown and the space of all measurable functions is too complex to search over. Instead, a frequently used method consists of minimizing a regularized empirical risk over a computationally feasible space $\mathcal{F}$,
$$\min_{f \in \mathcal{F}} \; \frac{1}{m}\sum_{i=1}^{m} \ell\big(f(x_i), y_i\big) + \lambda\, \Omega(f), \qquad (2.5)$$
where $\Omega(f)$ is a penalty function that regularizes the complexity of the function $f$, while $\mathcal{F}$ is the space of all candidate predictors, which needs to be parameterized explicitly or implicitly. For instance, a neural network is known to be an efficient parameterized approximation to complex nonlinear functions. In this work, all functions within $\mathcal{F}$ take the form in (2.3).
2.1 Empirical Risk Minimization with $\ell_p$-Regularizer
A fundamental class of learning algorithms can be described as regularized empirical risk minimization problems. This paper considers an $\ell_p$-regularized learning approach for training the parameters of GCN:
$$\widehat{W} \in \arg\min_{W \in \mathbb{R}^{d}} \; \Big\{ \frac{1}{m}\sum_{i=1}^{m} \ell\big(f_{W}(x_i), y_i\big) + \lambda \|W\|_{p}^{p} \Big\}, \qquad (2.6)$$
where $p \in (1, 2]$ and the $\ell_p$-norm on $\mathbb{R}^{d}$ is defined as $\|W\|_p = \big(\sum_{j=1}^{d} |W_j|^{p}\big)^{1/p}$. For any $p \in (1, 2]$, there always holds $\|W\|_p \ge \|W\|_2$, which means that any learning method with the $\ell_p$-regularizer imposes heavier penalties on the parameters than the one with the $\ell_2$-regularizer. Specially, when $p$ is close to $1$, the corresponding $\ell_p$-regularized algorithm tends to generate so-called soft sparse solutions Koltchinskii (2009). In contrast, the commonly used $\ell_2$-regularization tends to generate smooth but non-sparse solutions.
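For concreteness, a minimal sketch of the $\ell_p$-regularized empirical risk in (2.6) for the single-layer GCN predictor (2.3) is given below; the square loss and the tanh activation are illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np

def lp_regularized_risk(W, gL, X, y, labeled_idx, lam, p):
    """(1/m) * sum_i loss(f_W(x_i), y_i) + lam * ||W||_p^p,
    where f_W(x_i) = sigma(sum_j [g(L)]_ij x_j^T W)."""
    preds = np.tanh(gL[labeled_idx] @ X @ W)        # node-level predictions
    losses = 0.5 * (preds - y[labeled_idx]) ** 2    # square loss (assumed)
    penalty = np.sum(np.abs(W) ** p)                # ||W||_p^p
    return losses.mean() + lam * penalty
```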
Note that we do not require the minimizer of (2.6) to be unique, catering to non-convex problems. The following lemma tells us that any global minimizer of (2.6) can be upper bounded by a quantity that is inversely proportional to $\lambda$. This simple conclusion is useful in subsequent sections for designing a constrained SGD and for our theoretical analysis of the regularization function.
Lemma 1.
Assume that $\sup_{y \in \mathcal{Y}} \ell(0, y) \le c_0$ with some constant $c_0 > 0$. For any $\lambda > 0$, any global minimizer $\widehat{W}$ of (2.6) satisfies $\|\widehat{W}\|_p^p \le c_0/\lambda$, and furthermore $\|\widehat{W}\|_2 \le \|\widehat{W}\|_p \le (c_0/\lambda)^{1/p}$.
3 Regularized Stochastic Algorithm
In order to effectively solve non-convex problems with massive data, practical algorithms for machine learning are increasingly constrained to spend less time and use less memory, while also being able to escape from the saddle points that often appear in non-convex problems and to converge to a good stationary point. Stochastic gradient descent (SGD) is perhaps the simplest and most well-studied algorithm that enjoys these advantages. The merits of SGD for large-scale learning and the associated computation-versus-statistics trade-offs are discussed in detail in the seminal work of Bottou and Bousquet (2007).
A standard assumption for analyzing SGD in the literature is that the derivative of the objective function is Lipschitz smooth Verma and Zhang (2019); Hardt et al. (2016); however, $\ell_p$-regularized learning does not meet such a condition. To address this issue, we propose a new SGD method for (2.6) with an inexact proximal operator, and then develop a novel theoretical analysis for an upper bound on its uniform stability Bousquet and Elisseeff (2002), an algorithm-dependent, sensitivity-based measure used for characterizing generalization performance in learning theory.
Given a positive pair $(p, q)$ satisfying the equality $\frac{1}{p} + \frac{1}{q} = 1$, the norms $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual to each other. Moreover, the pair of functions $\frac{1}{p}\|\cdot\|_p^p$ and $\frac{1}{q}\|\cdot\|_q^q$ are conjugate functions of each other. As a consequence, their gradient mappings form a pair of inverse mappings. Formally, let $q = p/(p-1)$, and define the mapping $\nabla_p : \mathbb{R}^{d} \to \mathbb{R}^{d}$ componentwise with
$$\big(\nabla_p(W)\big)_j = \mathrm{sign}(W_j)\, |W_j|^{p-1},$$
and the inverse mapping $\nabla_q : \mathbb{R}^{d} \to \mathbb{R}^{d}$ with
$$\big(\nabla_q(\Theta)\big)_j = \mathrm{sign}(\Theta_j)\, |\Theta_j|^{q-1}.$$
The above conjugate property between the $\ell_p$-space and the $\ell_q$-space is very useful for bounding uniform stability without the help of strong smoothness, the latter being a standard assumption in optimization.
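As a quick numerical sanity check of this duality (using the componentwise maps written above, which are our reconstruction of the elided formulas), the following snippet verifies that the two gradient mappings are inverse to each other, since $(p-1)(q-1) = 1$.

```python
import numpy as np

def grad_map(w, p):
    """Componentwise gradient of (1/p) * sum_j |w_j|^p."""
    return np.sign(w) * np.abs(w) ** (p - 1)

p = 1.5
q = p / (p - 1.0)                 # dual exponent, 1/p + 1/q = 1
w = np.array([0.3, -1.2, 2.0, 0.0])
assert np.allclose(grad_map(grad_map(w, p), q), w)   # composition is the identity
```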
3.1 SGD with Inexact Proximal Operator
For a sample $z = (x, y)$, we write $\mathcal{E}_{z}(W) := \ell(f_{W}(x), y)$ for notational simplicity, and define a local quadratic approximation of $\mathcal{E}_{z}$ at a point $W_t$ as
$$Q_t(W; z) = \mathcal{E}_{z}(W_t) + \big\langle \nabla_{W}\mathcal{E}_{z}(W_t),\, W - W_t \big\rangle + \frac{1}{2\eta_t}\,\|W - W_t\|_2^{2}.$$
At each iteration $t$, let $i_t$ be a random index sampled uniformly from $\{1, \ldots, m\}$. Then, replacing the empirical risk term in (2.6) by the quadratic term $Q_t(W; z_{i_t})$, we propose an inexact proximal method for SGD with the $\ell_p$-regularizer. Precisely, we are concerned with the following iterative update for the regularized SGD,
$$W_{t+1} = \arg\min_{W \in \mathbb{R}^{d}} \; \Big\{ Q_t(W; z_{i_t}) + \lambda \|W\|_p^{p} \Big\}. \qquad (3.1)$$
In view of the boundedness of the minimizer given in Lemma 1, we adopt a projection technique to execute a constrained SGD over the empirical risk term in (3.1). To this end, we define the projection onto a set $\Omega \subseteq \mathbb{R}^{d}$ by
$$\Pi_{\Omega}(W) = \arg\min_{W' \in \Omega} \|W' - W\|_2.$$
This reveals that the definition of the projection is an optimization problem in itself. In our case, the set we adopt is given as
$$\Omega = \big\{W \in \mathbb{R}^{d} : \|W\|_2 \le r_{\lambda}\big\},$$
where $r_{\lambda}$ denotes the upper bound on $\|\widehat{W}\|_2$ provided by Lemma 1.
It is well known that, if $\Omega = \{W : \|W\|_2 \le 1\}$, i.e., the unit ball, then the projection is equivalent to a normalization step
$$\Pi_{\Omega}(W) = \frac{W}{\max\{1, \|W\|_2\}}.$$
Up to terms that do not depend on $W$, combining the objective function in (3.1) with the projection formulates our proposed algorithm as follows:
$$W_{t+\frac{1}{2}} = \Pi_{\Omega}\Big(W_t - \eta_t\, \nabla_{W}\,\ell\big(f_{W_t}(x_{i_t}), y_{i_t}\big)\Big), \qquad (3.2)$$
$$W_{t+1} = \arg\min_{W \in \mathbb{R}^{d}} \Big\{ \frac{1}{2\eta_t}\,\big\|W - W_{t+\frac{1}{2}}\big\|_2^{2} + \lambda \|W\|_p^{p} \Big\}, \qquad (3.3)$$
where $\eta_t$ is the learning rate, which depends on $t$, and the regularization parameter $\lambda$ may depend on $m$ and $d$.
Remark 1.
The update rule in (3.3) can be seen as a contraction of conventional SGD; see Lemma 2 below for details. We can obtain analytical solutions of (3.3) for some specific $p$ (e.g., $p = 2$). Although there is no analytical solution for a general $p \in (1, 2)$, the objective function in (3.3) is strongly convex over bounded domains, and thus global convergence of its numerical solution can be guaranteed. For ease of notation, we still denote by $W_{t+1}$ the realized numerical solution, ignoring the inner optimization error.
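The sketch below illustrates one possible realization of a single update of (3.2)-(3.3) as reconstructed above, not the authors' exact implementation: a projected stochastic gradient step on the sampled node, followed by a coordinate-wise proximal step for the separable $\ell_p$ penalty solved numerically by bisection (square loss and tanh activation are assumptions for illustration).

```python
import numpy as np

def project_l2_ball(w, radius):
    """Projection onto {w : ||w||_2 <= radius}, i.e. a normalization step."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def prox_lp_scalar(v, tau, p, tol=1e-10):
    """Solve min_x 0.5*(x - v)^2 + tau*|x|^p for p in (1, 2] by bisection.
    The minimizer has the same sign as v and lies in [0, |v|]."""
    if v == 0.0:
        return 0.0
    lo, hi = 0.0, abs(v)
    # Stationarity on x >= 0: x - |v| + tau*p*x**(p-1) = 0 (increasing in x).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - abs(v) + tau * p * mid ** (p - 1) > 0:
            hi = mid
        else:
            lo = mid
    return np.sign(v) * 0.5 * (lo + hi)

def inexact_prox_sgd_step(w, x_agg, y, eta, lam, p, radius):
    """One update of (3.2)-(3.3) for the single-layer GCN predictor (2.3),
    where x_agg = sum_j [g(L)]_ij x_j is the filtered feature of the sampled node."""
    u = float(x_agg @ w)
    grad = (np.tanh(u) - y) * (1.0 - np.tanh(u) ** 2) * x_agg     # chain rule, square loss
    w_half = project_l2_ball(w - eta * grad, radius)              # step (3.2)
    tau = eta * lam
    return np.array([prox_lp_scalar(v, tau, p) for v in w_half])  # step (3.3)
```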
Lemma 2.
For $p \in (1, 2]$, $\tau > 0$, and a given vector $v \in \mathbb{R}^{d}$, we define
$$W^{*} = \arg\min_{W \in \mathbb{R}^{d}} \Big\{ \frac{1}{2}\|W - v\|_2^{2} + \tau \|W\|_p^{p} \Big\}. \qquad (3.4)$$
Then, we conclude that $\|W^{*}\|_2 \le \|v\|_2$.
We defer the proof of Lemma 2 to the Appendix. Applying Lemma 2 and the projection onto $\Omega$ in (3.2), we have the following inequality
$$\|W_{t+1}\|_2 \le \big\|W_{t+\frac{1}{2}}\big\|_2 \le r_{\lambda}. \qquad (3.5)$$
Consider the function $h$ defined as $h(x) = |x|^{p}$ for $x \in \mathbb{R}$. Note that it holds that $h'(x) = p\,\mathrm{sign}(x)\,|x|^{p-1}$. Obviously the first derivative of $h$ exists and is continuous for all $x \in \mathbb{R}$ when $p \in (1, 2]$. However, we notice that the inexact proximal operator is not Lipschitz, due to the fact that $h$ is not strongly smooth. Hence, as mentioned earlier, the conventional techniques that analyze objective functions under a strong smoothness condition are no longer valid in our case. Fortunately, the function $h$ is strongly convex over bounded domains, such as the one guaranteed by Lemma 1, which enables us to avoid restrictive smoothness assumptions through an alternative proof strategy.
4 Stability and Generalization Bounds
In this section, we provide algorithm-dependent generalization bounds via the notion of stability for the $\ell_p$-regularized GCN. To this end, we first introduce the notion of algorithmic stability and then present a generalization bound associated with algorithmic stability.
Let $A_S$ be a learning algorithm trained on a dataset $S$, which can be viewed as a map from $(\mathcal{X} \times \mathcal{Y})^{m}$ to the hypothesis space. For GCN, we set $A_S = f_{W_S}$, where $W_S$ denotes the weights learned on $S$. The overall learning performance of $A_S$ is measured by the following expected risk:
$$R(A_S) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell\big(A_S(x), y\big)\big].$$
Accordingly, the empirical risk of $A_S$ with the loss $\ell$ is given as
$$R_{\mathrm{emp}}(A_S) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(A_S(x_i), y_i\big).$$
Even when the sample $S$ is fixed, $A_S$ may still be a randomized algorithm due to the randomness of the algorithmic procedure (e.g., SGD). In this context, we define the expected generalization error as
$$\mathbb{E}_{A}\big[R(A_S) - R_{\mathrm{emp}}(A_S)\big],$$
where the expectation is taken over the inherent randomness of the algorithm $A$.
For a randomized algorithm, to introduce the notion of its uniform stability, we need to define two related datasets as follows. Given the training set $S = \{z_1, \ldots, z_m\}$ defined as above, removing the $i$-th data point from the set $S$ is represented as
$$S^{\setminus i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\},$$
and replacing the $i$-th data point in $S$ by $z_i'$ is represented as
$$S^{i} = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\}.$$
Definition 1.
A randomized learning algorithm $A_S$ is $\beta_m$-uniformly stable with respect to a loss function $\ell$ if it satisfies, for all $i \in \{1, \ldots, m\}$,
$$\sup_{S,\, z = (x, y)} \Big| \mathbb{E}_{A}\big[\ell\big(A_S(x), y\big)\big] - \mathbb{E}_{A}\big[\ell\big(A_{S^{\setminus i}}(x), y\big)\big] \Big| \le \beta_m.$$
By the triangle inequality, the following bound on another notion of uniform stability, associated with the replaced set $S^{i}$, also holds:
$$\sup_{S,\, z = (x, y)} \Big| \mathbb{E}_{A}\big[\ell\big(A_S(x), y\big)\big] - \mathbb{E}_{A}\big[\ell\big(A_{S^{i}}(x), y\big)\big] \Big| \le 2\beta_m.$$
Stability is a property of a learning algorithm: roughly speaking, if two training sets are close to each other, a stable algorithm will generate close outputs. There are many variants of stability, such as hypothesis stability Kearns and Ron (1999), sample-average stability Shalev-Shwartz et al. (2010), and uniform stability. This paper focuses on uniform stability, since it is closely related to the other types of stability.
The following lemma shows that a randomized learning algorithm with uniform stability enjoys a meaningful generalization bound, which has been proved in Verma and Zhang (2019).
Lemma 3.
A $\beta_m$-uniformly stable randomized algorithm $A_S$ with a bounded loss function $0 \le \ell(\cdot, \cdot) \le M$ satisfies the following generalization bound with probability at least $1 - \delta$, over the random draw of $S$, with $\delta \in (0, 1)$:
$$\mathbb{E}_{A}\big[R(A_S)\big] \le \mathbb{E}_{A}\big[R_{\mathrm{emp}}(A_S)\big] + 2\beta_m + \big(4 m \beta_m + M\big)\sqrt{\frac{\log(1/\delta)}{2m}}.$$
We now give some smoothness assumptions on the loss function and the activation function, which are used for analyzing the stability of stochastic gradient descent. The following assumptions are standard in the optimization literature.
Assumption 1 (Smoothness for loss function and activation function).
We assume that the loss function $\ell(\cdot, y)$ is Lipschitz continuous and smooth, i.e., for all $y \in \mathcal{Y}$ and all $a, b \in \mathbb{R}$,
$$|\ell(a, y) - \ell(b, y)| \le \alpha_{\ell}\, |a - b| \quad \text{and} \quad |\ell'(a, y) - \ell'(b, y)| \le \nu_{\ell}\, |a - b|,$$
where $\alpha_{\ell}$ and $\nu_{\ell}$ are two positive constants. Similarly, the activation function $\sigma$ also satisfies
$$|\sigma(a) - \sigma(b)| \le \alpha_{\sigma}\, |a - b| \quad \text{and} \quad |\sigma'(a) - \sigma'(b)| \le \nu_{\sigma}\, |a - b|,$$
where $\alpha_{\sigma}$ and $\nu_{\sigma}$ are positive constants as well.
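As a quick illustration that common choices meet these conditions (the constant names $\alpha_{\sigma}, \nu_{\sigma}$ above are our notation), the snippet below numerically bounds the first and second derivatives of the tanh activation, giving a Lipschitz constant of about $1$ and a smoothness constant of about $0.77$.

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)^2 and tanh''(x) = -2*tanh(x)*(1 - tanh(x)^2).
x = np.linspace(-10.0, 10.0, 200001)
t = np.tanh(x)
first_derivative = 1.0 - t ** 2
second_derivative = -2.0 * t * (1.0 - t ** 2)

print(np.max(np.abs(first_derivative)))   # ~1.0  -> Lipschitz constant of tanh
print(np.max(np.abs(second_derivative)))  # ~0.77 -> smoothness constant of tanh'
```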
We now present an explicit stability bound for GCN with the $\ell_p$-regularizer trained by the stochastic gradient method.
Theorem 1.
The proof of Theorem 1 is given in the Appendix. The key step of our proof is to show that, in this scenario, the error caused by the one-point difference in the non-convex empirical risk of GCN grows polynomially with the number of iterations.
Remark 2.
Theorem 1 provides the uniform stability of the last iterate of SGD for the $\ell_p$-regularized GCN. It is worth mentioning that this stability upper bound depends on the graph structure (through the largest absolute eigenvalue of the graph filter) and the regularization hyper-parameter $\lambda$, as well as the sample size $m$ and the learning rate $\eta_t$ of SGD.
Remark 3.
More precisely, the result in Theorem 1 shows that the stability bound decreases inversely with the sample size $m$ and the regularization parameter $\lambda$. It increases as the graph-structure parameter $\lambda^{\max}_{G}$ increases.
Combining Theorem 1 with Lemma 3 above, we obtain the generalization bound via uniform stability for a single-layer GCN with the $\ell_p$-regularizer.
Theorem 2.
Under the same conditions as in Theorem 1, with high probability the following generalization bound holds:
Remark 4.
Based on the result of Theorem 2, we conclude that $\ell_p$-regularization for $p \in (1, 2]$ generalizes. Note that, when $p = 1$, the stability bound breaks down due to the sparsity of $\ell_1$-regularization. The smaller $p$ becomes, the larger the stability parameter $\beta_m$ is, but at the same time the parameter obtained from (2.6) tends to be sparser, as shown in Koltchinskii (2009). These properties are also verified in our experimental evaluation.
Remark 5.
Although a small learning rate implies a smaller generalization gap, the parameter range that SGD can then explore is very small, resulting in a larger training error. Such a conclusion also applies to SGD for general models.
Table 1: Sparsity ratios of the $\ell_p$-based method for different graph filters and values of $p$.

Dataset | Graph Filter |  |  |  |  |  |  
---|---|---|---|---|---|---|---
Citeseer | Augmented Normalized |  |  |  |  |  |  
Citeseer | Normalized |  |  |  |  |  |  
Citeseer | Random Walk |  |  |  |  |  |  
Citeseer | Unnormalized |  |  |  |  |  |  
Cora | Augmented Normalized |  |  |  |  |  |  
Cora | Normalized |  |  |  |  |  |  
Cora | Random Walk |  |  |  |  |  |  
Cora | Unnormalized |  |  |  |  |  |  
Pubmed | Augmented Normalized |  |  |  |  |  |  
Pubmed | Normalized |  |  |  |  |  |  
Pubmed | Random Walk |  |  |  |  |  |  
Pubmed | Unnormalized |  |  |  |  |  |  
5 Experiments
In this section, we conduct experiments to validate our theoretical findings. In particular, we show that there exists a stability-sparsity trade-off with varying $p$, and that the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter. To do so, we first introduce the experimental settings. Then we assess the uniform stability of GCN on semi-supervised learning tasks under varying $p$. Finally, we evaluate the effect of different graph filters on the GCN stability bounds.
Figure 1: Generalization gap and parameter distance of GCN models under varying $p$.
5.1 Experimental Setup
Datasets.
We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. (2008). In every dataset, each document is represented as a sparse bag-of-words feature vector. The relationships between documents consist of a list of citation links, which can be treated as undirected edges and help construct the adjacency matrix. These documents are divided into different classes and have corresponding class labels.
Baselines.
We implement several empirical experiments with a representative GCN model Kipf and Welling (2016a). For all datasets, we use 2-layer neural networks with 16 hidden units. In all cases, we evaluate the difference between the learned weight parameters and the generalization gap of two GCN models trained on datasets $S$ and $S^{i}$, respectively. Specifically, we generate $S^{i}$ by choosing a random data point in the training set and replacing it with a different random point. We also record the gap between training and testing errors and the parameter distance $\|W_S - W_{S^{i}}\|_2$, where $W_S$ and $W_{S^{i}}$ are the weight parameters of the respective models, per epoch.
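A sketch of how these two quantities could be tracked per epoch is shown below; the dataset attributes (`train_idx`, `test_idx`, `x`, `y`, `adj`) and the model call signature are hypothetical names, not the paper's code.

```python
import copy
import numpy as np
import torch

def perturb_one_point(data, rng):
    """Return a copy of the dataset S where one random training node's feature
    and label are replaced by those of another random node (the set S^i)."""
    data_i = copy.deepcopy(data)
    i, j = rng.choice(np.asarray(data_i.train_idx), size=2, replace=False)
    data_i.x[i] = data_i.x[j].clone()
    data_i.y[i] = data_i.y[j].clone()
    return data_i

def evaluate(model, data, split):
    """Classification error on the given split (data layout is assumed)."""
    idx = data.train_idx if split == "train" else data.test_idx
    with torch.no_grad():
        pred = model(data.x, data.adj).argmax(dim=1)
    return float((pred[idx] != data.y[idx]).float().mean())

def generalization_gap(model, data):
    """Absolute difference between training and testing errors."""
    return abs(evaluate(model, data, "train") - evaluate(model, data, "test"))

def parameter_distance(model_S, model_Si):
    """||W_S - W_{S^i}||_2 over all parameters of the two models."""
    return float(torch.sqrt(sum(
        torch.sum((p - q) ** 2)
        for p, q in zip(model_S.parameters(), model_Si.parameters()))))
```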
Training settings.
For each experiment, we initialize the parameters of the GCN models with the same random seeds and then train all models for a maximum of 200 epochs using the proposed Inexact Proximal SGD. We repeat the experiments 10 times and report the average performance as well as the standard deviation. For all methods, the hyperparameters are chosen from the following search space: (1) learning rate: {1, 0.5, 0.1, 0.05}; (2) weight decay: 0; (3) dropout rate; (4) the regularization parameter $\lambda$ is set to 0.001.
5.2 The Effect of Varying $p$
Generalization gap.
In this experiment, we empirically compare the effect of varying $p$ on the GCN stability bounds. We quantitatively measure the generalization gap, defined as the absolute difference between the training and testing errors, and the difference between the learned weight parameters of GCN models trained on $S$ and $S^{i}$, respectively. As observed in Figure 1, these empirical observations are in line with our stability bounds (see also the results in Figure 3 in the Appendix).
Sparsity.
Besides convergence, we are particularly concerned with the sparsity of the solutions. The sparsity ratios for the $\ell_p$-based method are summarized in Table 1. Observe that $\ell_p$-regularized learning with $p$ close to $1$ identifies most of the sparsity pattern but behaves much worse in generalization.
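A sparsity ratio of this kind can be measured, for example, as the fraction of learned weight entries whose magnitude falls below a small threshold; the threshold value below is an assumption for illustration, not the one used in the paper.

```python
import numpy as np

def sparsity_ratio(W, threshold=1e-3):
    """Fraction of weight entries with magnitude below the threshold."""
    W = np.asarray(W).ravel()
    return float(np.mean(np.abs(W) < threshold))

# Example: weights learned with p close to 1 tend to have many near-zero entries.
print(sparsity_ratio(np.array([0.0004, -0.8, 0.0, 0.3, 1e-5])))  # 0.6
```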
5.3 The Effect of Graph Filters
Different normalization steps.
In this experiment, we mainly consider the implications of our results for the following designs of graph convolution filters: (1) Unnormalized Graph Filters; (2) Normalized Graph Filters; (3) Random Walk Graph Filters; (4) Augmented Normalized Graph Filters.
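A sketch of one standard way to construct these four families of filters from the adjacency matrix $A$ is given below; the exact formulas used in the paper are not reproduced here, so the normalization conventions (e.g., whether self-loops are added) are assumptions.

```python
import numpy as np

def graph_filters(A):
    """Four common graph convolution filters built from the adjacency matrix A."""
    N = A.shape[0]
    I = np.eye(N)
    d = A.sum(axis=1)
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_aug = A + I                                     # add self-loops
    d_aug = A_aug.sum(axis=1)
    D_aug_inv_sqrt = np.diag(1.0 / np.sqrt(d_aug))
    return {
        "unnormalized": A,                                                # (1)
        "normalized": D_inv_sqrt @ A @ D_inv_sqrt,                        # (2)
        "random_walk": D_inv @ A,                                         # (3)
        "augmented_normalized": D_aug_inv_sqrt @ A_aug @ D_aug_inv_sqrt,  # (4)
    }
```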
In this experiment, we quantitatively measure the generalization gap and parameter distance per epoch. From Figure 2, it is clear that the Unnormalized Graph Filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets. Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap and parameter distance become constant after a certain number of iterations. More results can be found in the supplementary material.
6 Conclusion
In this paper, we present an explicit theoretical understanding of stochastic learning for GCN with the $\ell_p$-regularizer and analyze the stability of our regularized stochastic algorithm. In particular, our derived bounds show that the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter, and that there exists a generalization-sparsity trade-off with varying $p$. It is worth noting that previous stability-based generalization analyses assumed that the objective function is twice differentiable, which is no longer applicable to our $\ell_p$-regularized learning scheme. To address this issue, we propose a new proximal SGD algorithm for GCNs with an inexact operator, which exhibits comparable empirical performance.
Acknowledgements
Lv’s research is supported by the “QingLan” Project of Jiangsu Universities. Ming Li would like to thank the support from the National Natural Science Foundation of China (No. U21A20473, No. 62172370) and Zhejiang Provincial Natural Science Foundation (No. LY22F020004).
Contribution Statement
Linsen We and Ming Li contributed equally to this work.
References
- Barrett et al. [2018] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 511–520. PMLR, 2018.
- Battaglia et al. [2016] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.
- Bottou and Bousquet [2007] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2007.
- Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
- Bruna et al. [2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
- Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, volume 29, pages 3844–3852, 2016.
- Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, volume 28, pages 2224–2232, 2015.
- Garg et al. [2020] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.
- Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
- Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035, 2017.
- Hardt et al. [2016] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
- Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pages 2323–2332. PMLR, 2018.
- Kearns and Ron [1999] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
- Kipf and Welling [2016a] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Kipf and Welling [2016b] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
- Koltchinskii [2009] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. In Annales de l’IHP Probabilités et statistiques, volume 45, pages 7–57, 2009.
- Liao et al. [2020] Renjie Liao, Raquel Urtasun, and Richard Zemel. A pac-bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2020.
- Liu et al. [2021] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, pages 6837–6849. PMLR, 2021.
- Ma et al. [2021] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
- Nie et al. [2011] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Unsupervised and semi-supervised learning via 1-norm graph. In 2011 International Conference on Computer Vision, pages 2268–2273. IEEE, 2011.
- Oono and Suzuki [2020] Kenta Oono and Taiji Suzuki. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. Advances in Neural Information Processing Systems, 33:18917–18930, 2020.
- Schlichtkrull et al. [2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018.
- Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
- Shalev-Shwartz et al. [2010] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
- Tibshirani [2014] Ryan J Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
- Veličković et al. [2017] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- Verma and Zhang [2019] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.
- Wang et al. [2015] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan Tibshirani. Trend filtering on graphs. In Artificial Intelligence and Statistics, pages 1042–1050. PMLR, 2015.
- Wibisono et al. [2009] Andre Wibisono, Lorenzo Rosasco, and Tomaso Poggio. Sufficient conditions for uniform stability of regularization algorithms. Computer Science and Artificial Intelligence Laboratory Technical Report, MIT-CSAIL-TR-2009-060, 2009.
- Xu et al. [2011] Huan Xu, Constantine Caramanis, and Shie Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on Pattern Analysis and Machine Intelligence, 34(1):187–193, 2011.
- Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
- Yun et al. [2019] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. Advances in Neural Information Processing Systems, 32:11983–11993, 2019.
- Zhang et al. [2020] Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, and Le Song. Efficient probabilistic logic reasoning with graph neural networks. arXiv preprint arXiv:2001.11850, 2020.
The appendices are structured as follows. In Appendix A.1, we provide the proof of Theorem 1. The details of the proof of Lemma 2 are sketched in Appendix B. In Appendix C, we provide additional experiments and detailed setups.
Appendix A Proof of Theorem
A.1 Proof Sketch
We describe the main procedure by training a GCN using SGD on the datasets $S$ and $S^{i}$, which differ in one data point. Let $(z_{i_1}, \ldots, z_{i_T})$ be a sequence of samples, where $z_{i_t}$ is an i.i.d. sample drawn from $S$ at the $t$-th iteration of SGD during a training run of the GCN. Training the same GCN using SGD on $S^{i}$ means that we feed the same sequence to the GCN, except that if $z_{i_t} = z_i$ occurs for some $t$, we replace it with $z_i'$. We denote this sample sequence by $(z'_{i_1}, \ldots, z'_{i_T})$. Let $\{W_t\}_{t=1}^{T}$ and $\{W'_t\}_{t=1}^{T}$ be the corresponding sequences of weight parameters learned by running SGD on $S$ and $S^{i}$, respectively.
Given the two sequences of weight parameters $\{W_t\}$ and $\{W'_t\}$, at each iteration we define the difference of these two sequences by $\Delta W_t := W_t - W'_t$.
We now relate $\mathbb{E}_{A}\|\Delta W_T\|_2$ to the stability parameter $\beta_m$ given in Definition 1. Since the loss $\ell$ is Lipschitz continuous by Assumption 1, for any $z = (x_i, y_i)$ we have
$$\Big| \mathbb{E}_{A}\big[\ell\big(f_{W_T}(x_i), y_i\big)\big] - \mathbb{E}_{A}\big[\ell\big(f_{W'_T}(x_i), y_i\big)\big] \Big| \le \alpha_{\ell}\, \mathbb{E}_{A}\big|f_{W_T}(x_i) - f_{W'_T}(x_i)\big| \le \alpha_{\ell}\alpha_{\sigma}\, \mathbb{E}_{A}\Big|\sum_{j \in \mathcal{N}(i)} [g(L)]_{ij}\, x_j^{T} \Delta W_T\Big| \le \alpha_{\ell}\alpha_{\sigma}\, \Big\|\sum_{j \in \mathcal{N}(i)} [g(L)]_{ij}\, x_j\Big\|_2\, \mathbb{E}_{A}\big\|\Delta W_T\big\|_2, \qquad \text{(A.1)}$$
where the second inequality follows from the Lipschitz property of $\sigma$, and the last one is based on the Cauchy-Schwarz inequality.
A.2 Bounding $\mathbb{E}_{A}\|\Delta W_T\|_2$
This subsection is devoted to bounding $\mathbb{E}_{A}\|\Delta W_T\|_2$, where the expectation is over the randomness inherent in SGD. This will be proved through a series of five lemmas. Different from existing related work Verma and Zhang [2019]; Hardt et al. [2016], the strong convexity of $|\cdot|^{p}$ over bounded domains plays an important role in our desired results. It should be pointed out that, since the first derivative of our objective function is not Lipschitz, the standard technique for proving the stability of SGD Hardt et al. [2016] is no longer applicable to our setting.
Proposition 1.
Suppose that the loss and activation functions are Lipschitz-continuous and smooth (Assumption 1). Then a single-layer GCN model, trained by the proposed algorithm (3.2)-(3.3), satisfies at step $t$
(A.2)
At step $t$, $z_{i_t}$ is the sample drawn from $S$ and $z'_{i_t}$ is the sample drawn from $S^{i}$. Here the constants are defined as in Theorem 1.
Proof of Proposition 1.
For every $p \in (1, 2]$, we know the following equality from (4) in Wibisono et al. [2009]:
(A.3)
where the choice of the constants in our case will be given later. The equality (A.3) shows that $|\cdot|^{p}$ is strongly convex over bounded domains, even though its first derivative is not Lipschitz smooth.
In particular, we can set the two arguments to be the $j$-th components of $W_t$ and of $W'_t$, respectively. Then we obtain
(A.4)
where . As stated in (3.5), we can derive a bound as follows
Furthermore, since , this follows that
Substituting this bound into (A.4) and then summing over $j = 1, \ldots, d$, we obtain
(A.5)
Lemma 4.
For any , we have
Combining the inequality above with (A.2), we get
which immediately yields our desired result in Proposition 1.
Since $z_{i_t}$ and $z'_{i_t}$ are two random samples from the different datasets $S$ and $S^{i}$, respectively, we need to consider one of the following two scenarios, one of which must occur at iteration $t$.
(Scenario 1) At step $t$, SGD picks a sample which is identical in $S$ and $S^{i}$, occurring with probability $1 - 1/m$.
(Scenario 2) At step $t$, SGD picks the only sample in which $S$ and $S^{i}$ differ, i.e., $z_i$ versus $z_i'$, which occurs with probability $1/m$.
Consider Scenario 1 in Lemma 5 below. In this case, we will show that the gradient difference can be bounded by terms involving $\|\Delta W_t\|_2$.
Lemma 5.
Under Assumption 1, for Scenario 1 there holds
Unlike Scenario 1 treated in Lemma 5, the following lemma gives a rougher bound on the gradient difference under Scenario 2.
Lemma 6.
Under Assumption 1, for Scenario 2 there holds
Substituting the results derived in Lemma 5 and Lemma 6 into Proposition 1, and taking the expectation over all possible sample sequences from $S$ and $S^{i}$, we have the following lemma.
Lemma 7.
Proof of Lemma 7.
From (A.2) in Proposition 1, together with the results derived in Lemma 5 and Lemma 6, $\mathbb{E}_{A}\|\Delta W_{t+1}\|_2$ can be upper bounded by
Finally, applying this inequality recursively leads to
where we use the fact that the parameter initialization is kept the same for $S$ and $S^{i}$, i.e., $W_0 = W'_0$. Thus, the proof of Lemma 7 is completed. ∎
Appendix B Proof for Lemmas
B.1 Deferred Proof for Lemma 2.
B.2 Deferred Proof for Lemma 4.
Proof for Lemma 4..
Let us introduce the notation:
and similarly
Recall that a convex function $f$ admits the following inequality: $f(b) \ge f(a) + \langle \nabla f(a),\, b - a \rangle$ for all $a, b$.
The convexity of and immediately leads to
Then, switching the role of and , we get
Summing the two previous inequalities yields
(B.2)
Now, by the definitions of and , we have
B.3 Deferred Proof for Lemma 5.
B.4 Deferred Proof for Lemma 6.
B.5 Additional Lemma
Lemma 8.
Let $G$ be an undirected graph with a weighted adjacency matrix $A$ and let $\lambda^{\max}_{G}$ be the maximum absolute eigenvalue of $A$. Let $G_i$ be the ego-graph of a node $v_i$, with corresponding maximum absolute eigenvalue $\lambda^{\max}_{G_i}$. Then the following eigenvalue bound holds for all $i \in \{1, \ldots, N\}$:
$$\lambda^{\max}_{G_i} \le \lambda^{\max}_{G}.$$
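The following small snippet gives a numerical illustration (not a proof) of this bound for 1-hop ego-graphs: the largest absolute eigenvalue of each principal submatrix induced by a node and its neighbors never exceeds that of the whole adjacency matrix.

```python
import numpy as np

def max_abs_eig(M):
    """Largest absolute eigenvalue of a symmetric matrix."""
    return float(np.max(np.abs(np.linalg.eigvalsh(M))))

def check_ego_graph_bound(A, tol=1e-9):
    """Check max|eig(A_ego)| <= max|eig(A)| for every node's 1-hop ego-graph."""
    lam_G = max_abs_eig(A)
    for i in range(A.shape[0]):
        nbrs = sorted(set(np.flatnonzero(A[i]).tolist() + [i]))
        A_ego = A[np.ix_(nbrs, nbrs)]
        assert max_abs_eig(A_ego) <= lam_G + tol
    return True

# Example: a random weighted undirected graph.
rng = np.random.default_rng(0)
B = rng.uniform(0, 1, size=(6, 6)) * (rng.uniform(0, 1, size=(6, 6)) < 0.5)
A = np.triu(B, 1) + np.triu(B, 1).T
print(check_ego_graph_bound(A))  # True
```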
Appendix C Additional Experiments and Setup Details
C.1 General Setup
We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. [2008]. In each dataset, every document is represented as a sparse bag-of-words feature vector, and the relationships between documents consist of a list of citation links, which can be treated as undirected edges and help construct the adjacency matrix. These documents are divided into different classes and have corresponding class labels. The data statistics for the datasets used in Section 5 are summarized in Table 2.
Dataset | Nodes | Edges | Classes | Features |
---|---|---|---|---|
Citeseer | ||||
Cora | ||||
Pubmed |
C.2 Additional Experimental Results
In this section, we investigate the stability and generalization of $\ell_p$-regularized stochastic learning for GCN under different normalization steps. We mainly consider the implications of our results for the following designs of graph convolution filters:
- (I) Unnormalized Graph Filters;
- (II) Normalized Graph Filters;
- (III) Random Walk Graph Filters;
- (IV) Augmented Normalized Graph Filters.
For each experiment, we initialize the parameters of the GCN models with the same random seeds and then train all models for a maximum of 200 epochs using the proposed Inexact Proximal SGD. We repeat the experiments ten times and report the average performance and the standard deviation. We quantitatively measure the generalization gap, defined as the absolute difference between the training and test errors, and the parameter distance between the parameters $W_S$ and $W_{S^{i}}$ of two models trained on two copies of the data differing by a random substitution.
Figures 3 and 4: Generalization gap and parameter distance under different graph filters and varying $p$ on the three datasets.
The Unnormalized Graph Filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets under different $p$. Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap becomes constant after a certain number of iterations.