Stability and Generalization of $\ell_{p}$ -Regularized Stochastic Learning for GCN

Shiyu Liu¹, Linsen Wei² (equal contribution), Shaogao Lv³ (corresponding author, e-mail: [email protected]), Ming Li⁴ (equal contribution)
¹University of Electronic Science and Technology of China, China
²School of Astronautics, Northwestern Polytechnical University, China
³Department of Statistics and Data Science, Nanjing Audit University, China
⁴Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, China

Abstract

Graph convolutional networks (GCNs) are among the most popular variants of graph neural networks for graph data and have shown strong performance in empirical experiments. $\ell_{2}$ -based graph smoothing enforces the global smoothness of GCN, while (soft) $\ell_{1}$ -based sparse graph learning tends to promote signal sparsity in exchange for discontinuity. This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general $\ell_{p}$ -regularized $(1<p\leq 2)$ stochastic learning scheme proposed herein. While stability-based generalization analyses have been given in prior work for objective functions with second derivatives, our $\ell_{p}$ -regularized learning scheme does not satisfy such a smoothness condition. To tackle this issue, we propose a novel SGD proximal algorithm for GCNs with an inexact operator. For a single-layer GCN, we establish an explicit theoretical understanding of GCN with the $\ell_{p}$ -regularized stochastic learning by analyzing the stability of our SGD proximal algorithm. We conduct multiple empirical experiments to validate our theoretical findings.

1 Introduction

Graph Neural Networks (GNNs) have emerged as a family of powerful model designs for improving the performance of neural network models on graph-structured data. GNNs have delivered remarkable empirical performance across a diverse set of domains, such as social networks, knowledge graphs, and biological networks Duvenaud et al. (2015); Battaglia et al. (2016); Defferrard et al. (2016); Jin et al. (2018); Barrett et al. (2018); Yun et al. (2019); Zhang et al. (2020). In fact, GNNs can be viewed as natural extensions of conventional machine learning to any data where the available structure is given by pairwise relationships.

The architecture designs of various GNNs have been motivated mainly by the spectral domain Defferrard et al. (2016); Kipf and Welling (2016a) and the spatial domain Hamilton et al. (2017); Gilmer et al. (2017). Some popular variants of graph neural networks include Graph Convolutional Network (GCN) Bruna et al. (2013), GraphSAGE Hamilton et al. (2017), Graph Attention Network Veličković et al. (2017), and Graph Isomorphism Network Xu et al. (2018), among others.

Specifically, inheriting the excellent performance of traditional convolutional neural networks in processing images and time series, a standard GCN Kipf and Welling (2016a) also consists of a cascade of layers, but operates directly on a graph and induces embedding vectors of nodes based on properties of their neighborhoods. Formally, GCN is defined as the problem of learning filter parameters in the graph Fourier transform. GCNs have shown superior performance on real datasets from various domains, such as node labeling on social networks Kipf and Welling (2016b), link prediction in knowledge graphs Schlichtkrull et al. (2018), and molecular graph classification in quantum chemistry Gilmer et al. (2017).

Notably, a recent study Ma et al. (2021) has proven that GCN, and even general message passing models, intrinsically perform $\ell_{2}$ -based graph signal smoothing, which enforces smoothness globally, so that the level of smoothness is often shared across the whole graph. As opposed to $\ell_{2}$ -based graph smoothing, $\ell_{1}$ -based methods tend to penalize large values less and thus better preserve the discontinuity of non-smooth signals Nie et al. (2011); Wang et al. (2015); Liu et al. (2021). Essentially, $\ell_{1}$ -based methods are equivalent to soft-thresholding operations for iterative estimators and guarantee statistical properties (e.g., model selection consistency). Owing to these advantages, trend filtering Tibshirani (2014) and graph trend filtering Wang et al. (2015); Verma and Zhang (2019) indicate that $\ell_{1}$ -based graph smoothing can adapt to inhomogeneous levels of smoothness of signals and yield estimators that are $k$ -th order piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator.

To enhance the local smoothness adaptivity of GCNs, a family of elastic-type GCNs combining $\ell_{2}$ - and $\ell_{1}$ -based penalties was proposed by Liu et al. (2021), who demonstrate that the elastic GCNs obtain better adaptivity on benchmark datasets and are significantly more robust to graph adversarial attacks.

Under the regularized learning framework, this paper further studies a class of $\ell_{p}$ -regularized learning approaches $(1<p\leq 2)$ , in order to trade off the local smoothness of GCNs against sparsity between nodes. To be precise, it has been shown that the extreme case of $\ell_{p}$ -regularized learning with $p\rightarrow 1$ tends to generate soft-sparse solutions Koltchinskii (2009). In analogy to elastic-type GCNs, general $\ell_{p}$ -regularization can be interpreted as an interpolation between $\ell_{1}$ -regularization and $\ell_{2}$ -regularization. However, in contrast with the elastic-type GCNs, $\ell_{p}$ -regularized GCNs involve only a single regularized parameter but lead to some additional technical difficulties in optimization and theoretical analysis.

The generalization performance of an algorithm has always been a central issue in learning theory, and in particular the generalization guarantees of GNNs have attracted a considerable amount of attention in recent years Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020).

It is worth noting that, although $\ell_{1}$ -regularization possesses a number of attractive properties such as “automatic” variable selection, the objective with $p=1$ is not a strictly convex smooth function, and it has been proved not to be uniformly stable Xu et al. (2011). On the other hand, it is known that for $p>1$ , the penalty is a strictly convex function over bounded domains and thus enjoys robustness to some extent. As a bridge between sparse $\ell_{1}$ -regularization and dense $\ell_{2}$ -regularization, $\ell_{p}$ -based learning allows us to explicitly observe a trade-off between sparsity and smoothness of the estimated learners.

Although this idea is natural within the framework of regularized learning, it still faces many technical challenges when applied to GCNs. First, to establish the uniform stability of an SGD algorithm, which in turn implies generalization performance Bousquet and Elisseeff (2002), existing analyses for SGD Verma and Zhang (2019); Hardt et al. (2016) require that the first derivative of the objective is Lipschitz continuous, which is not applicable to $\ell_{p}$ -based GCN. Second, to derive an interpretable generalization bound, it is important to know how such a result depends on the structure of the graph filter and the regularized parameter $p$ , as well as the network size and the sample size.

Overall, our principal contributions can be summarized as follows:

• We introduce $\ell_{p}$ -regularized learning approaches for one-layer GCN to provide an explicit theoretical characterization of the trade-off between local smoothness and sparsity of the SGD algorithm for GCN. Crucially, we analyze how this trade-off between the graph structure and the regularization parameter $p$ affects the generalization capacity of our SGD method.

• We propose a novel regularized stochastic algorithm for GCN, i.e., Inexact Proximal SGD, by integrating the standard SGD projection and the proximal operator. For our proposed method, we derive interpretable generalization bounds in terms of the graph structure, the regularized parameter $p$ , the network width, and the sample size.

• To our knowledge, we are the first to analyze the generalization performance of $\ell_{p}$ -SGD in GNNs, which is quite different from existing stability-based generalization analyses for objectives with second derivatives Verma and Zhang (2019); Hardt et al. (2016). We also have to overcome additional challenges in understanding the nature of the stochastic gradient for GCN, posed by the message passing nature of GCN. We conduct several numerical experiments to illustrate the superiority of our method over traditional smooth-based GCN, and we also observe some sparse solutions in our experiments when $p$ is sufficiently close to $1$ .

1.1 Additional Related Work

In this subsection, we briefly review two kinds of related works: Generalization Analysis for GNNs and Regularized Schemes on GNNs.

Generalization Analysis for GNNs.

Some previous work has attempted to address the generalization guarantees of GNNs, including Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020). However, most of these works established uniform convergence results using classical Rademacher complexity Garg et al. (2020); Oono and Suzuki (2020) and PAC bounds Liao et al. (2020). Compared to these abstract capacity notions, the stability concept, used recently for GNN in Verma and Zhang (2019), is more intuitive and directly defined over a specific algorithm. Although the stability for generalization of GNNs has been considered recently by Verma and Zhang (2019), a major difference from their work is that we focus on the trade-off between soft sparsity and generalization of the $\ell_{p}$ -based GCN with varying $p\in(1,2]$ . Moreover, we propose a new SGD algorithm for GCN, for which we provide a novel theoretical proof of the stability bound.

Regularized Schemes on GNNs.

Regularization methods are frequently used in machine learning. In particular, Wibisono et al. (2009) studied the influence of $\ell_{p}$ -regularized learning on generalization ability. Note that this previous work only considered the impact of general regularization estimates on stability, without a specific algorithm. In addition, their research is based on regular data and does not involve any graph structure. As far as we know, this paper is the first to consider $\ell_{p}$ -regularized learning in a graph model.

2 Preliminaries and Methodology

In this section, we first describe basic notation on graphs and standard versions of GCN. Then we introduce the structural empirical risk minimization for a single-layer GCN model under an i.i.d. sampling process, and thus our $\ell_{p}$ -regularized approach is naturally formulated to estimate the graph filter parameters.

A graph is represented as $G=(V,E)$ , where $V=\{\nu_{1},\nu_{2},...,\nu_{n}\}$ is a set of $n=|V|$ nodes and $E$ is a set of $|E|$ edges. The adjacency matrix of the graph is denoted by $\mathbf{A}=(a_{ij})\in\mathbb{R}^{n\times n}$ , whose entries $a_{ij}=1$ if $(\nu_{i},\nu_{j})\in E$ , and $a_{ij}=0$ otherwise. Each node’s own feature vector is denoted by $\mathbf{x}_{i}\in\mathbb{R}^{d}$ , $i\in[n]$ , where $d$ is the dimension of the node feature. Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ denote the node feature matrix with each row being $d$ features. The $1$ -hop neighborhood of a node $\nu_{i}$ is defined as the set $\{\nu_{j},(\nu_{i},\nu_{j})\in E\}$ , and denote by $N_{(i)}$ the set that includes the node $\nu_{i}$ and all nodes belonging to its $1$ -hop neighborhood. The main task of graph models is to combine the feature information and the edge information to perform learning tasks.

For an undirected graph, its Laplacian matrix $\mathbf{L}\in\mathbb{R}^{n\times n}$ is defined as $\mathbf{L}:=\mathbf{D}-\mathbf{A}$ , where $\mathbf{D}\in\mathbb{R}^{n\times n}$ is the diagonal degree matrix whose diagonal entries are $d_{ii}=\sum_{j}a_{ij}$ for $i\in[n]$ . The positive semi-definite matrix $\mathbf{L}$ has an eigen-decomposition $\mathbf{L}=\mathbf{U}\bm{\Lambda}\mathbf{U}^{T}$ , where the columns of $\mathbf{U}$ are the eigenvectors of $\mathbf{L}$ and the diagonal entries of the diagonal matrix $\bm{\Lambda}$ are the non-negative eigenvalues of $\mathbf{L}$ .

For a fixed function $g$ , we define a graph filter $g(\mathbf{L})\in\mathbb{R}^{n\times n}$ as a function of the graph Laplacian $\mathbf{L}$ . Following the eigen-decomposition of $\mathbf{L}$ , we get $g(\mathbf{L})=\mathbf{U}g(\bm{\Lambda})\mathbf{U}^{T}$ , whose eigenvalues are given by $\lambda_{i}^{(g)}=g(\lambda_{i})$ for $1\leq i\leq n$ . We define $\lambda^{max}_{G}=\max_{i}\{|\lambda_{i}^{(g)}|\}$ as the largest absolute eigenvalue of the graph filter $g(\mathbf{L})$ .
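As a minimal sketch of the definitions above, the following NumPy snippet builds $g(\mathbf{L})$ from an eigen-decomposition and reads off its largest absolute eigenvalue. The toy path graph and the choice $g(\lambda)=1-\lambda/2$ are illustrative assumptions, not filters used in the paper.

```python
import numpy as np

def graph_filter(A, g):
    """Return g(L) = U g(Lambda) U^T and its largest absolute eigenvalue."""
    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    L = D - A                           # Laplacian L = D - A
    lam, U = np.linalg.eigh(L)          # L is symmetric, so eigh applies
    g_lam = g(lam)                      # apply g to each eigenvalue
    gL = U @ np.diag(g_lam) @ U.T
    return gL, np.max(np.abs(g_lam))

# Toy path graph on 3 nodes (hypothetical example).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
# Illustrative filter g(lambda) = 1 - lambda / 2 (an assumption).
gL, lam_max = graph_filter(A, lambda lam: 1.0 - lam / 2.0)
```

For a polynomial $g$ such as this one, the result agrees with evaluating the polynomial directly on $\mathbf{L}$, which offers a quick sanity check.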

In this paper, we are concerned with node-level semi-supervised learning problems over the graph $G$ . Let $\mathcal{X}\subset\mathbb{R}^{d}$ be the input space in which the node feature is well defined, and accordingly $\mathcal{Y}\subset\mathbb{R}$ be the output space. In the semi-supervised setting, one assumes that only a portion of the training samples are labeled while large amounts of unlabeled data are easily collected. Precisely, we merely collect a training set with labels $D=\{\mathbf{z}_{i}=(\mathbf{x}_{i},y_{i})\}_{i=1}^{m}$ with $m\ll n$ . For statistical inference, one often assumes that these sample pairs are independently drawn from a joint distribution $\rho$ defined on $\mathcal{X}\times\mathcal{Y}$ . In this case, our studied model belongs to node-focused tasks on graphs, as opposed to graph-focused tasks where the whole graph can be viewed as a single sample.

The simplest graph neural network, known as the Vanilla GCN, was proposed in Kipf and Welling (2016a), where each layer of a multilayer network is multiplied by the graph filter before applying a nonlinear activation function. In matrix form, a conventional multi-layer GCN is represented by the layer-wise propagation rule

\displaystyle\mathbf{H}^{(k+1)}=\sigma\big{(}g(\mathbf{L})\mathbf{H}^{(k)}\mathbf{W}^{(k+1)}\big{)},

(2.1)

where $\mathbf{H}^{(k+1)}\in\mathbb{R}^{n\times m_{k+1}}$ is the node feature representation output by the $(k+1)$ -th GCN layer, with $\mathbf{H}^{(0)}=\mathbf{X}$ and $m_{0}=d$ . $\mathbf{W}^{(k+1)}\in\mathbb{R}^{m_{k}\times m_{k+1}}$ represents the estimated weight matrix of the $(k+1)$ -th GCN layer, and $\sigma$ is a point-wise nonlinear activation function. In the context of GCN for semi-supervised learning, the sampling procedure of nodes from the graph $G$ is conducted in two stages. We assume node data are sampled in an i.i.d. manner by first choosing a sample $\mathbf{x}_{i}$ or $\mathbf{z}_{i}$ at node $i$ , and then extracting its neighbors from $G$ to form an ego-graph.

To interpret the learning mechanism of GCN clearly, this paper focuses on a node-level task over a graph with a single-layer GCN model. In this case, putting all graph nodes together, the output function can be written in matrix form as follows,

f(\mathbf{X},\mathbf{w})=\sigma\big{(}g(\mathbf{L})\mathbf{X}\mathbf{w}\big{)},

(2.2)

where $\mathbf{w}\in\mathbb{R}^{d}$ and $f(\mathbf{X},\mathbf{w})\in\mathbb{R}^{n}$ . Some commonly used graph filters include a linear function of $\mathbf{A}$ as $g(\mathbf{A})=\mathbf{A}+\mathbf{I}$ Xu et al. (2018) or a Chebyshev polynomial of $\mathbf{L}$ Defferrard et al. (2016).

In the context of ego-graphs, each node contains the complete information needed for computing the output of a single-layer GCN model. Given a node with feature $\mathbf{x}$ , let $N_{\mathbf{x}}$ denote the index set of its neighbors within $1$ -hop distance, which is completely determined by the graph filter $g(\mathbf{L})$ . Thus we can rewrite the predictor (2.2) for a single node prediction as

f(\mathbf{x},\mathbf{w})=\sigma\big{(}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)},

(2.3)

where $e_{\cdot j}=[g(\mathbf{L})]_{\cdot j}\in\mathbb{R}$ is regarded as a weighted edge between node $\mathbf{x}$ and its neighbor $\mathbf{x}_{j}$ , and in particular $j\in N_{\mathbf{x}}$ holds if and only if $e_{\cdot j}\neq 0$ .
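The equivalence between the matrix form (2.2) and the node form (2.3) can be sketched numerically. In the snippet below, the sigmoid activation, the toy graph, and the random features are assumptions made only for illustration; the filter $g(\mathbf{A})=\mathbf{A}+\mathbf{I}$ is the linear example mentioned above.

```python
import numpy as np

def sigmoid(u):
    # Activation chosen for illustration; the paper leaves sigma generic.
    return 1.0 / (1.0 + np.exp(-u))

def gcn_matrix(gL, X, w):
    """Matrix form (2.2): sigma(g(L) X w), one output per node."""
    return sigmoid(gL @ X @ w)

def gcn_node(gL, X, w, i):
    """Node form (2.3): sum over indices j with nonzero filter weight e_ij."""
    neighbours = np.nonzero(gL[i])[0]
    agg = sum(gL[i, j] * X[j] for j in neighbours)
    return sigmoid(agg @ w)

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
gL = A + np.eye(3)                      # filter g(A) = A + I
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))             # hypothetical node features
w = rng.normal(size=4)                  # hypothetical weight vector
out_matrix = gcn_matrix(gL, X, w)
out_nodes = np.array([gcn_node(gL, X, w, i) for i in range(3)])
```

Both computations return the same vector, since row $i$ of $g(\mathbf{L})\mathbf{X}$ only involves the rows of $\mathbf{X}$ whose filter weight is nonzero.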

Let $\ell(\cdot,\cdot)$ be a convex loss function measuring the difference between a predictor and the true label. A variety of supervised learning problems in machine learning can be formulated as a minimization of the expected risk,

\min_{f\in\mathcal{F}}\mathbb{E}\big{[}\ell\big{(}Y,f(X)\big{)}\big{]},

(2.4)

where $\mathcal{F}$ is a hypothesis space under which an optimal learning rule is generated. The standard regression problems correspond to the square loss given by $\ell(u,v)=(u-v)^{2}$ , and the logistic loss is widely used for classification. The optimal decision function denoted by $f_{0}$ is any minimizer of (2.4) when $\mathcal{F}$ is taken to be the space of all measurable functions. However, $f_{0}$ cannot be computed directly, because $\rho$ is often unknown and $\mathcal{F}$ is too complex to optimize over. Instead, a frequently used method consists of minimizing a regularized empirical risk over a computationally feasible space

\displaystyle\min_{f\in\mathcal{H}}\frac{1}{m}{\sum}_{i=1}^{m}\ell\big{(}y_{i},f(\mathbf{x}_{i})\big{)}+\Omega(f),

(2.5)

where $\Omega(f)$ is a penalty function that regularizes the complexity of the function $f$ , while $\mathcal{H}$ is the space of all predictors, which needs to be parameterized explicitly or implicitly. For instance, the neural network is known as an efficient parameterized approximation to any complex nonlinear function. In this work, all the functions within $\mathcal{H}$ take the form in (2.3).

2.1 Empirical Risk Minimization with $\ell_{p}$ -Regularizer

A fundamental class of learning algorithms can be described as the regularized empirical risk minimization problems. This paper considers an $\ell_{p}$ -regularized learning approach for training the parameters of GCN:

\displaystyle\widehat{\mathbf{w}}\in\arg\min_{\mathbf{w}}\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{w})\big{)}+\lambda\|\mathbf{w}\|_{\ell_{p}}^{p},

(2.6)

where $1<p\leq 2$ and the $\ell_{p}$ -norm on $\mathbb{R}^{d}$ is defined by $\|\mathbf{w}\|_{\ell_{p}}^{p}=\sum_{j=1}^{d}|w_{j}|^{p}$ . For any $1<p^{\prime}\leq p\leq 2$ , there always holds $\|\mathbf{w}\|_{\ell_{p}}\leq\|\mathbf{w}\|_{\ell_{p^{\prime}}}$ , which means that any learning method with the $\ell_{p^{\prime}}$ -regularizer imposes heavier penalties on the parameters than the one with the $\ell_{p}$ -regularizer. In particular, when $p\rightarrow 1$ , the corresponding $\ell_{p}$ -regularized algorithm tends to generate so-called soft sparse solutions Koltchinskii (2009). In contrast, the commonly used $\ell_{2}$ -regularization tends to generate smooth but non-sparse solutions.
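The norm ordering $\|\mathbf{w}\|_{\ell_{p}}\leq\|\mathbf{w}\|_{\ell_{p^{\prime}}}$ for $p^{\prime}\leq p$ can be checked numerically. The vector below is an arbitrary illustrative choice.

```python
import numpy as np

def lp_norm(w, p):
    """The l_p norm (sum_j |w_j|^p)^(1/p)."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([0.5, -1.0, 2.0, 0.0])     # hypothetical parameter vector
norms = {p: lp_norm(w, p) for p in (1.1, 1.5, 2.0)}
```

The comparison here concerns the norms themselves. The penalty in (2.6) uses the $p$ -th power of the norm, for which the ordering can differ when entries exceed $1$ in magnitude.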

Note that we do not require the minimizer of (2.6) to be unique, which accommodates non-convex problems. The following lemma tells us that the $\ell_{p}$ -penalty of any global minimizer of (2.6) is upper bounded by a quantity that is inversely proportional to $\lambda$ . This simple conclusion is useful in subsequent sections for designing a constrained SGD and for our theoretical analysis with respect to the function $\|\mathbf{w}\|_{\ell_{p}}^{p}$ .

Lemma 1.

Assume that $\ell(y,\sigma(0))\leq B$ with some $B>0$ . For any $\lambda>0$ , any global minimizer $\widehat{\mathbf{w}}$ of (2.6) satisfies $\|\widehat{\mathbf{w}}\|_{\ell_{p}}^{p}\leq{B}/{\lambda}$ , and furthermore $\|\widehat{\mathbf{w}}\|_{\ell_{2}}\leq\big{(}{B}/{\lambda}\big{)}^{1/p}$ .

Proof of Lemma 1.

Since $\widehat{\mathbf{w}}$ is a global minimizer of the objective function in (2.6) and the loss is non-negative, it follows that

\displaystyle\lambda\|\widehat{\mathbf{w}}\|_{\ell_{p}}^{p}\leq\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{0})\big{)}+\lambda\|\mathbf{0}\|_{\ell_{p}}^{p}\leq B.

Moreover, for any $1<p\leq 2$ , it is known that $\|\mathbf{w}\|_{\ell_{2}}\leq\|\mathbf{w}\|_{\ell_{p}}$ for any $\mathbf{w}\in\mathbb{R}^{d}$ . This completes the proof of the lemma. ∎

Lemma 1 implies that the empirical solutions of (2.6) are in an $\ell_{p}$ -ball of certain radius which depends on the regularized parameter $\lambda$ . This shows that it suffices to analyze statistical behaviors of the given estimator projected into this ball.

3 Regularized Stochastic Algorithm

Solving non-convex problems with massive data requires practical machine learning algorithms that spend less time and use less memory, and that can escape the saddle points that often appear in non-convex problems and converge to a good stationary point. Stochastic gradient descent (SGD) is perhaps the simplest and most well-studied algorithm that enjoys these advantages. The merits of SGD for large-scale learning and the associated computation-versus-statistics trade-offs are discussed in detail in the seminal work of Bottou and Bousquet (2007).

A standard assumption for analyzing SGD in the literature is that the gradient of the objective function is Lipschitz continuous Verma and Zhang (2019); Hardt et al. (2016). However, $\ell_{p}$ -regularized learning does not meet this condition. To address this issue, we propose a new SGD for (2.6) with an inexact proximal operator, and then develop a novel theoretical analysis for an upper bound of uniform stability Bousquet and Elisseeff (2002), which is an algorithm-dependent, sensitivity-based measurement used for characterizing generalization performance in learning theory.

If a positive pair $(p,q)$ satisfies the equality $\nicefrac{{1}}{{p}}+\nicefrac{{1}}{{q}}=1$ , then the norms $\|\mathbf{w}\|_{p}$ and $\|\mathbf{w}\|_{q}$ are dual to each other. Moreover, the functions $(1/2)\|\mathbf{w}\|^{2}_{p}$ and $(1/2)\|\mathbf{w}\|^{2}_{q}$ are conjugate functions of each other. As a consequence, their gradient mappings are inverse mappings of each other. Formally, let $p\in(1,2]$ and $q=p/(p-1)$ , and define the mapping $\Phi:\mathrm{E}\rightarrow\mathrm{E}^{*}$ with

\Phi_{j}(\mathbf{w})=\nabla_{j}\big{(}\frac{1}{2}\|\mathbf{w}\|^{2}_{p}\big{)}=\frac{\hbox{sgn}(w_{j})|w_{j}|^{p-1}}{\|\mathbf{w}\|_{p}^{p-2}},\,j=1,2,...,d,

and the inverse mapping $\Phi^{-1}:\mathrm{E}^{*}\rightarrow\mathrm{E}$ with

\Phi^{-1}_{j}(\mathbf{v})=\nabla_{j}\big{(}\frac{1}{2}\|\mathbf{v}\|^{2}_{q}\big{)}=\frac{\hbox{sgn}(v_{j})|v_{j}|^{q-1}}{\|\mathbf{v}\|_{q}^{q-2}},\,j=1,2,...,d.

The above conjugate property on $\ell_{p}$ -space and $\ell_{q}$ -space is very useful for bounding uniform stability without the help of strong smoothness, while the latter is a standard assumption in optimization.
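The inverse relationship between $\Phi$ and $\Phi^{-1}$ can be illustrated with a short numerical check. This is a sketch for nonzero vectors only; the test vector and the value $p=1.5$ are arbitrary assumptions.

```python
import numpy as np

def phi(w, p):
    """Gradient of (1/2)||w||_p^2, evaluated at a nonzero vector w."""
    norm = np.sum(np.abs(w) ** p) ** (1.0 / p)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

p = 1.5
q = p / (p - 1)                         # conjugate exponent, 1/p + 1/q = 1
w = np.array([0.3, -1.2, 2.0])          # hypothetical nonzero vector
v = phi(w, p)                           # Phi maps into the dual space
w_back = phi(v, q)                      # Phi^{-1} is the same formula with q
norm_w_p = np.sum(np.abs(w) ** p) ** (1.0 / p)
norm_v_q = np.sum(np.abs(v) ** q) ** (1.0 / q)
```

The round trip recovers `w`, and the mapping preserves the dual norm in the sense that $\|\Phi(\mathbf{w})\|_{q}=\|\mathbf{w}\|_{p}$ .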

3.1 SGD with Inexact Proximal Operator

We write $L_{i}(\mathbf{w}):=\ell(y_{i},f(\mathbf{x}_{i},\mathbf{w}))$ for notational simplicity, and define a local quadratic approximation of $L_{i}$ at point $\mathbf{w}_{D,t}$ as

\displaystyle P_{i,r_{i}}(\mathbf{w},\mathbf{w}_{D,t}):=L_{i}(\mathbf{w}_{D,t})+\langle\mathbf{w}-\mathbf{w}_{D,t},\nabla L_{i}(\mathbf{w}_{D,t})\rangle_{2}+r_{i}\|\mathbf{w}-\mathbf{w}_{D,t}\|_{2}^{2}.

At each iteration $t$ , let $i_{t}$ be a random index sampled uniformly from $[n]$ , indexing the samples in $D$ . Replacing $\frac{1}{n}\sum_{i=1}^{n}L_{i}(\mathbf{w})$ in (2.6) by the quadratic term $P_{i_{t},r_{i_{t}}}(\mathbf{w},\mathbf{w}_{D,t})$ , we propose an inexact proximal method for SGD with the $\ell_{p}$ -regularizer. Precisely, we consider the following iteration to update $\mathbf{w}_{D,t}$ for a regularization-based SGD,

\displaystyle\min_{\mathbf{w}}P_{i_{t},r_{i_{t}}}(\mathbf{w},\mathbf{w}_{D,t})+\lambda_{t}\|\mathbf{w}\|_{\ell_{p}}^{p}.

(3.1)

In view of the boundedness of $\|\widehat{\mathbf{w}}\|_{\ell_{2}}$ given in Lemma 1, we adopt the projection technique to execute a constrained SGD over the empirical risk term in (3.1). To this end, we define the projection onto a set $\mathcal{C}$ by

\Pi_{\mathcal{C}}(\mathbf{v}):=\arg\min_{\mathbf{w}\in\mathcal{C}}\|\mathbf{w}-\mathbf{v}\|_{2}.

This reveals that the definition of projection is an optimization problem in itself. In our case, the set we adopt is given as

\mathcal{C}:=\mathcal{C}_{\lambda}=\big{\{}\mathbf{w}\in\mathbb{R}^{d},\,\,\|\mathbf{w}\|_{2}\leq(B/\lambda)^{1/p}\big{\}}.

It is well known that, if $\mathcal{C}=\mathcal{B}_{2}(1)$ , i.e., the unit $\ell_{2}$ ball, then projection is equivalent to a normalization step

\Pi_{\mathcal{B}_{2}(1)}(\mathbf{v})=\begin{cases}\mathbf{v}/\|\mathbf{v}\|_{2}&\text{if}\,\,\|\mathbf{v}\|_{2}>1,\\ \mathbf{v}&\text{otherwise}.\end{cases}
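The same normalization idea extends to a ball of general radius, which is what the set $\mathcal{C}_{\lambda}$ requires. The following is a minimal sketch; the constants $B$ , $\lambda$ and $p$ are illustrative assumptions.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(v)
    if norm > radius:
        return v * (radius / norm)      # rescale onto the boundary
    return v                            # already inside the ball

B, lam, p = 1.0, 0.1, 1.5               # illustrative constants (assumptions)
radius = (B / lam) ** (1.0 / p)         # radius of C_lambda
inside = project_l2_ball(np.array([0.1, 0.2]), radius)
outside = project_l2_ball(np.array([30.0, 40.0]), radius)
```

Points inside the ball are returned unchanged, while points outside keep their direction and are rescaled to norm `radius`.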

Up to terms that do not depend on $\mathbf{w}$ , combining the objective function in (3.1) with the projection yields our proposed algorithm:

\displaystyle\mathbf{v}_{D,i_{t}}=\Pi_{\mathcal{C}_{\lambda}}\big{(}\mathbf{w}_{D,t}-\eta\nabla L_{i_{t}}(\mathbf{w}_{D,t})\big{)},

(3.2)

\displaystyle\mathbf{w}_{D,t+1}=\arg\min_{\mathbf{w}}\big{\{}\frac{1}{2}\big{\|}\mathbf{w}-\mathbf{v}_{D,i_{t}}\big{\|}_{2}^{2}+\lambda_{t}\|\mathbf{w}\|_{\ell_{p}}^{p}\big{\}},

(3.3)

where $\eta>0$ is the learning rate that depends on $r_{i_{t}}$ , and $\lambda_{t}$ may depend on $\lambda$ and $r_{i_{t}}$ .

Remark 1.

The update rule in (3.3) can be seen as a contraction of conventional SGD; see Lemma 2 below for details. Analytical solutions of (3.3) can be obtained for some specific $p$ (e.g. $p=1,2$ ). Although there is no analytical solution for general $1<p\leq 2$ , the objective function in (3.3) is strongly convex over bounded domains and thus global convergence can be guaranteed. For ease of notation, we still denote by $\mathbf{w}_{D,t+1}$ the realized numerical solution, ignoring the inner optimization error.
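One iteration of (3.2)-(3.3) can be sketched as follows. The inner problem (3.3) separates over coordinates, so this sketch solves it approximately by bisection; that numerical solver, the function names, and the iteration count are assumptions for illustration and are not claimed to be the authors' implementation.

```python
import numpy as np

def prox_lp(v, lam, p, iters=60):
    """Approximate argmin_w 0.5*||w - v||^2 + lam*||w||_p^p by bisection.

    Each magnitude t = |w_j| solves t + lam*p*t**(p-1) = |v_j| on the
    interval [0, |v_j|], and the left-hand side is increasing in t.
    """
    a = np.abs(v)
    lo, hi = np.zeros_like(a), a.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_big = mid + lam * p * mid ** (p - 1) > a
        hi = np.where(too_big, mid, hi)
        lo = np.where(too_big, lo, mid)
    return np.sign(v) * 0.5 * (lo + hi)

def inexact_prox_sgd_step(w, grad, eta, lam_t, p, radius):
    """One iteration of (3.2)-(3.3): gradient step, projection, then prox."""
    v = w - eta * grad                   # stochastic gradient step
    norm = np.linalg.norm(v)
    if norm > radius:                    # projection onto the l2 ball C_lambda
        v = v * (radius / norm)
    return prox_lp(v, lam_t, p)          # inexact solution of (3.3)
```

For $p=2$ the inner problem has the closed form $\mathbf{v}/(1+2\lambda_{t})$ , which gives a convenient check of the bisection solver. Here `grad` stands for the stochastic gradient $\nabla L_{i_{t}}(\mathbf{w}_{D,t})$ , supplied by the caller.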

Lemma 2.

Given $1<p\leq 2$ and a vector $\mathbf{v}\in\mathbb{R}^{d}$ , we define

\displaystyle\mathbf{w}^{*}=\arg\min_{\mathbf{w}}\big{\{}\frac{1}{2}\|\mathbf{w}-\mathbf{v}\|_{2}^{2}+\lambda\|\mathbf{w}\|_{\ell_{p}}^{p}\big{\}}:=\hbox{Pro}_{\lambda,p}(\mathbf{v}).

(3.4)

Then, we conclude that

|w_{j}^{*}|\leq\min\{|v_{j}|,(|v_{j}|/(\lambda p))^{1/(p-1)}\},\quad\,\forall\,j=1,2...,d.

We defer the proof of Lemma 2 to the Appendix. Applying Lemma 2 and the projection onto $\mathcal{C}_{\lambda}$ in (3.2), we have the following inequality

\displaystyle\|\mathbf{w}_{D,t}\|_{2}\leq(B/\lambda)^{1/p},\quad\forall\,t,\lambda>0.

(3.5)
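The coordinate-wise bound in Lemma 2 and the resulting inequality (3.5) can be sketched from the first-order optimality condition of (3.4). This is an informal outline under the assumption $w_{j}^{*}\neq 0$ (the case $w_{j}^{*}=0$ is trivial), not a substitute for the proof in the Appendix.

```latex
% First-order optimality for each coordinate of (3.4); sgn(w_j^*) = sgn(v_j):
w_j^{*} - v_j + \lambda p\,\mathrm{sgn}(w_j^{*})\,|w_j^{*}|^{p-1} = 0
\;\Longrightarrow\;
|w_j^{*}| + \lambda p\,|w_j^{*}|^{p-1} = |v_j|.
% Both terms on the left are non-negative, so each is at most |v_j|:
|w_j^{*}| \leq |v_j|, \qquad
\lambda p\,|w_j^{*}|^{p-1} \leq |v_j|
\;\Longrightarrow\;
|w_j^{*}| \leq \big(|v_j|/(\lambda p)\big)^{1/(p-1)}.
% Applied to (3.3) with v = v_{D,i_t} \in \mathcal{C}_{\lambda}:
\|\mathbf{w}_{D,t+1}\|_{2} \leq \|\mathbf{v}_{D,i_{t}}\|_{2} \leq (B/\lambda)^{1/p}.
```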

Consider a function $h:\mathbb{R}\rightarrow\mathbb{R}$ defined as $h(\theta)=|\theta|^{p}$ $(1<p\leq 2)$ . Note that it holds

h^{\prime}(\theta)=p\cdot\hbox{sign}(\theta)\cdot|\theta|^{p-1}.

Clearly, the first derivative of $h$ exists and is continuous for all $\theta\in\mathbb{R}$ . However, we notice that the inexact proximal operator $\hbox{Pro}_{\lambda,p}$ is not Lipschitz, due to the fact that $\|\cdot\|_{\ell_{p}}^{p}$ is not strongly smooth. Hence, as mentioned earlier, conventional analysis techniques that rely on strong smoothness of the objective function are no longer valid in our case. Fortunately, the function $h$ is strongly convex over bounded domains, as shown in Lemma 1, which enables us to avoid restrictive smoothness assumptions by an alternative proof strategy.
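The lack of strong smoothness of $h(\theta)=|\theta|^{p}$ for $p<2$ can be seen numerically: the difference quotient of $h^{\prime}$ at the origin equals $p\,\epsilon^{p-2}$ , which is unbounded as $\epsilon\rightarrow 0$ . The step sizes below are arbitrary illustrative choices.

```python
import numpy as np

def h_prime(theta, p):
    """Derivative of h(theta) = |theta|^p."""
    return p * np.sign(theta) * np.abs(theta) ** (p - 1)

def slope_at_zero(p, eps):
    """Difference quotient |h'(eps) - h'(0)| / eps, equal to p * eps**(p-2)."""
    return abs(h_prime(eps, p) - h_prime(0.0, p)) / eps

eps_values = (1e-2, 1e-4, 1e-6)
slopes_p15 = [slope_at_zero(1.5, eps) for eps in eps_values]
slopes_p2 = [slope_at_zero(2.0, eps) for eps in eps_values]
```

For $p=2$ the quotient stays constant at $2$ , whereas for $p=1.5$ it grows without bound as `eps` shrinks, so no finite Lipschitz constant for $h^{\prime}$ exists near the origin.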

Figure 1: Generalization gaps as a function of the number of epochs under various $p$ on three datasets. We observe that a smaller $p$ yields a worse generalization gap than a larger $p$ .

4 Stability and Generalization Bounds

In this section, we provide algorithm-dependent generalization bounds via the notion of stability for $\ell_{p}$ -regularized GCN. To this end, we first introduce the notion of algorithmic stability and thereby present a generalization bound associated with the algorithmic stability.

Let $\mathcal{A}_{D}$ be a learning algorithm trained on dataset $D$ , which can be viewed as a map from $D\rightarrow\mathcal{H}$ . For GCN, we set $\mathcal{A}_{D}=f(\mathbf{x},\mathbf{w}_{D})$ . The overall learning performance of $\mathcal{A}_{D}$ is measured by the following expected risk:

R(\mathcal{A}_{D}):=\mathbb{E}_{\mathbf{z}}\big{[}\ell\big{(}y,f(\mathbf{x},\mathbf{w}_{D})\big{)}\big{]}.

Accordingly, the empirical risk of $\mathcal{A}_{D}$ with the loss $\ell$ is given as

R_{n}(\mathcal{A}_{D}):=\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{w}_{D})\big{)}.

Even when the sample is fixed, $\mathcal{A}_{D}$ may still be a randomized algorithm due to the randomness of the algorithmic procedure (e.g. SGD). In this context, we define the expected generalization error as

E_{\text{gen}}:=\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]},

where the expectation $\mathbb{E}_{\mathcal{A}}$ is taken over the inherent randomness of $\mathcal{A}_{D}$ .

To introduce the notion of uniform stability for a randomized algorithm, we need two datasets derived from the training set $D$ defined above.

The set obtained by removing the $i$ -th data point from $D$ is represented as

D^{\backslash i}=\{\mathbf{z}_{1},\mathbf{z}_{2},...\mathbf{z}_{i-1},\mathbf{z}_{i+1},...,\mathbf{z}_{n}\},

and replacing the $i$ -th data point in $D$ by $\mathbf{z}^{{}^{\prime}}_{i}$ is represented as

D^{i}=\{\mathbf{z}_{1},\mathbf{z}_{2},...\mathbf{z}_{i-1},\mathbf{z}^{{}^{\prime}}_{i},\mathbf{z}_{i+1}...,\mathbf{z}_{n}\}.

Definition 1.

A randomized learning algorithm $\mathcal{A}_{D}$ is $\beta_{n}$ -uniformly stable with respect to a loss function $\ell$ , if it satisfies

\sup_{D,\mathbf{z}}\big{|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{\backslash i}}))]\big{|}\leq\beta_{n}.

By the triangle inequality, the following uniform stability result associated with $D^{i}$ also holds

\sup_{D,\mathbf{z}}\big{|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big{|}\leq 2\beta_{n}.
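The factor $2$ can be traced through the triangle inequality explicitly. The sketch below uses the observation that removing the $i$ -th point from $D$ and from $D^{i}$ gives the same set $D^{\backslash i}$ , so Definition 1 applies to both terms.

```latex
\big|\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]
   -\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big|
\leq
\big|\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]
   -\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{\backslash i}}))]\big|
+\big|\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{\backslash i}}))]
   -\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big|
\leq \beta_{n}+\beta_{n}=2\beta_{n}.
```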

Stability is a property of a learning algorithm: roughly speaking, if two training samples are close to each other, a stable algorithm will generate close outputs. There are many variants of stability, such as hypothesis stability Kearns and Ron (1999), sample average stability Shalev-Shwartz et al. (2010) and uniform stability. This paper focuses on uniform stability, since it is closely related to other types of stability.

The following lemma, proved in Verma and Zhang (2019), shows that a randomized learning algorithm with uniform stability guarantees a meaningful generalization bound.

Lemma 3.

A uniformly stable randomized algorithm $(\mathcal{A}_{D},\beta_{n})$ with a bounded loss function $0\leq\ell(y,f(\mathbf{x}))\leq B$ satisfies the following generalization bound with probability at least $1-\delta$ , over the random draw of $D,\mathbf{z}$ with $\delta\in(0,1)$ ,

\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]}\leq 2\beta_{n}+(4n\beta_{n}+B)\sqrt{\frac{\log(1/\delta)}{2n}}.
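To get a feel for the bound in Lemma 3, the right-hand side can be evaluated for a few sample sizes. The stability rate $\beta_{n}=1/n$ and the constants below are illustrative assumptions, not values derived in the paper.

```python
import math

def generalization_bound(beta_n, n, B, delta):
    """Right-hand side of Lemma 3:
    2*beta_n + (4*n*beta_n + B) * sqrt(log(1/delta) / (2*n))."""
    return 2 * beta_n + (4 * n * beta_n + B) * math.sqrt(math.log(1 / delta) / (2 * n))

# Hypothetical setting: beta_n = 1/n, loss bounded by B = 1, delta = 0.05.
bounds = {n: generalization_bound(1.0 / n, n, B=1.0, delta=0.05)
          for n in (100, 1000, 10000)}
```

With $\beta_{n}$ of order $1/n$ , the term $4n\beta_{n}$ stays bounded and the whole expression decays at rate $1/\sqrt{n}$ . A slower stability rate would make the second term dominate.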

We now state smoothness assumptions on the loss function and the activation function, which are used for analyzing the stability of stochastic gradient descent. These assumptions are standard in the optimization literature.

Assumption 1 (Smoothness for loss function and activation function).

We assume that the loss function is Lipschitz continuous and smooth,

\displaystyle\|\ell(y,f(\cdot))-\ell(y,f^{\prime}(\cdot))\|\leq a_{\ell}\|f(\cdot)-f^{\prime}(\cdot)\|,

\displaystyle\|\ell^{\prime}(y,f(\cdot))-\ell^{\prime}(y,f^{\prime}(\cdot))\|\leq b_{\ell}\|f(\cdot)-f^{\prime}(\cdot)\|,\quad\forall\,f,f^{\prime}\in\mathcal{H},

where $a_{\ell}$ and $b_{\ell}$ are two positive constants. Similarly, the activation function also satisfies

	$\displaystyle\|\sigma(x)-\sigma(y)\|\leq a_{\sigma}\|x-y\|,\,\,$	$\displaystyle\|\sigma^{\prime}(x)-\sigma^{\prime}(y)\|\leq b_{\sigma}\|x-y\|\,$
	and	$\displaystyle\ell(y,f(\mathbf{x}))\leq B,\,\forall\,x,y\in\mathbb{R},$

where $a_{\sigma},\,b_{\sigma}$ and $B$ are positive constants as well.

We now present an explicit stability bound for GCN with $\ell_{p}$ -regularizer via stochastic gradient method.

Theorem 1.

Suppose that the loss and activation functions are Lipschitz continuous and smooth (Assumption 1). Then a single-layer GCN model, trained by the proposed SGD given in (3.2)-(3.3) for $T$ iterations, is $\beta_{n}$ -uniformly stable with

\displaystyle\beta_{n}\leq a_{\ell}^{2}a_{\sigma}^{2}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1},

where $C_{p,\lambda}:=\frac{28}{p(p-1)\lambda_{t}}\big{(}\nicefrac{{B}}{{\lambda}}\big{)}^{(3-p)/p}$ and $\mathbf{g}_{e}:=\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}$ .
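The dependence of the bound in Theorem 1 on $p$ , $\eta$ , $n$ and $T$ is easier to see when the expression is evaluated directly. The sketch below is a hedged illustration: the function names and argument names are assumptions, `lam` and `lam_t` stand for $\lambda$ and $\lambda_{t}$ as they appear in the definition of $C_{p,\lambda}$ , and all numerical inputs are left to the reader.

```python
def c_p_lambda(p: float, lam: float, lam_t: float, B: float) -> float:
    """C_{p,lambda} = 28 / (p (p-1) lam_t) * (B / lam)^((3-p)/p), for 1 < p <= 2."""
    if not 1.0 < p <= 2.0:
        raise ValueError("p must satisfy 1 < p <= 2")
    return 28.0 / (p * (p - 1.0) * lam_t) * (B / lam) ** ((3.0 - p) / p)


def stability_bound(p, lam, lam_t, B, a_ell, a_sigma, lam_G_max, g_e, eta, n, T):
    """Right-hand side of the bound on beta_n stated in Theorem 1."""
    C = c_p_lambda(p, lam, lam_t, B)
    # Per-iteration growth factor C_{p,lambda} * (1 + (a_sigma^2 + a_ell) * eta * g_e^2).
    rate = C * (1.0 + (a_sigma ** 2 + a_ell) * eta * g_e ** 2)
    geometric_sum = sum(rate ** (t - 1) for t in range(1, T + 1))
    return a_ell ** 2 * a_sigma ** 2 * lam_G_max * eta * C * g_e / n * geometric_sum
```

Because of the factor $1/(p(p-1))$ in $C_{p,\lambda}$ , the computed value blows up as $p\rightarrow 1$ , while it scales as $1/n$ in the sample size.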

The proof of Theorem 1 is given in Appendix A. The key step of the proof is that, in such a scenario, the error caused by the difference in the nonconvex empirical risk of GCN grows polynomially with the number of iterations.

Remark 2.

Theorem 1 provides the uniform stability for the last iterate of SGD for $\ell_{p}$ -regularized GCN. It is worth mentioning that the upper bound of this stability depends on the graph structure (i.e., $\mathbf{g}_{e}$ and $\lambda_{G}^{max}$ ) and the regularization hyper-parameter $p$ , as well as on the sample size and the learning rate of SGD.

Remark 3.

More precisely, Theorem 1 shows that the stability bound decreases as $p$ increases, and that it increases with the graph-structure parameter $\lambda_{G}^{max}$ .

Substituting the stability bound of Theorem 1 into Lemma 3, we obtain the following generalization bound for a single-layer GCN with the $\ell_{p}$ -regularizer.

Theorem 2.

Under the same conditions as in Theorem 1, the following generalization bound holds with high probability:

	$\displaystyle\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]}$
	$\displaystyle=\mathcal{O}\Big{(}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{\sqrt{n}}{\sum}_{t=1}^{T}\big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\big{)}^{t-1}\Big{)}.$

Remark 4.

Based on Theorem 2, we conclude that $\ell_{p}$ -regularization with $1<p\leq 2$ generalizes. Note that as $p\rightarrow 1$ the stability bound breaks down, due to the sparsity of $\ell_{1}$ -regularization. The smaller $p$ becomes, the greater the stability parameter $\beta_{n}$ is, but at the same time the obtained parameter $\widehat{\mathbf{w}}$ in (2.6) tends to be sparser, as shown in Koltchinskii (2009). These properties are also verified in our experimental evaluation.

Remark 5.

Although a small learning rate yields a smaller generalization gap, the parameter range that SGD explores becomes very small, resulting in a larger training error. This conclusion also applies to SGD variants for general models.

Dataset	Graph Filter	${p=1.001}$	${p=1.149}$	${p=1.320}$	${p=1.516}$	${p=1.741}$	${p=2}$
Citeseer	Augmented Normalized	$\mathbf{57.44\pm 0.87}$	$54.69\pm 0.82$	$52.41\pm 0.73$	$50.72\pm 0.67$	$50.27\pm 0.68$	$50.18\pm 0.69$
	Normalized	$\mathbf{57.02\pm 1.32}$	$54.47\pm 0.99$	$51.93\pm 0.66$	$50.39\pm 0.62$	$49.95\pm 0.62$	$49.88\pm 0.63$
	Random Walk	$\mathbf{56.17\pm 1.49}$	$53.84\pm 1.20$	$51.84\pm 0.84$	$50.54\pm 0.78$	$50.16\pm 0.81$	$50.13\pm 0.8$
	Unnormalized	$\mathbf{60.62\pm 2.25}$	$58.78\pm 2.31$	$57.52\pm 2.23$	$56.61\pm 2.25$	$56.32\pm 2.3$	$56.45\pm 2.2$
Cora	Augmented Normalized	$\mathbf{56.66\pm 3.43}$	$54.46\pm 2.68$	$52.18\pm 2.13$	$50.79\pm 1.86$	$50.41\pm 1.77$	$50.27\pm 1.76$
	Normalized	$\mathbf{55.77\pm 3.68}$	$53.74\pm 2.78$	$51.6\pm 2.10$	$50.22\pm 1.82$	$49.77\pm 1.82$	$49.69\pm 1.83$
	Random Walk	$\mathbf{53.43\pm 2.12}$	$52.03\pm 1.94$	$51.08\pm 1.37$	$50.29\pm 1.18$	$50.08\pm 1.18$	$50.07\pm 1.17$
	Unnormalized	$\mathbf{64.94\pm 3.61}$	$63.86\pm 3.5$	$63.31\pm 3.59$	$62.79\pm 3.53$	$62.87\pm 3.56$	$62.82\pm 3.5$
Pubmed	Augmented Normalized	$\mathbf{54.13\pm 2.16}$	$52.61\pm 2.62$	$50.83\pm 1.92$	$49.73\pm 1.67$	$49.24\pm 1.69$	$49.06\pm 1.63$
	Normalized	$\mathbf{53.27\pm 3.40}$	$52.48\pm 2.35$	$51.28\pm 2.09$	$50.29\pm 1.96$	$49.56\pm 1.78$	$49.30\pm 1.55$
	Random Walk	$\mathbf{54.72\pm 3.57}$	$53.07\pm 3.20$	$51.82\pm 2.94$	$51.03\pm 2.84$	$50.85\pm 2.8$	$50.77\pm 2.76$
	Unnormalized	$\mathbf{74.54\pm 6.26}$	$74.10\pm 6.40$	$73.29\pm 6.3$	$73.25\pm 6.57$	$72.91\pm 6.12$	$73.09\pm 5.96$

Table 1: The sparsity ( $\%$ ) of the obtained parameter $\widehat{\mathbf{w}}$ under different $p$ and normalization steps on three datasets.

5 Experiments

In this section, we conduct experiments to validate our theoretical findings. In particular, we show that there exists a stability-sparsity trade-off with varying $p$ , and that the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter. To this end, we first introduce the experimental settings. Then we assess the uniform stability of GCN on semi-supervised learning tasks under varying $p$ . Finally, we evaluate the effect of different graph filters on the GCN stability bounds.

5.1 Experimental Setup

Datasets.

We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. (2008). In every dataset, each document is represented as a sparse bag-of-words feature vector. The relationships between documents consist of a list of citation links, which are treated as undirected edges and used to construct the adjacency matrix. Each document belongs to one of several classes and carries the corresponding class label.

Baselines.

We run several empirical experiments with a representative GCN model Kipf and Welling (2016a). For all datasets, we use 2-layer neural networks with 16 hidden units. In all cases, we evaluate the difference between the learned weight parameters and the generalization gap of two GCN models trained on datasets $D$ and $D^{i}$ , respectively. Specifically, we generate $D^{i}$ by choosing a random data point in the training set $D$ and replacing it with a different random point. We also record the gap between the training and testing errors and the parameter distance $\sqrt{\|\widehat{\mathbf{w}}-\widehat{\mathbf{w}}^{{}^{\prime}}\|^{2}/(\|\widehat{\mathbf{w}}\|^{2}+\|\widehat{\mathbf{w}}^{{}^{\prime}}\|^{2})},$ where $\widehat{\mathbf{w}}$ and $\widehat{\mathbf{w}}^{{}^{\prime}}$ are the weight parameters of the respective models at each epoch.
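The normalized parameter distance defined above can be computed in a few lines. This is a minimal sketch that assumes the weights of each model have been flattened into a plain sequence of floats; the function name `parameter_distance` is hypothetical.

```python
import math
from typing import Sequence


def parameter_distance(w: Sequence[float], w_prime: Sequence[float]) -> float:
    """sqrt(||w - w'||^2 / (||w||^2 + ||w'||^2)) for two flattened weight vectors."""
    if len(w) != len(w_prime):
        raise ValueError("weight vectors must have the same length")
    diff_sq = sum((a - b) ** 2 for a, b in zip(w, w_prime))
    norm_sq = sum(a * a for a in w) + sum(b * b for b in w_prime)
    if norm_sq == 0.0:
        return 0.0  # both vectors are zero, so they coincide
    return math.sqrt(diff_sq / norm_sq)
```

The quantity is 0 for identical weights and never exceeds $\sqrt{2}$ , which makes curves from different datasets and filters comparable on one scale.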

Training settings.

5.2 The Effect of Varying $p$

Generalization gap.

In this experiment, we empirically compare the effect of $p$ on the GCN stability bounds using $p\in\{1.001,1.149,1.32,1.516,1.741,2\}$ . We quantitatively measure the generalization gap, defined as the absolute difference between the training and testing errors, and the difference between the learned weight parameters of GCN models trained on $D$ and $D^{i}$ , respectively. As Figure 1 shows, these empirical observations are in line with our stability bounds (see also Figure 3 in the Appendix).

Sparsity.

Besides convergence, we are particularly concerned with the sparsity of the solutions. The sparsity ratios of the $\ell_{p}$ -based method are summarized in Table 1. Observe that $\ell_{p}$ -regularized learning with $p\rightarrow 1$ identifies most of the sparsity pattern but generalizes much worse.

5.3 The Effect of Graph Filters

Different normalization steps.

In this experiment, we mainly consider the implications of our results for the design of the following graph convolution filters: (1) Unnormalized Graph Filters: $g(\mathbf{L})=\mathbf{A}+\mathbf{I}$ ; (2) Normalized Graph Filters: $g(\mathbf{L})=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}+\mathbf{I}$ ; (3) Random Walk Graph Filters: $g(\mathbf{L})=\mathbf{D}^{-1}\mathbf{A}+\mathbf{I}$ ; (4) Augmented Normalized Graph Filters: $g(\mathbf{L})=(\mathbf{D}+\mathbf{I})^{-1/2}(\mathbf{A}+\mathbf{I})(\mathbf{D}+\mathbf{I})^{-1/2}$ .
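Since the stability bound depends on the largest absolute eigenvalue of the filter, it can help to compute that quantity for the four filters on a toy graph. The sketch below is an illustration under stated assumptions, not the experimental code: it uses plain Python lists, a three-node path graph chosen only for the example, and a simple power iteration that assumes a connected graph without isolated nodes. The names `graph_filters` and `largest_abs_eigenvalue` are hypothetical.

```python
import math


def graph_filters(A):
    """Build the four filters g(L) from a symmetric 0/1 adjacency matrix A.

    A is a list of lists; every node is assumed to have at least one
    neighbor so that the degree normalizations are well defined.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return {
        "unnormalized": [[A[i][j] + eye[i][j] for j in range(n)] for i in range(n)],
        "normalized": [[A[i][j] / math.sqrt(deg[i] * deg[j]) + eye[i][j]
                        for j in range(n)] for i in range(n)],
        "random_walk": [[A[i][j] / deg[i] + eye[i][j] for j in range(n)]
                        for i in range(n)],
        "augmented_normalized": [
            [(A[i][j] + eye[i][j]) / math.sqrt((deg[i] + 1) * (deg[j] + 1))
             for j in range(n)] for i in range(n)],
    }


def largest_abs_eigenvalue(M, iters=500):
    """Power iteration; assumes a unique dominant eigenvalue, which holds
    for these nonnegative filters on a connected graph."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    estimate = 0.0
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        estimate = math.sqrt(sum(x * x for x in u))
        v = [x / estimate for x in u]
    return estimate


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # toy graph: 0 - 1 - 2
    for name, M in graph_filters(path).items():
        print(name, round(largest_abs_eigenvalue(M), 6))
```

On this path graph the values are approximately 2.414 for the unnormalized filter, 2 for the normalized and random walk filters, and 1 for the augmented normalized filter.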

In this experiment, we quantitatively measure the generalization gap and parameter distance per epoch. From Figure 2, it is clear that the Unnormalized Graph Filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets. Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap and parameter distance become constant after a certain number of iterations. More results can be found in the supplementary material.

6 Conclusion

In this paper, we present an explicit theoretical understanding of stochastic learning for GCN with the $\ell_{p}$ regularizer and analyze the stability of our regularized stochastic algorithm. In particular, our derived bounds show that the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter, and that there exists a generalization-sparsity trade-off with varying $p$ . It is worth noting that previous stability-based generalization analyses assumed a twice-differentiable objective function, an assumption that no longer applies to our $\ell_{p}$ -regularized learning scheme. To address this issue, we propose a new proximal SGD algorithm for GCNs with an inexact operator, which exhibits comparable empirical performance.

Acknowledgements

Lv’s research is supported by the “QingLan” Project of Jiangsu Universities. Ming Li would like to thank the support from the National Natural Science Foundation of China (No. U21A20473, No. 62172370) and Zhejiang Provincial Natural Science Foundation (No. LY22F020004).

Contribution Statement

Linsen Wei and Ming Li contributed equally to this work.

References

Barrett et al. [2018] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 511–520. PMLR, 2018.
Battaglia et al. [2016] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.
Bottou and Bousquet [2007] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2007.
Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
Bruna et al. [2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, volume 29, pages 3844–3852, 2016.
Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, volume 28, pages 2224–2232, 2015.
Garg et al. [2020] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.
Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035, 2017.
Hardt et al. [2016] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pages 2323–2332. PMLR, 2018.
Kearns and Ron [1999] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
Kipf and Welling [2016a] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Kipf and Welling [2016b] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
Koltchinskii [2009] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. In Annales de l’IHP Probabilités et statistiques, volume 45, pages 7–57, 2009.
Liao et al. [2020] Renjie Liao, Raquel Urtasun, and Richard Zemel. A pac-bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2020.
Liu et al. [2021] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, pages 6837–6849. PMLR, 2021.
Ma et al. [2021] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
Nie et al. [2011] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Unsupervised and semi-supervised learning via $\ell_{1}$ -norm graph. In 2011 International Conference on Computer Vision, pages 2268–2273. IEEE, 2011.
Oono and Suzuki [2020] Kenta Oono and Taiji Suzuki. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. Advances in Neural Information Processing Systems, 33:18917–18930, 2020.
Schlichtkrull et al. [2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018.
Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
Shalev-Shwartz et al. [2010] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
Tibshirani [2014] Ryan J Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
Veličković et al. [2017] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
Verma and Zhang [2019] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.
Wang et al. [2015] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan Tibshirani. Trend filtering on graphs. In Artificial Intelligence and Statistics, pages 1042–1050. PMLR, 2015.
Wibisono et al. [2009] Andre Wibisono, Lorenzo Rosasco, and Tomaso Poggio. Sufficient conditions for uniform stability of regularization algorithms. Computer Science and Artificial Intelligence Laboratory Technical Report, MIT-CSAIL-TR-2009-060, 2009.
Xu et al. [2011] Huan Xu, Constantine Caramanis, and Shie Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on Pattern Analysis and Machine Intelligence, 34(1):187–193, 2011.
Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
Yun et al. [2019] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. Advances in Neural Information Processing Systems, 32:11983–11993, 2019.
Zhang et al. [2020] Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, and Le Song. Efficient probabilistic logic reasoning with graph neural networks. arXiv preprint arXiv:2001.11850, 2020.

The appendices are structured as follows. In Appendix A, we provide the proof of Theorem 1. The proof of Lemma 2 is sketched in Appendix B. In Appendix C, we provide additional experiments and detailed setups.

Appendix A Proof of Theorem 1

A.1 Proof Sketch

We describe the main procedure of training a GCN using SGD on datasets $D$ and $D^{i}$ , which differ in one data point. Let $Z=\{\mathbf{z}_{1},...,\mathbf{z}_{t},...,\mathbf{z}_{T}\}$ be a sequence of samples, where $\mathbf{z}_{t}$ is an i.i.d. sample drawn from $D$ at the $t$ -th iteration of SGD during a training run of the GCN. Training the same GCN using SGD on $D^{i}$ means that we feed the same sequence to the GCN, except that whenever $\mathbf{z}_{t}=(\mathbf{x}_{i},y_{i})$ occurs for some $t$ , we replace it with $\mathbf{z}_{t}^{\prime}=(\mathbf{x}_{i}^{\prime},y^{\prime}_{i})$ . We denote this sample sequence by $Z^{\prime}$ . Let $\{\mathbf{w}_{D,0},\mathbf{w}_{D,1},...,\mathbf{w}_{D,T}\}$ and $\{\mathbf{w}_{D^{i},0},\mathbf{w}_{D^{i},1},...,\mathbf{w}_{D^{i},T}\}$ be the corresponding sequences of weight parameters learned by running SGD on $D$ and $D^{i}$ , respectively.

Given the two sequences of weight parameters, $\{\mathbf{w}_{D,0},\mathbf{w}_{D,1},...,\mathbf{w}_{D,T}\}$ and $\{\mathbf{w}_{D^{i},0},\mathbf{w}_{D^{i},1},...,\mathbf{w}_{D^{i},T}\}$ , we define their difference at each iteration by

\Delta\mathbf{w}_{t}:=\mathbf{w}_{D^{i},t}-\mathbf{w}_{D,t}.

We now relate $\Delta\mathbf{w}_{t}$ to the stability parameter $\beta_{n}$ given in Definition 1. Since the loss is Lipschitz continuous by Assumption 1, for any $\mathbf{z}=(\mathbf{x},y)$ we have

	$\displaystyle\big{\|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big{\|}$
$\displaystyle\leq$	$\displaystyle a_{\ell}\mathbb{E}_{\mathcal{A}}[\|f(\mathbf{x},\mathbf{w}_{D})-f(\mathbf{x},\mathbf{w}_{D^{i}})\|]$
$\displaystyle\leq$	$\displaystyle a_{\ell}a_{\sigma}\mathbb{E}_{\mathcal{A}}\Big{[}\Big{\|}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D}-\mathbf{w}_{D^{i}})\Big{\|}\Big{]}$
$\displaystyle\leq$	$\displaystyle a_{\ell}a_{\sigma}\mathbb{E}_{\mathcal{A}}[\|\Delta\mathbf{w}\|_{2}]\cdot\big{\|}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2},$	(A.1)

where the second inequality follows from the Lipschitz property of $\sigma(\cdot)$ , and the last one is based on the Cauchy-Schwarz inequality.

As proved in Verma and Zhang [2019], $\mathbf{g}_{e}=\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}$ can be bounded in terms of the largest absolute eigenvalue of the graph convolution filter $g(\mathbf{L})$ . Thus, to provide an upper bound of $\beta_{n}$ in uniform stability, the key point is to bound $\mathbb{E}_{\mathcal{A}}[\|\Delta\mathbf{w}\|_{2}]$ involved in (A.1).

A.2 Bounding $\|\Delta\mathbf{w}_{t+1}\|_{2}$

This subsection is devoted to bounding $\|\Delta\mathbf{w}_{t+1}\|_{2}$ , which is random because of the randomness inherent in SGD. The bound is proved through a series of lemmas. Unlike the existing related work Verma and Zhang [2019]; Hardt et al. [2016], our argument relies crucially on the strong convexity of $|x|^{p}$ over bounded domains. It should be pointed out that, since the first derivative of our objective function is not Lipschitz, the standard technique for proving the stability of SGD Hardt et al. [2016] is no longer applicable to our setting.

Proposition 1.

Suppose that the loss and activation functions are Lipschitz-continuous and smooth functions (Assumption 1). Then a single layer GCN model, trained by the proposed algorithm (3.2)-(3.3) at step $t$ , satisfies

\displaystyle\|\Delta\mathbf{w}_{t+1}\|_{2}\leq C_{p,\lambda}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}.

(A.2)

At step $t$ , $\mathbf{z}_{t}=(\mathbf{x}_{t},y_{t})$ is a sample drawn from $D$ and $\mathbf{z}_{t}^{\prime}=(\mathbf{x}_{t}^{\prime},y_{t}^{\prime})$ is a sample drawn from $D^{i}$ . Here $C_{p,\lambda}$ is defined as in Theorem 1.

Proof of Proposition 1.

For every $a,b\in\mathbb{R}$ , equation (4) in Wibisono et al. [2009] gives the identity

\displaystyle|a|^{p}+|b|^{p}-2\big{|}\frac{a+b}{2}\big{|}^{p}=\frac{1}{4}(b-a)^{2}p(p-1)|c|^{p-2},

(A.3)

where $\min\{a,b\}\leq c\leq\max\{a,b\}$ and the choice of $c$ in our case will be given later. The equality (A.3) shows that $|x|^{p}$ $(1<p\leq 2)$ is strongly convex over bounded domains, even though its first derivative is not Lipschitz smooth.
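The consequence of (A.3) that is used later, namely a quadratic lower bound on the midpoint gap when $|a|,|b|$ are bounded, can be sanity-checked numerically. The sketch below is only an illustration; the function names and the argument `radius` (an assumed bound with $|a|,|b|\leq$ `radius`) are hypothetical.

```python
def midpoint_gap(a: float, b: float, p: float) -> float:
    """Left-hand side of (A.3): |a|^p + |b|^p - 2 |(a+b)/2|^p."""
    return abs(a) ** p + abs(b) ** p - 2.0 * abs((a + b) / 2.0) ** p


def strong_convexity_lower_bound(a: float, b: float, p: float, radius: float) -> float:
    """(1/4) (b-a)^2 p (p-1) radius^(p-2).

    Lower-bounds the midpoint gap when |a|, |b| <= radius and 1 < p <= 2,
    because |c| <= radius and p - 2 <= 0 give |c|^(p-2) >= radius^(p-2).
    """
    return 0.25 * (b - a) ** 2 * p * (p - 1.0) * radius ** (p - 2.0)


if __name__ == "__main__":
    # Compare both sides on a few points of [-1, 1] for several exponents.
    for p in (1.1, 1.5, 2.0):
        for a, b in ((1.0, -1.0), (0.0, 1.0), (0.3, 0.7)):
            print(p, a, b, midpoint_gap(a, b, p),
                  strong_convexity_lower_bound(a, b, p, radius=1.0))
```

For $p=2$ the two sides coincide, and for $p<2$ the gap is strictly larger than the bound whenever $a\neq b$ lie strictly inside the ball.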

In particular, we can set $a$ and $b$ to be the $j$ -th components $\mathbf{w}^{j}_{D,t+1}$ of $\mathbf{w}_{D,t+1}$ and $\mathbf{w}^{j}_{D^{i},t+1}$ of $\mathbf{w}_{D^{i},t+1}$ respectively. Then we obtain

\displaystyle|\mathbf{w}^{j}_{D,t+1}|^{p}+|\mathbf{w}^{j}_{D^{i},t+1}|^{p}-2\Big{|}\frac{\mathbf{w}^{j}_{D,t+1}+\mathbf{w}^{j}_{D^{i},t+1}}{2}\Big{|}^{p}=\frac{1}{4}(\mathbf{w}^{j}_{D,t+1}-\mathbf{w}^{j}_{D^{i},t+1})^{2}p(p-1)|c_{j}|^{p-2},

(A.4)

where $\min\{\mathbf{w}^{j}_{D,t+1},\mathbf{w}^{j}_{D^{i},t+1}\}\leq c_{j}\leq\max\{\mathbf{w}^{j}_{D,t+1},\mathbf{w}^{j}_{D^{i},t+1}\}$ . As stated in (3.5), we can derive a bound on $c_{j}$ as follows:

|c_{j}|\leq\max\{|\mathbf{w}^{j}_{D,t+1}|,|\mathbf{w}^{j}_{D^{i},t+1}|\}\leq\Big{(}\frac{B}{\lambda}\Big{)}^{1/p}.

Furthermore, since $1<p\leq 2$ , it follows that

|c_{j}|^{p-2}\geq\Big{(}\frac{B}{\lambda}\Big{)}^{(p-2)/p},\quad\forall\,j\in[d].

Substituting the bound on $c_{j}$ into (A.4) and then summing over $j\in[d]$ , we obtain

		$\displaystyle\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}}$
		$\displaystyle\geq\frac{1}{4}p(p-1)\Big{(}\frac{B}{\lambda}\Big{)}^{(p-2)/p}\|\mathbf{w}_{D,t+1}-\mathbf{w}_{D^{i},t+1}\|_{\ell_{2}}^{2}.$		(A.5)

In view of (A.5), it suffices to bound from above the term

\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}},

which is completed by the following lemma. ∎

Lemma 4.

For any $\theta\in[0,1]$ , we have

	$\displaystyle\lambda_{t}\big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{)}$
	$\displaystyle\leq 7(B/\lambda)^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}.$

Proof.

The proof for Lemma 4 is provided in Appendix B.2. ∎

Now we proceed with the proof of Proposition 1. Applying Lemma 4 with $\theta=\frac{1}{2}$ yields that

	$\displaystyle\lambda_{t}\Big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}}\Big{)}$
	$\displaystyle\leq 7\big{(}\frac{B}{\lambda}\big{)}^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}.$

Combining the inequality above with (A.5) and dividing both sides by $7(B/\lambda)^{1/p}$ , we get

\displaystyle\frac{\lambda_{t}}{28}p(p-1)\big{(}\frac{B}{\lambda}\big{)}^{(p-3)/p}\|\mathbf{w}_{D,t+1}-\mathbf{w}_{D^{i},t+1}\|_{\ell_{2}}^{2}\leq\|\Delta\mathbf{w}_{t}\|_{2}+\eta\big{\|}\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\big{\|}_{2},

which immediately yields our desired result in Proposition 1.

Since $\mathbf{z}_{t}$ and $\mathbf{z}_{t}^{\prime}$ are two random samples from different datasets ( $D$ and $D^{i}$ ) respectively, we need to consider one of the following two scenarios, which must occur at iteration $t$ .

(Scenario 1) At step $t$ , SGD picks a sample $\mathbf{z}_{t}=\mathbf{z}_{t}^{\prime}=(\mathbf{x},y)$ which is identical in $Z$ and $Z^{\prime}$ , occurring with probability $(n-1)/n$ .

(Scenario 2) At step $t$ , SGD picks the only sample in which $Z$ and $Z^{\prime}$ differ, i.e., $\mathbf{z}_{t}\neq\mathbf{z}_{t}^{\prime}$ , which occurs with probability $1/n$ .

Lemma 5 below treats Scenario 1. In this case, we show that $\|\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\|_{2}$ can be bounded in terms of $\|\Delta\mathbf{w}_{t}\|_{2}$ and $\mathbf{g}_{e}^{2}$ .

Lemma 5.

Under Assumption 1, in Scenario 1 there holds

\displaystyle\big{\|}\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\big{\|}_{2}\leq(a_{\sigma}^{2}+a_{\ell})\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}^{2}\|\Delta\mathbf{w}_{t}\|_{2}.

Proof.

The proof for Lemma 5 is provided in Appendix B.3. ∎

In contrast to the bound for Scenario 1 in Lemma 5, the following lemma gives a cruder bound on the gradient difference under Scenario 2.

Lemma 6.

Under Assumption 1, in Scenario 2 there holds

\displaystyle\big{\|}\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\big{\|}_{2}\leq 2a_{\ell}a_{\sigma}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}.

Proof.

The proof for Lemma 6 is also provided in Appendix B.4. ∎

Substituting the results of Lemma 5 and Lemma 6 into Proposition 1, and taking the expectation over all possible sample sequences $\mathbf{z},\mathbf{z}^{\prime}$ from $D$ and $D^{i}$ , we obtain the following lemma.

Lemma 7.

\displaystyle\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{T}\|_{2}\big{]}\leq\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1}.

Proof of Lemma 7.

Combining (A.2) in Proposition 1 with the results of Lemma 5 and Lemma 6, we see that $\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t+1}\|_{2}\big{]}$ can be upper bounded by

	$\displaystyle C_{p,\lambda}\Big{(}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+\mathbb{E}_{SGD}\big{[}\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{]}\Big{)}$
	$\displaystyle\leq C_{p,\lambda}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+2a_{\ell}a_{\sigma}C_{p,\lambda}\frac{\eta}{n}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}$
	$\displaystyle\quad+(a_{\sigma}^{2}+a_{\ell})C_{p,\lambda}\eta\big{(}1-\frac{1}{n}\big{)}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}^{2}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}$
	$\displaystyle=C_{p,\lambda}\Big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\big{(}1-\frac{1}{n}\big{)}\mathbf{g}_{e}^{2}\Big{)}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}.$

Finally, bounding $1-\frac{1}{n}$ by $1$ and unrolling the recursion above leads to

\displaystyle\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{T}\|_{2}\big{]}\leq\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1},

where we use the fact that the parameter initialization is the same for $D$ and $D^{i}$ , i.e., $\mathbf{w}_{D,0}=\mathbf{w}_{D^{i},0}$ . This completes the proof of Lemma 7. ∎
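The last step of this proof unrolls a one-step linear recursion into a geometric sum. A small sketch makes the equivalence concrete; here `rate` stands for $C_{p,\lambda}\big(1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big)$ and `offset` for $2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}/n$ , and both names, like the function names, are illustrative assumptions.

```python
def unrolled_recursion(rate: float, offset: float, T: int) -> float:
    """Iterate d_{t+1} = rate * d_t + offset for T steps, starting from d_0 = 0."""
    d = 0.0
    for _ in range(T):
        d = rate * d + offset
    return d


def closed_form(rate: float, offset: float, T: int) -> float:
    """offset * sum_{t=1}^{T} rate^(t-1), the form that appears in Lemma 7."""
    return offset * sum(rate ** (t - 1) for t in range(1, T + 1))
```

If `rate` exceeds 1 the sum grows geometrically in $T$ , whereas for `rate` below 1 it stays bounded by `offset / (1 - rate)` for every $T$ .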

Appendix B Proofs of Lemmas

B.1 Deferred Proof for Lemma 2.

Proof for Lemma 2.

The regularization term $\|\mathbf{w}\|_{\ell_{p}}^{p}$ is continuously differentiable; specifically, $\partial(|w_{j}|^{p})=p\cdot\hbox{sign}(w_{j})\cdot|w_{j}|^{p-1}$ , where $\mathbf{w}=(w_{1},w_{2},...,w_{d})$ . Since all the terms in (3.4) are separable in the $w_{j}$ ’s, the first-order condition for (3.4) gives

\displaystyle w_{j}^{*}-v_{j}+\lambda p\cdot\hbox{sign}(w_{j}^{*})\cdot|w_{j}^{*}|^{p-1}=0,\quad j=1,2...,d.

(B.1)

If $w^{*}_{j}>0$ , it follows from (B.1) that

w_{j}^{*}+\lambda p\cdot|w_{j}^{*}|^{p-1}=v_{j},

implying that

0<w^{*}_{j}\leq\min\{v_{j},(v_{j}/(\lambda p))^{1/(p-1)}\}.

Otherwise, we obtain from (B.1) that

-|w_{j}^{*}|-\lambda p\cdot|w_{j}^{*}|^{p-1}=v_{j},

implying that

|w_{j}^{*}|\leq\min\{|v_{j}|,(|v_{j}|/(\lambda p))^{1/(p-1)}\}.

Combining the previous two conclusions leads to our desired result. ∎
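The coordinate-wise condition (B.1) has no closed-form solution for general $p$ , but its root is easy to approximate numerically, which also allows the bound just proved to be checked. The following sketch uses plain bisection on a single coordinate; it is an illustration of (B.1) only and is not the inexact operator used by the proposed algorithm. The name `prox_lp` and the iteration count are assumptions.

```python
def prox_lp(v: float, lam: float, p: float, iters: int = 200) -> float:
    """Approximate the root w of w - v + lam * p * sign(w) * |w|^(p-1) = 0.

    Assumes lam > 0 and 1 < p <= 2. The root has the sign of v, and its
    magnitude m solves m + lam * p * m^(p-1) = |v|, whose left-hand side
    is increasing in m, so bisection on [0, |v|] applies.
    """
    if v == 0.0:
        return 0.0
    target = abs(v)
    lo, hi = 0.0, target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + lam * p * mid ** (p - 1.0) < target:
            lo = mid
        else:
            hi = mid
    magnitude = 0.5 * (lo + hi)
    return magnitude if v > 0 else -magnitude


def lemma_bound(v: float, lam: float, p: float) -> float:
    """min{|v|, (|v| / (lam * p))^(1/(p-1))}, the bound on |w*| derived above."""
    return min(abs(v), (abs(v) / (lam * p)) ** (1.0 / (p - 1.0)))
```

For $p=2$ the condition is linear and the root is $v/(1+2\lambda)$ , which gives a quick way to check the routine.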

B.2 Deferred Proof for Lemma 4.

Proof for Lemma 4.

Let us introduce the notation:

R_{D}(\mathbf{w}):=\frac{1}{2}\|\mathbf{w}-\mathbf{v}_{D,i_{t}}\|_{2}^{2},

and similarly

R_{D^{i}}(\mathbf{w}):=\frac{1}{2}\|\mathbf{w}-\mathbf{v}_{D^{i},i_{t}}\|_{2}^{2}.

Recall that a convex function $g$ admits the following inequality:

g(\mathbf{x}+\theta(\mathbf{u}-\mathbf{x}))-g(\mathbf{x})\leq\theta(g(\mathbf{u})-g(\mathbf{x})),\quad\forall\,\mathbf{x},\mathbf{u}.

The convexity of $R_{D}(\mathbf{w})$ and $R_{D^{i}}(\mathbf{w})$ immediately leads to

\displaystyle R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})\leq\theta\big{(}R_{D^{i}}(\mathbf{w}_{D^{i},t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})\big{)}.

Then, switching the role of $\mathbf{w}_{D,t+1}$ and $\mathbf{w}_{D^{i},t+1}$ , we get

\displaystyle R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\leq\theta\big{(}R_{D^{i}}(\mathbf{w}_{D,t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\big{)}.

Summing the two previous inequalities yields

\displaystyle R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})+R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\leq 0.

(B.2)

Now, by the definitions of $\mathbf{w}_{D,t+1}$ and $\mathbf{w}_{D^{i},t+1}$ , we have

	$\displaystyle R_{D}(\mathbf{w}_{D,t+1})+\lambda_{t}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\big{[}R_{D}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})+\lambda_{t}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{]}\leq 0,$
	$\displaystyle R_{D^{i}}(\mathbf{w}_{D^{i},t+1})+\lambda_{t}\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\big{[}R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})+\lambda_{t}\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{]}\leq 0.$

Then, summing the two previous inequalities and using (B.2) and (3.5), we obtain

	$\displaystyle\lambda_{t}\big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{)}$
	$\displaystyle\leq R_{D}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})+R_{D^{i}}(\mathbf{w}_{D,t+1})-R_{D}(\mathbf{w}_{D,t+1})$
	$\displaystyle\leq\big{(}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}-\mathbf{v}_{D,i_{t}}\|_{2}+\|\mathbf{w}_{D,t+1}-\mathbf{v}_{D^{i},i_{t}}\|_{2}\big{)}\|\mathbf{v}_{D,i_{t}}-\mathbf{v}_{D^{i},i_{t}}\|_{2}$
	$\displaystyle\leq\big{(}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}-\mathbf{v}_{D,i_{t}}\|_{2}+\|\mathbf{w}_{D,t+1}-\mathbf{v}_{D^{i},i_{t}}\|_{2}\big{)}$
	$\displaystyle\quad\times\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}$
	$\displaystyle\leq 7(B/\lambda)^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)},$		(B.3)

where we use the well-known property that projection onto a convex set $\mathcal{C}$ is non-expansive, that is, for any two points $\mathbf{x},\mathbf{u}$ , $\|\Pi_{\mathcal{C}}(\mathbf{x})-\Pi_{\mathcal{C}}(\mathbf{u})\|_{2}\leq\|\mathbf{x}-\mathbf{u}\|_{2}$ . This completes the proof of Lemma 4. ∎

B.3 Deferred Proof for Lemma 5.

Proof for Lemma 5.

Recall that the predictor function is defined as $f(\mathbf{x},\mathbf{w})=\sigma\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)}$ . The gradient of $f(\mathbf{x},\mathbf{w})$ with respect to $\mathbf{w}$ is

\nabla f(\mathbf{x},\mathbf{w})=\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}.
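The closed-form gradient above can be sanity-checked against central finite differences. This is a rough sketch under stated assumptions: $\sigma=\tanh$ is chosen only as an example of a smooth activation, and the neighbour features `neighbour_features`, filter entries `filter_row` (standing in for $e_{\cdot j}$), and weight vector `w` are arbitrary illustrative values, not taken from the paper.

```python
import math


def aggregate(features, weights_e):
    """Return the vector sum_j e_j * x_j over the neighbourhood."""
    dim = len(features[0])
    return [sum(e * x[k] for e, x in zip(weights_e, features))
            for k in range(dim)]


def predictor(features, weights_e, w):
    """f(x, w) = sigma(sum_j e_j x_j^T w), with sigma = tanh (assumed)."""
    agg = aggregate(features, weights_e)
    return math.tanh(sum(a * wk for a, wk in zip(agg, w)))


def predictor_grad(features, weights_e, w):
    """Closed form: sigma'(z) * sum_j e_j x_j, with sigma'(z) = 1 - tanh(z)^2."""
    agg = aggregate(features, weights_e)
    z = sum(a * wk for a, wk in zip(agg, w))
    return [(1.0 - math.tanh(z) ** 2) * a for a in agg]


def finite_difference_grad(features, weights_e, w, h=1e-6):
    """Central-difference approximation of the gradient with respect to w."""
    grad = []
    for k in range(len(w)):
        up = list(w)
        up[k] += h
        down = list(w)
        down[k] -= h
        diff = predictor(features, weights_e, up) - predictor(features, weights_e, down)
        grad.append(diff / (2.0 * h))
    return grad


# Illustrative values only: three neighbours with two features each.
neighbour_features = [[1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]]
filter_row = [0.5, 0.25, 0.25]
w = [0.3, -0.7]

analytic = predictor_grad(neighbour_features, filter_row, w)
numeric = finite_difference_grad(neighbour_features, filter_row, w)
```

The two gradients should agree up to the finite-difference error, which reflects that the gradient always points along the aggregated feature vector $\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}$, scaled by $\sigma^{\prime}$.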

Using the chain rule for composite functions, we have

		$\displaystyle\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))$
		$\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\nabla f(\mathbf{x},\mathbf{w}_{D,t})-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\nabla f(\mathbf{x},\mathbf{w}_{D^{i},t})$
		$\displaystyle=\Big{[}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{]}\times\Big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\Big{)}.$		(B.4)

Under Assumption 1, an application of the triangle inequality leads to

	$\displaystyle\Big{\|}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{\|}$
	$\displaystyle\leq\Big{\|}\big{[}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\big{]}\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}\Big{\|}$
	$\displaystyle\quad+\Big{\|}\Big{[}\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{]}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\Big{\|}$
	$\displaystyle\leq a_{\sigma}\big{\|}f(\mathbf{x},\mathbf{w}_{D,t})-f(\mathbf{x},\mathbf{w}_{D^{i},t})\big{\|}+a_{\ell}\Big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D,t}-\mathbf{w}_{D^{i},t})\Big{\|}$
	$\displaystyle\leq(a_{\sigma}^{2}+a_{\ell})\Big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D,t}-\mathbf{w}_{D^{i},t})\Big{\|}$
	$\displaystyle\leq(a_{\sigma}^{2}+a_{\ell})\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}\|\Delta\mathbf{w}_{t}\|_{2}.$

This together with (B.3) yields the desired result. ∎

B.4 Deferred Proof for Lemma 6.

Proof for Lemma 6.

Again using the chain rule for composite functions, we have

		$\displaystyle\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))$
		$\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\nabla f(\mathbf{x},\mathbf{w}_{D,t})-\ell^{\prime}(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\nabla f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t})$
		$\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}\Big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\Big{)}-\ell^{\prime}(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}^{\prime}}}e_{\cdot j}(\mathbf{x}^{\prime})_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{(}\sum_{j\in N_{\mathbf{x}^{\prime}}}e_{\cdot j}\mathbf{x}^{\prime}_{j}\Big{)}$
		$\displaystyle\leq 2a_{\ell}a_{\sigma}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2},$		(B.5)

where we use the fact from Assumption 1 that the first order derivatives of $\ell$ and $\sigma$ are both bounded. This completes the proof of Lemma 6. ∎

B.5 Additional Lemma

Lemma 8.

Let $G=(V,E)$ be an undirected graph with a weighted adjacency matrix $g(\mathbf{L})$ and $\lambda^{max}_{G}$ be the maximum absolute eigenvalue of $g(\mathbf{L})$ . Let $G_{\mathbf{x}}$ be the ego-graph of a node $\mathbf{x}\in V$ with corresponding maximum absolute eigenvalue $\lambda^{max}_{G_{\mathbf{x}}}$ . Then the following eigenvalue bound holds for all $\mathbf{x}$ ,

\lambda^{max}_{G_{\mathbf{x}}}\leq\lambda^{max}_{G}.

Lemma 8 is proved in Verma and Zhang [2019], where the authors also show that $\mathbf{g}_{e}$ can be upper bounded in terms of the largest absolute eigenvalue of $g_{\mathbf{x}}(\mathbf{L})$ ; that is, $\mathbf{g}_{e}\leq\lambda^{max}_{G_{\mathbf{x}}}$ for all $\mathbf{x}$ . It then follows from Lemma 8 that

\displaystyle\mathbf{g}_{e}\leq\lambda^{max}_{G}.

(B.6)

Finally, plugging (B.6) and Lemma 7 into (A.1) yields the following result:

\displaystyle\beta_{n}\leq a_{\ell}^{2}a_{\sigma}^{2}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1}.

(B.7)
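To get a feel for how the right-hand side of (B.7) behaves in $n$ and $T$, it can be evaluated directly. The sketch below transcribes the formula as written; the function name `stability_bound` and all argument names are hypothetical, and any numeric values passed in are placeholders rather than constants from the paper.

```python
def stability_bound(a_ell, a_sigma, lam_max, g_e, eta, c_p_lambda, n, T):
    """Evaluate the right-hand side of (B.7).

    a_ell, a_sigma : constants from Assumption 1
    lam_max        : largest absolute eigenvalue of the graph filter g(L)
    g_e            : the quantity g_e from the proof (at most lam_max by (B.6))
    eta            : SGD step size
    c_p_lambda     : the constant C_{p,lambda}
    n, T           : sample size and number of iterations
    """
    # Per-iteration growth factor C_{p,lambda} * (1 + (a_sigma^2 + a_ell) * eta * g_e^2).
    rate = c_p_lambda * (1.0 + (a_sigma ** 2 + a_ell) * eta * g_e ** 2)
    # Finite geometric sum over t = 1, ..., T of rate^(t-1).
    geometric_sum = sum(rate ** (t - 1) for t in range(1, T + 1))
    prefactor = a_ell ** 2 * a_sigma ** 2 * lam_max * eta * c_p_lambda * g_e / n
    return prefactor * geometric_sum


# Placeholder constants, chosen only to exercise the formula.
example = stability_bound(a_ell=1.0, a_sigma=1.0, lam_max=2.0, g_e=2.0,
                          eta=0.01, c_p_lambda=1.0, n=100, T=1)
```

With these placeholders, doubling `n` halves the bound, while increasing `T` grows it through the geometric sum; the growth is exponential in `T` whenever the per-iteration factor exceeds 1.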

Appendix C Additional Experiments and Setup Details

C.1 General Setup

We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. [2008]. In every dataset, each document is represented as a sparse bag-of-words feature vector, and the relationships between documents are given by a list of citation links, which are treated as undirected edges and used to construct the adjacency matrix. The documents are divided into classes, and each document carries the corresponding class label. The data statistics for the datasets used in Section 5 are summarized in Table 2.

Dataset	Nodes	Edges	Classes	Features
Citeseer	$3,327$	$4,732$	$6$	$3,703$
Cora	$2,708$	$5,429$	$7$	$1,433$
Pubmed	$19,717$	$44,338$	$3$	$500$

Table 2: Dataset Statistics.

C.2 Additional Experimental Results

In this section, we investigate the stability and generalization of $\ell_{p}$ -regularized stochastic learning for GCN under different normalization steps. We mainly consider the implications of our results for the design of the following graph convolution filters:

(I) Unnormalized Graph Filters: $g(\mathbf{L})=\mathbf{A}+\mathbf{I}$ ;

(II) Normalized Graph Filters: $g(\mathbf{L})=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}+\mathbf{I}$ ;

(III) Random Walk Graph Filters: $g(\mathbf{L})=\mathbf{D}^{-1}\mathbf{A}+\mathbf{I}$ ;

(IV) Augmented Normalized Graph Filters: $g(\mathbf{L})=(\mathbf{D}+\mathbf{I})^{-1/2}(\mathbf{A}+\mathbf{I})(\mathbf{D}+\mathbf{I})^{-1/2}$ .
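As a small illustration of how the four filters differ in the spectral quantity $\lambda^{max}_{G}$ that enters (B.7), the sketch below builds each filter for a toy star graph and estimates its largest absolute eigenvalue by power iteration. The graph, the helper names, and the use of plain power iteration are assumptions made for this example; they are not the setup used in the experiments.

```python
import math


def graph_filters(adj):
    """Build filters (I)-(IV) from a dense 0/1 adjacency matrix.

    Assumes an undirected graph with no isolated nodes (all degrees > 0).
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    idx = range(n)
    return {
        "unnormalized": [[adj[i][j] + eye[i][j] for j in idx] for i in idx],
        "normalized": [[adj[i][j] / math.sqrt(deg[i] * deg[j]) + eye[i][j]
                        for j in idx] for i in idx],
        "random_walk": [[adj[i][j] / deg[i] + eye[i][j] for j in idx] for i in idx],
        "augmented": [[(adj[i][j] + eye[i][j]) / math.sqrt((deg[i] + 1) * (deg[j] + 1))
                       for j in idx] for i in idx],
    }


def largest_abs_eigenvalue(mat, iters=500):
    """Estimate the largest absolute eigenvalue by power iteration.

    Adequate here because each filter is entrywise non-negative with a
    positive diagonal on a connected graph, so a unique dominant eigenvalue exists.
    """
    n = len(mat)
    v = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iters):
        mv = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in mv))
        v = [x / value for x in mv]
    return value


# Toy star graph: node 0 is connected to nodes 1, 2, 3.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
spectra = {name: largest_abs_eigenvalue(mat)
           for name, mat in graph_filters(star).items()}
```

On this toy graph the unnormalized filter has the largest value ($1+\sqrt{3}$), the normalized and random walk filters both give 2, and the augmented normalized filter gives 1. On real graphs with high-degree nodes the unnormalized value grows with the degree, whereas the normalized variants stay bounded.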

For each experiment, we initialize the parameters of GCN models with the same random seeds and then train all models for a maximum of 200 epochs using the proposed Inexact Proximal SGD. We repeat the experiments ten times and report the average performance and the standard variance. We quantitatively measure the generalization gap defined as the absolute difference between the training and test errors and the parameter distance $\sqrt{\nicefrac{{\|\widehat{\mathbf{w}}-\widehat{\mathbf{w}}^{\prime}\|^{2}}}{{\big{(}\|\widehat{\mathbf{w}}\|^{2}+\|\widehat{\mathbf{w}}^{\prime}\|^{2}\big{)}}}}$ between the parameters $\widehat{\mathbf{w}}$ and $\widehat{\mathbf{w}}^{\prime}$ of two models trained on two copies of the data differing in a random substitution.
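The two reported quantities are straightforward to compute once the errors and the flattened parameter vectors are available. A minimal sketch, with hypothetical function names and with parameters represented as plain lists of floats:

```python
import math


def generalization_gap(train_error, test_error):
    """Absolute difference between training and test errors."""
    return abs(train_error - test_error)


def parameter_distance(w, w_prime):
    """sqrt(||w - w'||^2 / (||w||^2 + ||w'||^2)) for two flattened parameter vectors.

    Assumes the two vectors have equal length and are not both zero.
    """
    diff_sq = sum((a - b) ** 2 for a, b in zip(w, w_prime))
    norm_sq = sum(a * a for a in w) + sum(b * b for b in w_prime)
    return math.sqrt(diff_sq / norm_sq)
```

The normalization makes the distance scale-free: it is 0 for identical parameters, 1 for orthogonal parameters, and reaches its maximum $\sqrt{2}$ when $\widehat{\mathbf{w}}^{\prime}=-\widehat{\mathbf{w}}$.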

The unnormalized graph convolution filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets under different $p$ . Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap becomes constant after a certain number of iterations.
