This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Stability and Generalization of p\ell_{p}-Regularized Stochastic Learning for GCN

Anonymous Author(s)    Shiyu Liu1    Linsen Wei2111Equal contribution.    Shaogao Lv3222Corresponding author. E-mail: [email protected] &Ming Li4111Equal contribution. 1University of Electronic Science and Technology of China, China
2School of Astronautics, Northwestern Polytechnical University, China
3Department of Statistics and Data Science, Nanjing Audit University, China
4Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, China
Abstract

Graph convolutional networks (GCN) are viewed as one of the most popular representations among the variants of graph neural networks over graph data and have shown powerful performance in empirical experiments. That 2\ell_{2}-based graph smoothing enforces the global smoothness of GCN, while (soft) 1\ell_{1}-based sparse graph learning tends to promote signal sparsity to trade for discontinuity. This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general p\ell_{p}-regularized (1<p2)(1<p\leq 2) stochastic learning proposed within. While stability-based generalization analyses have been given in prior work for a second derivative objectiveness function, our p\ell_{p}-regularized learning scheme does not satisfy such a smooth condition. To tackle this issue, we propose a novel SGD proximal algorithm for GCNs with an inexact operator. For a single-layer GCN, we establish an explicit theoretical understanding of GCN with the p\ell_{p}-regularized stochastic learning by analyzing the stability of our SGD proximal algorithm. We conduct multiple empirical experiments to validate our theoretical findings.

1 Introduction

Graph Neural Networks (GNNs) have emerged as a family of powerful model designs for improving the performance of neural network models on graph-structured data. GNNs have delivered remarkable empirical performance from a diverse set of domains, such as social networks, knowledge graphs, and biological networks Duvenaud et al. (2015); Battaglia et al. (2016); Defferrard et al. (2016); Jin et al. (2018); Barrett et al. (2018); Yun et al. (2019); Zhang et al. (2020). In fact, GNNs can be viewed as natural extensions of conventional machine learning for any data where the available structure is given by pairwise relationships.

The architecture designs of various GNNs have been motivated mainly by spectral domain Defferrard et al. (2016); Kipf and Welling (2016a) and spatial domain Hamilton et al. (2017); Gilmer et al. (2017). Some popular variants of graph neural networks include Graph Convolutional Network (GCN) Bruna et al. (2013), GraphSAGE Hamilton et al. (2017), Graph Attention Network Veličković et al. (2017), Graph Isomorphism Network Xu et al. (2018), and among others.

Specifically, inherited excellent performances of traditional convolutional neural networks in processing image and time series, a standard GCN Kipf and Welling (2016a) also consists of a cascade of layers, but operates directly on a graph and induces embedding vectors of nodes based on properties of their neighborhoods. Formally, GCN is defined as the problem of learning filter parameters in the graph Fourier transform. GCNs have shown superior performances on real datasets from various domains, such as node labeling on social networks Kipf and Welling (2016b), link prediction in knowledge graphs Schlichtkrull et al. (2018), and molecular graph classification in quantum chemistry Gilmer et al. (2017).

Notably, a recent study Ma et al. (2021) has proven that GCN, even for general messaging passing models, intrinsically performs the 2\ell_{2}-based graph smoothing signal, which enforces smoothness globally, and the level and smoothness are often shared across the whole graph. As opposed to 2\ell_{2}-based graph smoothing, 1\ell_{1}-based methods tend to penalize large values less and thus preserve discontinuity of non-smooth signal better Nie et al. (2011); Wang et al. (2015); Liu et al. (2021). Essentially, 1\ell_{1}-based methods are equivalent to soft-thresholding operations for iterative estimators and guarantee statistical properties (e.g., model selection consistency). Owning to these advantages, trend filtering Tibshirani (2014), and graph trend filter Wang et al. (2015); Verma and Zhang (2019) indicate that 1\ell_{1}-based graph smoothing can adapt to the inhomogeneous level of smoothness of signals and yield estimator with kk-th piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator.

To enhance the local smoothness adaptivity of GCNs, a family of elastic-type GCNs with a combination of 2\ell_{2} and 1\ell_{1}-based penalties are proposed by Liu et al. (2021), which demonstrate that the elastic GCNs obtain better adaptivity on benchmark datasets and are significantly robust to graph adversarial attacks.

Under the regularized learning framework, this paper further studies a class of p\ell_{p}-regularized learning approaches (1<p2)(1<p\leq 2), in order to trade-off the local smoothness of GCNs and sparsity between nodes. To be precise, it has been shown that an extreme case for p\ell_{p}-regularized learning with p1p\rightarrow 1 tends to generate soft-sparsity of solutions Koltchinskii (2009). In analogy to elastic-type GCNs, general p\ell_{p}-regularization can be interpreted as an interpolation of 1\ell_{1}-regularization and 2\ell_{2}-regularization. However, in contrast with the elastic-type GCNs, p\ell_{p}-regularized GCNs only involve a regularized parameter but leads to some additional technical difficulties in optimization and theoretical analysis.

The generalization performance of an algorithm has always been a central issue in learning theory, and particularly the generalization guarantees of GNNs has attracted a considerable amount of attention in recent years Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020).

It is worth noting that, although 1\ell_{1}-regularization possesses a number of attractive properties such as “automatic” variable selection, the objectiveness with p=1p=1 is not a strictly convex smooth function, which has been proved not to be uniform stability Xu et al. (2011). On the other hand, it is known that for p>1p>1, the penalty is a strictly convex function over bounded domains and thus enjoys robustness to some extent. As a bridge of sparse 1\ell_{1}-regularization and dense 2\ell_{2}-regularization, p\ell_{p}-based learning allows us to explicitly observe a trade-off between sparsity and smoothness of the estimated learners.

Although this idea is natural within the framework of regularized learning, it still faces many technical challenges when applied to GCNs. First, to address the problem of uniform stability of an SGD algorithm that can induce generalization performance Bousquet and Elisseeff (2002), existing related analysis for SGD Verma and Zhang (2019); Hardt et al. (2016) required that the first derivative of the objectiveness is Lipschitz continuous, which is not applicable to p\ell_{p}-based GCN. Second, to derive an interpretable generalization bound, it is important to know how such a result depends on the structure of the graph filter and the regularized parameter pp, as well as the network size and the sample size.

Overall, our principal contributions can be summarized as follows:

  • We introduce p\ell_{p}-regularized learning approaches for one-layer GCN to provide an explicit theoretical characterization of the trade-off between local smoothness and sparsity of the SGD algorithm for GCN. Crucially, we analyze how this trade-off between the graph structure and the regularization parameter pp affects the generalization capacity of our SGD method.

  • We propose a novel regularized stochastic algorithm for GCN, i.e., Inexact Proximal SGD, by integrating the standard SGD projection and the proximal operator. For our proposed method, we derive interpretable generalization bounds in terms of the graph structure, the regularized parameter pp, the network width, and the sample size.

  • To our knowledge, we are the first to analyze the generalization performance of p\ell_{p}-SGD in GNNs, which is quite different from existing stability-based generalization analysis for a second derivative objectiveness Verma and Zhang (2019); Hardt et al. (2016). We also have to overcome additional challenges in understanding the nature of the stochastic gradient for GCN, posed by the message passing nature in GCN. We conduct several numerical experiments to illustrate the superiority of our method to traditional smooth-based GCN, and we also observe some sparse solutions through our experiments as pp is sufficiently close to 11.

1.1 Additional Related Work

In this subsection, we briefly review two kinds of related works: Generalization Analysis for GNNs and Regularized Schemes on GNNs.

Generalization Analysis for GNNs.

Some previous work has attempted to address the generalization guarantees of GNNs, including Verma and Zhang (2019); Garg et al. (2020); Liao et al. (2020); Oono and Suzuki (2020). However, most of these works established uniform convergence results using classical Rademacher complexityGarg et al. (2020); Oono and Suzuki (2020) and PAC bounds Liao et al. (2020). Compared to these abstract capacity notations, the stability concept, used recently for GNN in Verma and Zhang (2019), is more intuitive and directly defined over a specific algorithm. Although the stability for generalization of GNNs has been considered recently by Verma and Zhang (2019), a major difference from their work is that we focus on the trade-off between soft sparsity and generalization of the p\ell_{p}-based GCN with varying p(1,2]p\in(1,2]. Moreover, we propose a new SGD algorithm for GCN, under which we provide novel theoretical proof for the stability bound of our SGD.

Regularized Schemes on GNNs.

Regularization methods are frequently used in machine learning, especially the prior literature Wibisono et al. (2009) has studied the influence of p\ell_{p}-regularized learning on generalization ability. Note that the previous work only considered the impact of general regularization estimates on stability without a specific algorithm. In addition, their research is based on regular data and does not involve any graph structure. As far as we know, this paper is the first time to consider p\ell_{p}-regularized learning in a graph model.

2 Preliminaries and Methodology

In this section, we first describe basic notation on graph and standard versions of GCN. Then we introduce the structural empirical risk minimization for a single-layer GCN model under i.i.d. sampling process, and thus our p\ell_{p}-regularized approaches is naturally formulated to estimate the graph filter parameters.

A graph is represented as G=(V,E)G=(V,E), where V={ν1,ν2,,νn}V=\{\nu_{1},\nu_{2},...,\nu_{n}\} is a set of n=|V|n=|V| nodes and EE is a set of |E||E| edges. The adjacency matrix of the graph is denoted by 𝐀=(aij)n×n\mathbf{A}=(a_{ij})\in\mathbb{R}^{n\times n}, whose entries aij=1a_{ij}=1 if (νi,νj)E(\nu_{i},\nu_{j})\in E, and aij=0a_{ij}=0 otherwise. Each node’s own feature vector is denoted by 𝐱id\mathbf{x}_{i}\in\mathbb{R}^{d}, i[n]i\in[n], where dd is the dimension of the node feature. Let 𝐗n×d\mathbf{X}\in\mathbb{R}^{n\times d} denote the node feature matrix with each row being dd features. The 11-hop neighborhood of a node νi\nu_{i} is defined as the set {νj,(νi,νj)E}\{\nu_{j},(\nu_{i},\nu_{j})\in E\}, and denote by N(i)N_{(i)} the set that includes the node νi\nu_{i} and all nodes belonging to its 11-hop neighborhood. The main task of graph models is to combine the feature information and the edge information to perform learning tasks.

For an undirected graph, its Laplacian matrix 𝐋n×n\mathbf{L}\in\mathbb{R}^{n\times n} is defined as 𝐋:=𝐃𝐀\mathbf{L}:=\mathbf{D}-\mathbf{A}, where 𝐃n×n\mathbf{D}\in\mathbb{R}^{n\times n} is a degree diagonal matrix whose diagonal entry dii=jaijd_{ii}=\sum_{j}a_{ij} for i[n]i\in[n]. The semi-definite matrix 𝐋\mathbf{L} has an eigen-decomposition written by 𝐋=𝐔𝚲𝐔T\mathbf{L}=\mathbf{U}\bm{\Lambda}\mathbf{U}^{T}, where the columns of 𝐔\mathbf{U} are the eigenvectors of 𝐋\mathbf{L} and the diagonal entries of diagonal matrix 𝚲\bm{\Lambda} are the non-negative eigenvalues of 𝐋\mathbf{L}.

For a fixed function gg, we define a graph filter g(𝐋)n×ng(\mathbf{L})\in\mathbb{R}^{n\times n} as a function on the graph Laplacian 𝐋\mathbf{L}. Following the eigen-decomposition of 𝐋\mathbf{L}, we get g(𝐋)=𝐔g(𝚲)𝐔Tg(\mathbf{L})=\mathbf{U}g(\bm{\Lambda})\mathbf{U}^{T}, where the eigenvalues are given by λi(g)={g(λi),1in}\lambda_{i}^{(g)}=\{g(\lambda_{i}),1\leq i\leq n\}. We define λGmax=max{|λi(g)|}\lambda^{max}_{G}=\max\{|\lambda_{i}^{(g)}|\} as the largest absolute eigenvalue of the graph filter g(𝐋)g(\mathbf{L}).

In this paper, we are concerned with node-level semi-supervised learning problems over the graph GG. Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} be the input space in which the node feature is well defined, and accordingly 𝒴\mathcal{Y}\subset\mathbb{R} be the output space. In the semi-supervised setting, one assumes that only a portion of the training samples are labeled while amounts of unlabeled data are collected easily. Precisely, we merely collect a training set with labels D={𝐳i=(𝐱i,yi)}i=1mD=\{\mathbf{z}_{i}=(\mathbf{x}_{i},y_{i})\}_{i=1}^{m} with mnm\ll n. For statistical inference, one often assumes that these pairwise sample are independently drawn from a joint distribution ρ\rho defined on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}. In such case, our studied model belongs to node-focused tasks on graph, as opposed to graph-focused tasks where the whole graph can be viewed as a single sample.

The most simple graph neural network, known as the Vanilla GCN, was proposed in Kipf and Welling (2016a), where each layer of a multilayer network is multiplied by the graph filter before applying a nonlinear activation function. In a matrix form, a conventional multi-layer GCN is represented by a layer-wise propagation rule

𝐇(k+1)=σ(g(𝐋)𝐇(k)𝐖(k+1)),\displaystyle\mathbf{H}^{(k+1)}=\sigma\big{(}g(\mathbf{L})\mathbf{H}^{(k)}\mathbf{W}^{(k+1)}\big{)}, (2.1)

where 𝐇(k+1)n×mk+1\mathbf{H}^{(k+1)}\in\mathbb{R}^{n\times m_{k+1}} is the node feature representation output by the (k+1)(k+1)-th GCN layer, and specially 𝐇(0)=𝐗\mathbf{H}^{(0)}=\mathbf{X} and m0=dm_{0}=d. 𝐖(k+1)mk×mk+1\mathbf{W}^{(k+1)}\in\mathbb{R}^{m_{k}\times m_{k+1}} represents the estimated weight matrix of the (k+1)(k+1)-th GCN layer, and σ\sigma is a point-wise nonlinear activation function. Under the context of GCN for performing semi-supervised learning, the sampling procedure of nodes from the graph GG is conducted by two stages. We assume node data are sampled in an i.i.d. manner by first choosing a sample 𝐱i\mathbf{x}_{i} or 𝐳i\mathbf{z}_{i} at node ii, and then extracting its neighbors from GG to form an ego-graph.

To interpret learning mechanism clearly for GCN, this paper focuses on a node-level task over graph with a single layer GCN model. In such case, putting all graph nodes together, the output function can be written in a matrix form as follows,

f(𝐗,𝐰)=σ(g(𝐋)𝐗𝐰),f(\mathbf{X},\mathbf{w})=\sigma\big{(}g(\mathbf{L})\mathbf{X}\mathbf{w}\big{)}, (2.2)

where 𝐰d\mathbf{w}\in\mathbb{R}^{d} and f(𝐗,𝐰)nf(\mathbf{X},\mathbf{w})\in\mathbb{R}^{n}. Some commonly used graph filters include a linear function of 𝐀\mathbf{A} as g(𝐀)=𝐀+𝐈g(\mathbf{A})=\mathbf{A}+\mathbf{I} Xu et al. (2018) or a Chebyshev polynomial of 𝐋\mathbf{L} Defferrard et al. (2016).

Under the context of ego-graph, each node contains the complete information needed for computing the output of a single layer GCN model. Given node with the feature 𝐱\mathbf{x}, let N𝐱N_{\mathbf{x}} denote a set of the neighbor indexes at most 11-hop distance neighbors, which is completely determined by the graph filter g(𝐋)g(\mathbf{L}). Thus we can rewrite the predictor (2.2) for a single node prediction as

f(𝐱,𝐰)=σ(jN𝐱ej𝐱jT𝐰),f(\mathbf{x},\mathbf{w})=\sigma\big{(}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)}, (2.3)

where ej=[g(𝐋)]je_{\cdot j}=[g(\mathbf{L})]_{\cdot j}\in\mathbb{R} is regraded as a weighted edge between node 𝐱\mathbf{x} and its neighbor 𝐱j\mathbf{x}_{j}, and particularly it still holds jN𝐱j\in N_{\mathbf{x}} if and only if ej0e_{\cdot j}\neq 0.

Let (,)\ell(\cdot,\cdot) be a convex loss function, measuring the difference between a predictor and the true label, a variety of supervised learning problems in machine learning can be formulated as a minimization of the expectation risk,

minf𝔼[(Y,f(X))],\min_{f\in\mathcal{F}}\mathbb{E}\big{[}\ell\big{(}Y,f(X)\big{)}\big{]}, (2.4)

where \mathcal{F} is a hypothesis space under which an optimal learning rule is generated. The standard regression problems correspond to the square loss given by (u,v)=(uv)2\ell(u,v)=(u-v)^{2}, and the logistic loss is widely used for classification. The optimal decision function denoted by f0f_{0} is any minimizer of (2.4) when \mathcal{F} is taken to be the space of all measurable functions. However, f0f_{0} can not be computed directly, due to the fact that ρ\rho is often unknown and \mathcal{F} is too complex to compute it possibly. Instead, a frequently used method consists of minimizing a regularized empirical risk over a computationally-feasible space

minf1mi=1m(yi,f(𝐱i))+Ω(f),\displaystyle\min_{f\in\mathcal{H}}\frac{1}{m}{\sum}_{i=1}^{m}\ell\big{(}y_{i},f(\mathbf{x}_{i})\big{)}+\Omega(f), (2.5)

where Ω(f)\Omega(f) is a penalty function that regularizes the complexity of the function ff, while \mathcal{H} is the space of all predictors which needs to be parameterized explicitly or implicitly. For instance, the neural network is known as an efficient parameterized approximation to any complex nonlinear function. In the work, all the functions within \mathcal{H} consist of the form in (2.3).

2.1 Empirical Risk Minimization with p\ell_{p}-Regularizer

A fundamental class of learning algorithms can be described as the regularized empirical risk minimization problems. This paper considers an p\ell_{p}-regularized learning approach for training the parameters of GCN:

𝐰^argmin𝐰1ni=1n(yi,f(𝐱i,𝐰))+λ𝐰pp,\displaystyle\widehat{\mathbf{w}}\in\arg\min_{\mathbf{w}}\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{w})\big{)}+\lambda\|\mathbf{w}\|_{\ell_{p}}^{p}, (2.6)

where 1<p21<p\leq 2 and the p\ell_{p}-norm on d\mathbb{R}^{d} is defined as 𝐰pp=j=1d|wj|p\|\mathbf{w}\|_{\ell_{p}}^{p}=\sum_{j=1}^{d}|w_{j}|^{p}. For any 1<pp21<p^{\prime}\leq p\leq 2, there always holds 𝐰p𝐰p\|\mathbf{w}\|_{\ell_{p}}\leq\|\mathbf{w}\|_{\ell_{p^{\prime}}}, which means that any learning method with the p\ell_{p^{\prime}}-regularizer imposes heavier penalties for the parameters than the one with the p\ell_{p}-regularizer. Specially when p1p\rightarrow 1, the corresponding p\ell_{p}-regularized algorithm tends to generate so called soft sparse solutions Koltchinskii (2009). In contrast to this, the commonly-used 2\ell_{2}-regularization tends to generate smooth but non-sparse solutions.

Note that we do not require that the minimizer of (2.6) is unique, catering to non-convex problems. The following lemma tells us that any global minimizer of (2.6) can be upper bounded by a quantity that is inversely proportional to λ\lambda. This simple conclusion is useful in subsequent sections for designing a constrained SGD and our theoretical analysis with respect to the function 𝐰pp\|\mathbf{w}\|_{\ell_{p}}^{p}.

Lemma 1.

Assume that (y,σ(0))B\ell(y,\sigma(0))\leq B with some B>0B>0. For any λ>0\lambda>0, any global minimizer 𝐰^\widehat{\mathbf{w}} of (2.6) satisfies 𝐰^ppB/λ\|\widehat{\mathbf{w}}\|_{\ell_{p}}^{p}\leq{B}/{\lambda}, and furthermore 𝐰^2(B/λ)1/p\|\widehat{\mathbf{w}}\|_{\ell_{2}}\leq\big{(}{B}/{\lambda}\big{)}^{1/p}.

Proof of Lemma 1..

Since 𝐰^\widehat{\mathbf{w}} is a global minimizer of the objective function in (2.6), this follows that

λ𝐰^pp\displaystyle\lambda\|\widehat{\mathbf{w}}\|_{\ell_{p}}^{p} 1ni=1n(yi,f(𝐱i,𝟎))+λ𝟎ppB.\displaystyle\leq\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{0})\big{)}+\lambda\|\mathbf{0}\|_{\ell_{p}}^{p}\leq B.

Moreover, for any 1<p21<p\leq 2, it is known that 𝐰2𝐰p\|\mathbf{w}\|_{\ell_{2}}\leq\|\mathbf{w}\|_{\ell_{p}} for any 𝐰d\mathbf{w}\in\mathbb{R}^{d}. This completes the proof of the lemma. ∎

Lemma 1 implies that the empirical solutions of (2.6) are in an p\ell_{p}-ball of certain radius which depends on the regularized parameter λ\lambda. This shows that it suffices to analyze statistical behaviors of the given estimator projected into this ball.

3 Regularized Stochastic Algorithm

In order to effectively solve non-convex problems with massive data, practical algorithms for machine learning are increasingly constrained to spend less time and use less memory, and can also escape from saddle points that often appear in non-convex problems and tend to converge to a good stationary point. Stochastic gradient descent (SGD) is perhaps the simplest and most well studied algorithm that enjoys these advantages. The merits of SGD for large scale learning and the associated computation versus statistics tradeoffs is discussed in detail by the seminal work of Bottou and Bousquet (2007).

A standard assumption to analyze SGD in the literature is that the derivative of the objective function is Lipschitz smooth Verma and Zhang (2019); Hardt et al. (2016), however, the p\ell_{p}-regularized learning does not meet such condition. To address this issue, we propose a new SGD for (2.6) with an inexact proximal operator, and then develop a novel theoretical analysis for an upper bound of uniform stability Bousquet and Elisseeff (2002), which is an algorithm-dependent sensitivity-based measurement used for characterizing generalization performance in learning theory.

Given that a positive pair (p,q)(p,q) satisfies the equality 1/p+1/q=1\nicefrac{{1}}{{p}}+\nicefrac{{1}}{{q}}=1, then the norms 𝐰p\|\mathbf{w}\|_{p} and 𝐰q\|\mathbf{w}\|_{q} are dual to each other. Moreover, the pair of functions (1/2)𝐰p2(1/2)\|\mathbf{w}\|^{2}_{p} and (1/2)𝐰q2(1/2)\|\mathbf{w}\|^{2}_{q} are conjugate functions of each other. As a consequence, their gradient mappings are a pair of inverse mapping. Formally, let p(1,2]p\in(1,2] and q=p/(p1)q=p/(p-1), and define the mapping Φ:EE\Phi:\mathrm{E}\rightarrow\mathrm{E}^{*} with

Φj(𝐰)=j(12𝐰p2)=sgn(wj)|wj|p1𝐰pp2,j=1,2,,d,\Phi_{j}(\mathbf{w})=\nabla_{j}\big{(}\frac{1}{2}\|\mathbf{w}\|^{2}_{p}\big{)}=\frac{\hbox{sgn}(w_{j})|w_{j}|^{p-1}}{\|\mathbf{w}\|_{p}^{p-2}},\,j=1,2,...,d,

and the inverse mapping Φ1:EE\Phi^{-1}:\mathrm{E}^{*}\rightarrow\mathrm{E} with

Φj1(𝐯)=j(12𝐯q2)=sgn(vj)|vj|q1𝐯qq2,j=1,2,,d.\Phi^{-1}_{j}(\mathbf{v})=\nabla_{j}\big{(}\frac{1}{2}\|\mathbf{v}\|^{2}_{q}\big{)}=\frac{\hbox{sgn}(v_{j})|v_{j}|^{q-1}}{\|\mathbf{v}\|_{q}^{q-2}},\,j=1,2,...,d.

The above conjugate property on p\ell_{p}-space and q\ell_{q}-space is very useful for bounding uniform stability without the help of strong smoothness, while the latter is a standard assumption in optimization.

3.1 SGD with Inexact Proximal Operator

We write Li(𝐰):=(yi,f(𝐱i,𝐰))L_{i}(\mathbf{w}):=\ell(y_{i},f(\mathbf{x}_{i},\mathbf{w})) for notational simplicity, and define a local quadratic approximation of LiL_{i} at point 𝐰D,t\mathbf{w}_{D,t} as

Pi,ri(𝐰,𝐰D,t):=\displaystyle P_{i,r_{i}}(\mathbf{w},\mathbf{w}_{D,t}):= Li(𝐰D,t)+𝐰𝐰D,t,Li(𝐰D,t)2\displaystyle L_{i}(\mathbf{w}_{D,t})+\langle\mathbf{w}-\mathbf{w}_{D,t},\nabla L_{i}(\mathbf{w}_{D,t})\rangle_{2}
+ri𝐰𝐰D,t22.\displaystyle+r_{i}\|\mathbf{w}-\mathbf{w}_{D,t}\|_{2}^{2}.

At each iteration tt, let iti_{t} be a random index sampled uniformly from [n][n] on DD. Then replacing 1ni=1nLi(𝐰)\frac{1}{n}\sum_{i=1}^{n}L_{i}(\mathbf{w}) in (2.6) by the quadratic term Pit,rit(𝐰,𝐰D,t)P_{i_{t},r_{i_{t}}}(\mathbf{w},\mathbf{w}_{D,t}), we propose an inexact proximal method for SGD with the p\ell_{p}-regularizer. Precisely, we are concerned with the following iterative to update 𝐰D,t\mathbf{w}_{D,t} for a regularized-based SGD,

min𝐰Pit,rit(𝐰,𝐰D,t)+λt𝐰pp.\displaystyle\min_{\mathbf{w}}P_{i_{t},r_{i_{t}}}(\mathbf{w},\mathbf{w}_{D,t})+\lambda_{t}\|\mathbf{w}\|_{\ell_{p}}^{p}. (3.1)

In view of the boundedness of 𝐰^2\|\widehat{\mathbf{w}}\|_{\ell_{2}} given in Lemma 1, we adopt the projection technique to execute a constrained SGD over the empirical risk term in (3.1). To this end, we define the projection onto a set 𝒞\mathcal{C} by

Π𝒞(𝐯):=argmin𝐰𝒞𝐰𝐯2.\Pi_{\mathcal{C}}(\mathbf{v}):=\arg\min_{\mathbf{w}\in\mathcal{C}}\|\mathbf{w}-\mathbf{v}\|_{2}.

This reveals that the definition of projection is an optimization problem in itself. In our case, the set we adopt is given as

𝒞:=𝒞λ={𝐰d,𝐰2(B/λ)1/p}.\mathcal{C}:=\mathcal{C}_{\lambda}=\big{\{}\mathbf{w}\in\mathbb{R}^{d},\,\,\|\mathbf{w}\|_{2}\leq(B/\lambda)^{1/p}\big{\}}.

It is well known that, if 𝒞=2(1)\mathcal{C}=\mathcal{B}_{2}(1), i.e., the unit 2\ell_{2} ball, then projection is equivalent to a normalization step

Π2(1)(𝐯)={𝐯/𝐯2if𝐯2>1,𝐯otherwise.\Pi_{\mathcal{B}_{2}(1)}(\mathbf{v})=\begin{cases}\mathbf{v}/\|\mathbf{v}\|_{2}&\text{if}\,\,\|\mathbf{v}\|_{2}>1,\\ \mathbf{v}&\text{otherwise}.\end{cases}

Up to the terms that do not depend on 𝐰\mathbf{w}, summing the objective function in (3.1) and the projection formulate our proposed algorithm as follows

𝐯D,it\displaystyle\mathbf{v}_{D,i_{t}} =Π𝒞λ(𝐰D,t(ηLit(𝐰D,t)),\displaystyle=\Pi_{\mathcal{C}_{\lambda}}\big{(}\mathbf{w}_{D,t}-(\eta\nabla L_{i_{t}}(\mathbf{w}_{D,t})\big{)}, (3.2)
𝐰D,t+1\displaystyle\mathbf{w}_{D,t+1} =argmin𝐰{12𝐰𝐯D,it22+λt𝐰pp},\displaystyle=\arg\min_{\mathbf{w}}\big{\{}\frac{1}{2}\big{\|}\mathbf{w}-\mathbf{v}_{D,i_{t}}\big{\|}_{2}^{2}+\lambda_{t}\|\mathbf{w}\|_{\ell_{p}}^{p}\big{\}}, (3.3)

where η>0\eta>0 is the learning rate that depends on ritr_{i_{t}}, and the λt\lambda_{t} may depend on λ\lambda and ritr_{i_{t}}.

Remark 1.

The update rule in (3.3) is seen as a contraction of conventional SGD, see Lemma 2 below for details. We will obtain analytical solutions of (3.3) for some specific pp (e.g. p=1,2p=1,2). Although there is no analytical solutions for general 1<p21<p\leq 2, the objective function in (3.3) is strongly convex over bounded domains and thus a global convergence can be guaranteed. For the ease of notation, we still denote by 𝐰D,t+1\mathbf{w}_{D,t+1} the realized numerical solution with ignoring the inner optimization error.

Lemma 2.

For 1<p21<p\leq 2 and a vector 𝐯d\mathbf{v}\in\mathbb{R}^{d} is given, we define

𝐰=argmin𝐰{12𝐰𝐯22+λ𝐰pp}:=Proλ,p(𝐯).\displaystyle\mathbf{w}^{*}=\arg\min_{\mathbf{w}}\big{\{}\frac{1}{2}\|\mathbf{w}-\mathbf{v}\|_{2}^{2}+\lambda\|\mathbf{w}\|_{\ell_{p}}^{p}\big{\}}:=\hbox{Pro}_{\lambda,p}(\mathbf{v}). (3.4)

Then, we conclude that

|wj|min{|vj|,(|vj|/(λp))1/(p1)},j=1,2,d.|w_{j}^{*}|\leq\min\{|v_{j}|,(|v_{j}|/(\lambda p))^{1/(p-1)}\},\quad\,\forall\,j=1,2...,d.

We defer the proof of Lemma 2 to the Appendix. Applying Lemma 2 and the projection onto 𝒞λ\mathcal{C}_{\lambda} in (3.2), we have the following inequality

𝐰D,t2(B/λ)1/p,t,λ>0.\displaystyle\|\mathbf{w}_{D,t}\|_{2}\leq(B/\lambda)^{1/p},\quad\forall\,t,\lambda>0. (3.5)

Consider a function h:h:\mathbb{R}\rightarrow\mathbb{R} defined as h(θ)=|θ|ph(\theta)=|\theta|^{p} (1<p2)(1<p\leq 2). Note that it holds

h(θ)=p.sign(θ).|θ|p1.h^{\prime}(\theta)=p.\hbox{sign}(\theta).|\theta|^{p-1}.

Obviously the first derivative of hh exists and is continuous for all θ\theta\in\mathbb{R}. However, we notice that the inexact proximal operator Proλ,p\hbox{Pro}_{\lambda,p} is not Lipschitz, due to the fact that pp\|\cdot\|_{\ell_{p}}^{p} is not strongly smooth. Hence, as mentioned earlier, those conventional technique analysis under strongly smooth condition for objective functions are no longer valid in our case. Fortunately, the function hh is strongly convex over bounded domains, as shown in Lemma 1, which enables us to avoid restrictive smooth assumptions by an alternative proof strategy.

Refer to caption
Figure 1: Generalization Gaps as a function of the number of epochs under various pp on three datasets. We observe that the miniature pp achieves weaker generalization gap and thus worse than the significant pp.

4 Stability and Generalization Bounds

In this section, we provide algorithm-dependent generalization bounds via the notion of stability for p\ell_{p}-regularized GCN. To this end, we first introduce the notion of algorithmic stability and thereby present a generalization bound associated with the algorithmic stability.

Let 𝒜D\mathcal{A}_{D} be a learning algorithm trained on dataset DD, which can be viewed as a map from DD\rightarrow\mathcal{H}. For GCN, we set 𝒜D=f(𝐱,𝐰D)\mathcal{A}_{D}=f(\mathbf{x},\mathbf{w}_{D}). The overall learning performance of 𝒜D\mathcal{A}_{D} is measured by the following expected risk:

R(𝒜D):=𝔼𝐳[(y,f(𝐱,𝐰D))].R(\mathcal{A}_{D}):=\mathbb{E}_{\mathbf{z}}\big{[}\ell\big{(}y,f(\mathbf{x},\mathbf{w}_{D})\big{)}\big{]}.

Accordingly, the empirical risk of 𝒜D\mathcal{A}_{D} with the loss \ell is given as

Rn(𝒜D):=1ni=1n(yi,f(𝐱i,𝐰D)).R_{n}(\mathcal{A}_{D}):=\frac{1}{n}{\sum}_{i=1}^{n}\ell\big{(}y_{i},f(\mathbf{x}_{i},\mathbf{w}_{D})\big{)}.

Even when the sample is fixed, 𝒜D\mathcal{A}_{D} may be still a randomized algorithm due to the randomness of algorithm procedure (e.g. SGD). In this context, we define the expected generalization error as

Egen:=𝔼𝒜[R(𝒜D)Rn(𝒜D)],E_{\text{gen}}:=\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]},

where the expectation 𝔼𝒜\mathbb{E}_{\mathcal{A}} is taken over the inherent randomness of 𝒜D\mathcal{A}_{D}.

For a randomized algorithm, to introduce the notation of its uniform stability, we need to define two datasets as follows. Given the training set DD defined as above, we introduce two related sets in the following:

Removing the ii-th data point in the set DD is represented as

D\i={𝐳1,𝐳2,𝐳i1,𝐳i+1,,𝐳n},D^{\backslash i}=\{\mathbf{z}_{1},\mathbf{z}_{2},...\mathbf{z}_{i-1},\mathbf{z}_{i+1},...,\mathbf{z}_{n}\},

and replacing the ii-th data point in DD by 𝐳i\mathbf{z}^{{}^{\prime}}_{i} is represented as

Di={𝐳1,𝐳2,𝐳i1,𝐳i,𝐳i+1,𝐳n}.D^{i}=\{\mathbf{z}_{1},\mathbf{z}_{2},...\mathbf{z}_{i-1},\mathbf{z}^{{}^{\prime}}_{i},\mathbf{z}_{i+1}...,\mathbf{z}_{n}\}.
Definition 1.

A randomized learning algorithm 𝒜D\mathcal{A}_{D} is βn\beta_{n}-uniformly stable with respect to a loss function \ell, if it satisfies

supD,𝐳|𝔼𝒜[(y,f(𝐱,𝐰D))]𝔼𝒜[(y,f(𝐱,𝐰D\i))]|βn.\sup_{D,\mathbf{z}}\big{|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{\backslash i}}))]\big{|}\leq\beta_{n}.

By the triangle inequality, the following result on another uniform stability associated with SiS^{i} holds

supD,𝐳|𝔼𝒜[(y,f(𝐱,𝐰D))]𝔼𝒜[(y,f(𝐱,𝐰Di))]|2βn.\sup_{D,\mathbf{z}}\big{|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big{|}\leq 2\beta_{n}.

Stability is property of a learning algorithm, roughly speaking, if two training samples are close to each other, a stable algorithm will generate close output results. There are many variants of stability, such as hypothesis stability Kearns and Ron (1999), sample average stability Shalev-Shwartz et al. (2010) and uniform stability. This paper will focus on the uniform stability, since it is closely related to other types of stability.

The following lemma shows that a randomized learning algorithm with uniform stability can guarantee meaningful generalization bound, which has been proved in Verma and Zhang (2019).

Lemma 3.

A uniform stable randomized algorithm (𝒜D,βn)(\mathcal{A}_{D},\beta_{n}) with a bounded loss function 0(y,f(𝐱))B0\leq\ell(y,f(\mathbf{x}))\leq B, satisfies the following generalization bound with probability at least 1δ1-\delta, over the random draw of D,𝐳D,\mathbf{z} with δ(0,1)\delta\in(0,1),

𝔼𝒜[R(𝒜D)Rn(𝒜D)]2βn+(4nβn+B)log(1/δ)2n.\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]}\leq 2\beta_{n}+(4n\beta_{n}+B)\sqrt{\frac{\log(1/\delta)}{2n}}.

We now give some smooth assumptions on loss function and activation function used for analyzing the stability of stochastic gradient descent. The following assumptions are very standard in optimization literature.

Assumption 1 (Smoothness for loss function and activation function).

We assume that the loss function is lipschitz continuous and smooth,

|(y,f())(y,f())|a|f()f()|.f,f\displaystyle|\ell(y,f(\cdot))-\ell(y,f^{\prime}(\cdot))|\leq a_{\ell}|f(\cdot)-f^{\prime}(\cdot)|.\quad\forall\,f,f^{\prime}\in\mathcal{H}
|(y,f())(y,f())|b|f()f()|,\displaystyle|\ell^{\prime}(y,f(\cdot))-\ell^{\prime}(y,f^{\prime}(\cdot))|\leq b_{\ell}|f(\cdot)-f^{\prime}(\cdot)|,

where aa_{\ell} and bb_{\ell} are two positive constants. Similarly, the activation function also satisfies

|σ(x)σ(y)|aσ|xy|,\displaystyle|\sigma(x)-\sigma(y)|\leq a_{\sigma}|x-y|,\,\, |σ(x)σ(y)|bσ|xy|\displaystyle|\sigma^{\prime}(x)-\sigma^{\prime}(y)|\leq b_{\sigma}|x-y|\,
and (y,f(𝐱))B,x,y,\displaystyle\ell(y,f(\mathbf{x}))\leq B,\,\forall\,x,y\in\mathbb{R},

where aσ,bσa_{\sigma},\,b_{\sigma} and BB are positive constants as well.

We now present an explicit stability bound for GCN with p\ell_{p}-regularizer via stochastic gradient method.

Theorem 1.

Suppose that the loss and activation functions are Lipschitz-continuous and smooth functions (Assumption 1). Then a single layer GCN model, trained by the proposed SGD given in (3.2)-(3.3) for iteration TT, is βn\beta_{n}-uniformly stable, precisely

βna2aσ2λGmaxηCp,λ𝐠ent=1T(Cp,λ(1+(aσ2+a)η𝐠e2))t1,\displaystyle\beta_{n}\leq a_{\ell}^{2}a_{\sigma}^{2}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1},

where Cp,λ:=28p(p1)λt(B/λ)(3p)/pC_{p,\lambda}:=\frac{28}{p(p-1)\lambda_{t}}\big{(}\nicefrac{{B}}{{\lambda}}\big{)}^{(3-p)/p} and 𝐠e:=sup𝐱jN𝐱ej𝐱j2\mathbf{g}_{e}:=\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}.

The proof procedure of Theorem 1 will be given in Appendix. The key step of our proof is that in such a scenario, the error caused by the difference in the nonconvex empirical risk of GCN grows polynomially with the number of iterations.

Remark 2.

Theorem 1 provides the uniform stability for the last iterate of SGD for p\ell_{p}-regularized GCN. It is worth mentioning that the upper bound of this stability depends on the graph structure (i.e. 𝐠e\mathbf{g}_{e} and λGmax\lambda_{G}^{max}) and the regularized hyper-parameter pp, as well as the sample size and the learning rate in SGD.

Remark 3.

More precisely, the result in Theorem 1 shows that the stability bound decreases inversely with the scale of pp. It increases as the graph structured parameter λGmax\lambda_{G}^{max} increases.

Substituting Lemma 3 into Theorem 1 above, we obtain the generalization bounds with uniform stability for a single layer GCN with p\ell_{p}-regularizer.

Theorem 2.

Under the same conditions as Theorem 1. with a high probability, the following generalization bound holds

𝔼𝒜[R(𝒜D)Rn(𝒜D)]\displaystyle\mathbb{E}_{\mathcal{A}}\big{[}R(\mathcal{A}_{D})-R_{n}(\mathcal{A}_{D})\big{]}
=𝒪(λGmaxηCp,λ𝐠ent=1T(Cp,λ(1+(aσ2+a)η𝐠e2))t1).\displaystyle=\mathcal{O}\Big{(}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{\sqrt{n}}{\sum}_{t=1}^{T}\big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\big{)}^{t-1}\Big{)}.
Remark 4.

Based on the result of Theorem 2, we conclude that p\ell_{p}-regularization for 1<p21<p\leq 2 generalizes. Note that, when p1p\rightarrow 1, the stability bound breaks due to the sparsity of 1\ell_{1}-regularization. The smaller pp becomes, the greater the stability parameter βn\beta_{n} is, but at the same time the obtained parameter 𝐰^\widehat{\mathbf{w}} in (2.6) tends to be sparse, shown in Koltchinskii (2009). These properties are also verified in our experimental evaluation.

Remark 5.

Although a small learning rate means a smaller generalization gap, the parameter range in which this SGD searches will be very small, resulting in a larger training error. Such conclusion is also applicable to various SGD for general models.

Dataset Graph Filter p=1.001{p=1.001} p=1.149{p=1.149} p=1.320{p=1.320} p=1.516{p=1.516} p=1.741{p=1.741} p=2{p=2}
Citeseer Augmented Normalized 57.44±0.87\mathbf{57.44\pm 0.87} 54.69±0.8254.69\pm 0.82 52.41±0.7352.41\pm 0.73 50.72±0.6750.72\pm 0.67 50.27±0.6850.27\pm 0.68 50.18±0.6950.18\pm 0.69
Normalized 57.02±1.32\mathbf{57.02\pm 1.32} 54.47±0.9954.47\pm 0.99 51.93±0.6651.93\pm 0.66 50.39±0.6250.39\pm 0.62 49.95±0.6249.95\pm 0.62 49.88±0.6349.88\pm 0.63
Random Walk 56.17±1.49\mathbf{56.17\pm 1.49} 53.84±1.2053.84\pm 1.20 51.84±0.8451.84\pm 0.84 50.54±0.7850.54\pm 0.78 50.16±0.8150.16\pm 0.81 50.13±0.850.13\pm 0.8
Unnormalized 60.62±2.25\mathbf{60.62\pm 2.25} 58.78±2.3158.78\pm 2.31 57.52±2.2357.52\pm 2.23 56.61±2.2556.61\pm 2.25 56.32±2.356.32\pm 2.3 56.45±2.256.45\pm 2.2
Cora Augmented Normalized 56.66±3.43\mathbf{56.66\pm 3.43} 54.46±2.6854.46\pm 2.68 52.18±2.1352.18\pm 2.13 50.79±1.8650.79\pm 1.86 50.41±1.7750.41\pm 1.77 50.27±1.7650.27\pm 1.76
Normalized 55.77±3.68\mathbf{55.77\pm 3.68} 53.74±2.7853.74\pm 2.78 51.6±2.1051.6\pm 2.10 50.22±1.8250.22\pm 1.82 49.77±1.8249.77\pm 1.82 49.69±1.8349.69\pm 1.83
Random Walk 53.43±2.12\mathbf{53.43\pm 2.12} 52.03±1.9452.03\pm 1.94 51.08±1.3751.08\pm 1.37 50.29±1.1850.29\pm 1.18 50.08±1.1850.08\pm 1.18 50.07±1.1750.07\pm 1.17
Unnormalized 64.94±3.61\mathbf{64.94\pm 3.61} 63.86±3.563.86\pm 3.5 63.31±3.5963.31\pm 3.59 62.79±3.5362.79\pm 3.53 62.87±3.5662.87\pm 3.56 62.82±3.562.82\pm 3.5
Pubmed Augmented Normalized 54.13±2.16\mathbf{54.13\pm 2.16} 52.61±2.6252.61\pm 2.62 50.83±1.9250.83\pm 1.92 49.73±1.6749.73\pm 1.67 49.24±1.6949.24\pm 1.69 49.06±1.6349.06\pm 1.63
Normalized 53.27±3.40\mathbf{53.27\pm 3.40} 52.48±2.3552.48\pm 2.35 51.28±2.0951.28\pm 2.09 50.29±1.9650.29\pm 1.96 49.56±1.7849.56\pm 1.78 49.30±1.5549.30\pm 1.55
Random Walk 54.72±3.57\mathbf{54.72\pm 3.57} 53.07±3.2053.07\pm 3.20 51.82±2.9451.82\pm 2.94 51.03±2.8451.03\pm 2.84 50.85±2.850.85\pm 2.8 50.77±2.7650.77\pm 2.76
Unnormalized 74.54±6.26\mathbf{74.54\pm 6.26} 74.10±6.4074.10\pm 6.40 73.29±6.373.29\pm 6.3 73.25±6.5773.25\pm 6.57 72.91±6.1272.91\pm 6.12 73.09±5.9673.09\pm 5.96
Table 1: The sparsity (%\%) of obtained parameter 𝐰^\widehat{\mathbf{w}} under different pp and normalization steps on three datasets.

5 Experiments

In this section, we conduct experiments to validate our theoretical findings. In particular, we show that there exists a stability-sparsity trade-off with varying pp, and the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter. To do it, we first introduce the experimental settings. Then we assess the uniform stability of GCN with semi-supervised learning tasks under varying pp. Finally, we evaluate the effect of different graph filters on the GCN stability bounds.

Refer to caption
Figure 2: Generalization Gaps under different normalization steps on three datasets. We observe that the normalized filters yield significantly lower generalization gaps than the unnormalized and random walk filters.

5.1 Experimental Setup

Datasets.

We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. (2008). In every dataset, each document is represented as spare bag-of-words feature vectors. The relationship between documents consists of a list of citation links, which can be treated as undirected edges and help construct the adjacency matrix. These documents can be divided into different classes and have the corresponding class label.

Baselines.

We implement several empirical experiments with a representative GCN model Kipf and Welling (2016a). For all datasets, we use 2-layer neural networks with 16 hidden units. In all cases, we evaluate the difference between the learned weight parameters and the generalization gap of two GCN models trained on datasets DD and DiD^{i}, respectively. Specifically, we generate DiD^{i} by choosing a random data point in the training set DD and altering it with a different random point. We also record the training and testing errors gap and the parameter distance 𝐰^𝐰^2/(𝐰^2+𝐰^2),\sqrt{\|\widehat{\mathbf{w}}-\widehat{\mathbf{w}}^{{}^{\prime}}\|^{2}/(\|\widehat{\mathbf{w}}\|^{2}+\|\widehat{\mathbf{w}}^{{}^{\prime}}\|^{2})}, where 𝐰^\widehat{\mathbf{w}} and 𝐰^\widehat{\mathbf{w}}^{{}^{\prime}} are the weight parameters of the respective models per epoch.

Training settings.

For each experiment, we initialize the parameters of GCN models with the same random seeds and then train all models for a maximum of 200 epochs using the proposed Inexact Proximal SGD. We repeat the experiments 10 times and report the average performance as well as the standard variance. For all methods, the hyperparameters are tuned from the following search space: (1) learning rate: {1, 0.5, 0.1, 0.05}; (2) weight decay: 0; (3) dropout rate: {0.3,0.5}\{0.3,0.5\}; (4) regularization parameter λ\lambda is set to 0.001.

5.2 The Effect of Varying pp

Generalization gap.

In this experiment, we empirically compare the effect of pp on the GCN stability bounds using different p{1.001,1.149,1.32,1.516,1.741,2}p\in\{1.001,1.149,1.32,1.516,1.741,2\}. We quantitatively measure the generalization gap defined as the absolute difference between the training and testing errors and the difference between learned weights parameters of GCN models trained on DD and DD^{\prime}, respectively. It can be observed that in Figure 1, these empirical observations are in line with our stability bounds (see also results in Figure 3 in Appendix).

Sparsity.

Besides the convergence, we have a particular concern about the sparsity of the solutions. The sparsity ratios for lpl_{p}-based method are summarized in Table 1. Observe that p\ell_{p}-regularized learning with p1p\rightarrow 1 identifies most of the sparsity pattern but behaves much worse in generalization.

5.3 The Effect of Graph Filters

Different normalizations steps.

In this experiment, we mainly consider the implication of our results in following designing graph convolution filters: (1) Unnormalized Graph Filters: g(𝐋)=𝐀+𝐈g(\mathbf{L})=\mathbf{A}+\mathbf{I}; (2) Normalized Graph Filters: g(𝐋)=𝐃1/2𝐀𝐃1/2+𝐈g(\mathbf{L})=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}+\mathbf{I}; (3) Random Walk Graph Filters: g(𝐋)=𝐃1𝐀+𝐈g(\mathbf{L})=\mathbf{D}^{-1}\mathbf{A}+\mathbf{I}; (4) Augmented Normalized Graph Filters: g(𝐋)=(𝐃+𝐈)1/2(𝐀+𝐈)(𝐃+𝐈)1/2g(\mathbf{L})=(\mathbf{D}+\mathbf{I})^{-1/2}(\mathbf{A}+\mathbf{I})(\mathbf{D}+\mathbf{I})^{-1/2}.

In this experiment, we quantitatively measure the generalization gap and parameter distance per epoch. From Figure 2, it is clear that the Unnormalized Graph Filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets. Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap and parameter distance become constant after a certain number of iterations. More results can be found in the supplementary material.

6 Conclusion

In this paper, we present an explicit theoretical understanding of stochastic learning for GCN with the lpl_{p} regularizer and analyze the stability of our regularized stochastic algorithm. In particular, our derived bounds show that the uniform stability of our GCN depends on the largest absolute eigenvalue of its graph filter, and there exists a generalization-sparsity tradeoff with varying pp. It is worth noting that previous generalization analysis based on stability notation assumed that objectiveness is a second derivative function, which is no longer applicable to our lpl_{p}-regularized learning scheme. To address this issue, we propose a new proximal SGD algorithm for GCNs with an inexact operator, which exhibits comparable empirical performances.

Acknowledgements

Lv’s research is supported by the “QingLan” Project of Jiangsu Universities. Ming Li would like to thank the support from the National Natural Science Foundation of China (No. U21A20473, No. 62172370) and Zhejiang Provincial Natural Science Foundation (No. LY22F020004).

Contribution Statement

Linsen We and Ming Li contributed equally to this work.

References

  • Barrett et al. [2018] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 511–520. PMLR, 2018.
  • Battaglia et al. [2016] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.
  • Bottou and Bousquet [2007] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2007.
  • Bousquet and Elisseeff [2002] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • Bruna et al. [2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
  • Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, volume 29, pages 3844–3852, 2016.
  • Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, volume 28, pages 2224–2232, 2015.
  • Garg et al. [2020] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.
  • Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
  • Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035, 2017.
  • Hardt et al. [2016] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
  • Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pages 2323–2332. PMLR, 2018.
  • Kearns and Ron [1999] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
  • Kipf and Welling [2016a] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Kipf and Welling [2016b] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • Koltchinskii [2009] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. In Annales de l’IHP Probabilités et statistiques, volume 45, pages 7–57, 2009.
  • Liao et al. [2020] Renjie Liao, Raquel Urtasun, and Richard Zemel. A pac-bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2020.
  • Liu et al. [2021] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, pages 6837–6849. PMLR, 2021.
  • Ma et al. [2021] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
  • Nie et al. [2011] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Unsupervised and semi-supervised learning via 𝓁\mathscr{l} 1-norm graph. In 2011 International Conference on Computer Vision, pages 2268–2273. IEEE, 2011.
  • Oono and Suzuki [2020] Kenta Oono and Taiji Suzuki. Optimization and generalization analysis of transduction through gradient boosting and application to multi-scale graph neural networks. Advances in Neural Information Processing Systems, 33:18917–18930, 2020.
  • Schlichtkrull et al. [2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Shalev-Shwartz et al. [2010] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
  • Tibshirani [2014] Ryan J Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
  • Veličković et al. [2017] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • Verma and Zhang [2019] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.
  • Wang et al. [2015] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan Tibshirani. Trend filtering on graphs. In Artificial Intelligence and Statistics, pages 1042–1050. PMLR, 2015.
  • Wibisono et al. [2009] Andre Wibisono, Lorenzo Rosasco, and Tomaso Poggio. Sufficient conditions for uniform stability of regularization algorithms. Computer Science and Artificial Intelligence Laboratory Technical Report, MIT-CSAIL-TR-2009-060, 2009.
  • Xu et al. [2011] Huan Xu, Constantine Caramanis, and Shie Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on Pattern Analysis and Machine Intelligence, 34(1):187–193, 2011.
  • Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
  • Yun et al. [2019] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. Advances in Neural Information Processing Systems, 32:11983–11993, 2019.
  • Zhang et al. [2020] Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, and Le Song. Efficient probabilistic logic reasoning with graph neural networks. arXiv preprint arXiv:2001.11850, 2020.

The appendices are structured as follows. In Appendix A.1, we provide the proof procedure of Theorem 1. The details of proof for Lemma 2 is sketched in In Appendix B. In Appendix C, we provide additional experiments and detailed setups.

Appendix A Proof of Theorem

A.1 Proof Sketch

We describe the main procedures by training a GCN using SGD on datasets DD and DiD^{i} which differ in one data point. Let Z={𝐳1,,𝐳t,,𝐳T}Z=\{\mathbf{z}_{1},...,\mathbf{z}_{t},...,\mathbf{z}_{T}\} be a sequence of samples, where 𝐳t\mathbf{z}_{t} is an i.i.d. sample drawn from DD at the tt-th iteration of SGD during a training run of the GCN. Training the same GCN using SGD on DiD^{i} means that we endow the same sequence to the GCN except that if 𝐳t=(𝐱i,yi)\mathbf{z}_{t}=(\mathbf{x}_{i},y_{i}) occurs for some tt, we replace it with 𝐳t=(𝐱i,yi)\mathbf{z}_{t}^{\prime}=(\mathbf{x}_{i}^{\prime},y^{\prime}_{i}). We denote this sample sequence by ZZ^{\prime}. Let {𝐰D,0,𝐰D,1,,𝐰D,T}\{\mathbf{w}_{D,0},\mathbf{w}_{D,1},...,\mathbf{w}_{D,T}\} and {𝐰Di,0,𝐰Di,1,,𝐰Di,T}\{\mathbf{w}_{D^{i},0},\mathbf{w}_{D^{i},1},...,\mathbf{w}_{D^{i},T}\} be the corresponding sequences of the weight parameters learning by running SGD on DD and DiD^{i}, respectively.

Given two sequences of the weighted parameters, {𝐰D,0,𝐰D,1,,𝐰D,T}\{\mathbf{w}_{D,0},\mathbf{w}_{D,1},...,\mathbf{w}_{D,T}\} and {𝐰Di,0,𝐰Di,1,,𝐰Di,T}\{\mathbf{w}_{D^{i},0},\mathbf{w}_{D^{i},1},...,\mathbf{w}_{D^{i},T}\}. At each iteration, we define the difference of these two sequences by

Δ𝐰t:=𝐰Di,t𝐰D,t.\Delta\mathbf{w}_{t}:=\mathbf{w}_{D^{i},t}-\mathbf{w}_{D,t}.

We now relate Δ𝐰t\Delta\mathbf{w}_{t} to the stability parameter βn\beta_{n} given in Definition 1. Since the loss is Lipschitz continuous by Assumption 1, for any 𝐳=(𝐱,y)\mathbf{z}=(\mathbf{x},y) we have

|𝔼𝒜[(y,f(𝐱,𝐰D))]𝔼𝒜[(y,f(𝐱,𝐰Di))]|\displaystyle\big{|}\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D}))]-\mathbb{E}_{\mathcal{A}}[\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i}}))]\big{|}
\displaystyle\leq a𝔼𝒜[|f(𝐱,𝐰D)f(𝐱,𝐰Di)|]\displaystyle a_{\ell}\mathbb{E}_{\mathcal{A}}[|f(\mathbf{x},\mathbf{w}_{D})-f(\mathbf{x},\mathbf{w}_{D^{i}})|]
\displaystyle\leq aaσ𝔼𝒜[|jN𝐱ej𝐱jT(𝐰D𝐰D)|]\displaystyle a_{\ell}a_{\sigma}\mathbb{E}_{\mathcal{A}}\Big{[}\Big{|}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D}-\mathbf{w}_{D^{\prime}})\Big{|}\Big{]}
\displaystyle\leq aaσ𝔼𝒜[Δ𝐰2]jN𝐱ej𝐱j2,\displaystyle a_{\ell}a_{\sigma}\mathbb{E}_{\mathcal{A}}[\|\Delta\mathbf{w}\|_{2}]\cdot\big{\|}{\sum}_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}, (A.1)

where the second inequality follows from the Lipschitz property of σ()\sigma(\cdot), and the last one is based on the Cauchy-Schwarz inequality.

As proved in Verma and Zhang [2019], 𝐠e=jN𝐱ej𝐱j2\mathbf{g}_{e}=\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2} can be bounded in terms of the largest absolute eigenvalue of the graph convolution filter g(𝐋)g(\mathbf{L}). Thus, to provide an upper bound of βn\beta_{n} in uniform stability, the key point is to bound 𝔼𝒜[Δ𝐰2]\mathbb{E}_{\mathcal{A}}[\|\Delta\mathbf{w}\|_{2}] involved in (A.1).

A.2 Bounding Δ𝐰t+12\|\Delta\mathbf{w}_{t+1}\|_{2}

This subsection is devoted to bounding Δ𝐰t+12\|\Delta\mathbf{w}_{t+1}\|_{2} due to the randomness inherent in SGD. This will be proved though a series of five lemmas. Different from existing related work Verma and Zhang [2019]; Hardt et al. [2016], the strong convexity of |x|p|x|^{p} over bounded domains plays an important role in our desired results. It should be pointed out that, the first derivative of our objective function is not Lipschitz, the standard technique for proving the stability of SGD Hardt et al. [2016] is no longer applicable to our setting.

Proposition 1.

Suppose that the loss and activation functions are Lipschitz-continuous and smooth functions (Assumption 1). Then a single layer GCN model, trained by the proposed algorithm (3.2)-(3.3) at step tt, satisfies

Δ𝐰t+12Cp,λ(Δ𝐰t2+η(yt,f(𝐱t,𝐰D,t))(yt,f(𝐱t,𝐰Di,t))2).\displaystyle\|\Delta\mathbf{w}_{t+1}\|_{2}\leq C_{p,\lambda}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}. (A.2)

At step tt, 𝐳t=(𝐱t,yt)\mathbf{z}_{t}=(\mathbf{x}_{t},y_{t}) is a drawn sample from DD and 𝐳t=(𝐱t,yt)\mathbf{z}_{t}^{\prime}=(\mathbf{x}_{t}^{\prime},y_{t}^{\prime}) is a drawn sample from DiD^{i}. Here Cp,λC_{p,\lambda} is defined as Theorem 1.

Proof of Proposition 1.

For every a,ba,b\in\mathbb{R}, we know the following equality from (4) in Wibisono et al. [2009] that

|a|p+|b|p2|a+b2|p=14(ba)2p(p1)|c|p2,\displaystyle|a|^{p}+|b|^{p}-2\big{|}\frac{a+b}{2}\big{|}^{p}=\frac{1}{4}(b-a)^{2}p(p-1)|c|^{p-2}, (A.3)

where min{a,b}cmax{a,b}\min\{a,b\}\leq c\leq\max\{a,b\} and the choice of cc in our case will be given later. The equality (A.3) shows that |x|p|x|^{p} (1<p2)(1<p\leq 2) is strongly convex over bounded domains, even though its first derivative is not Lipschitz smooth.

In particular, we can set aa and bb to be the jj-th components 𝐰D,t+1j\mathbf{w}^{j}_{D,t+1} of 𝐰D,t+1\mathbf{w}_{D,t+1} and 𝐰Di,t+1j\mathbf{w}^{j}_{D^{i},t+1} of 𝐰Di,t+1\mathbf{w}_{D^{i},t+1} respectively. Then we obtain

|𝐰D,t+1j|p+|𝐰Di,t+1j|p2|𝐰D,t+1j+𝐰Di,t+1j2|p=14(𝐰D,t+1j𝐰Di,t+1j)2p(p1)|cj|p2,\displaystyle|\mathbf{w}^{j}_{D,t+1}|^{p}+|\mathbf{w}^{j}_{D^{i},t+1}|^{p}-2\Big{|}\frac{\mathbf{w}^{j}_{D,t+1}+\mathbf{w}^{j}_{D^{i},t+1}}{2}\Big{|}^{p}=\frac{1}{4}(\mathbf{w}^{j}_{D,t+1}-\mathbf{w}^{j}_{D^{i},t+1})^{2}p(p-1)|c_{j}|^{p-2}, (A.4)

where min{𝐰D,t+1j,𝐰Di,t+1j}cjmax{𝐰D,t+1j,𝐰Di,t+1j}\min\{\mathbf{w}^{j}_{D,t+1},\mathbf{w}^{j}_{D^{i},t+1}\}\leq c_{j}\leq\max\{\mathbf{w}^{j}_{D,t+1},\mathbf{w}^{j}_{D^{i},t+1}\}. As stated in (3.5), we can derive a bound cjc_{j} as follows

|cj|max{|𝐰D,t+1j|,|𝐰Di,t+1j|}(Bλ)1/p.|c_{j}|\leq\max\{|\mathbf{w}^{j}_{D,t+1}|,|\mathbf{w}^{j}_{D^{i},t+1}|\}\leq\Big{(}\frac{B}{\lambda}\Big{)}^{1/p}.

Furthermore, since 1<p21<p\leq 2, this follows that

|cj|p2(Bλ)(p2)/p,j[d].|c_{j}|^{p-2}\geq\Big{(}\frac{B}{\lambda}\Big{)}^{(p-2)/p},\quad\forall\,j\in[d].

Substituting the bound on cjc_{j} to (A.4) and then summing over j[d]j\in[d], we obtain

𝐰D,t+1pp+𝐰Di,t+1pp2𝐰D,t+1+𝐰Di,t+12pp\displaystyle\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}}
14p(p1)(Bλ)(p2)/p𝐰D,t+1𝐰Di,t+122.\displaystyle\geq\frac{1}{4}p(p-1)\Big{(}\frac{B}{\lambda}\Big{)}^{(p-2)/p}\|\mathbf{w}_{D,t+1}-\mathbf{w}_{D^{i},t+1}\|_{\ell_{2}}^{2}. (A.5)

Seen from (A.2), it suffices to bound the term

𝐰D,t+1pp+𝐰Di,t+1pp2𝐰D,t+1+𝐰Di,t+12pp,\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}},

which is completed by the following lemma. ∎

Lemma 4.

For any θ[0,1]\theta\in[0,1], we have

λt(𝐰D,t+1pp𝐰D,t+1+θΔ𝐰t+1pp+𝐰Di,t+1pp𝐰Di,t+1θΔ𝐰t+1pp)\displaystyle\lambda_{t}\big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{)}
7(B/λ)1/p(Δ𝐰t2+η(yt,f(𝐱t,𝐰D,t))(yt,f(𝐱t,𝐰Di,t))2).\displaystyle\leq 7(B/\lambda)^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}.
Proof.

The proof for Lemma 4 is provided in Appendix B.2. ∎

Now we proceed with the proof of Proposition 1. Applying Lemma 4 with θ=12\theta=\frac{1}{2} yields that

λt(𝐰D,t+1pp+𝐰Di,t+1pp2𝐰D,t+1+𝐰Di,t+12pp)\displaystyle\lambda_{t}\Big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-2\Big{\|}\frac{\mathbf{w}_{D,t+1}+\mathbf{w}_{D^{i},t+1}}{2}\Big{\|}^{p}_{\ell_{p}}\Big{)}
7(Bλ)1/p(Δ𝐰t2+η(yt,f(𝐱t,𝐰D,t))(yt,f(𝐱t,𝐰Di,t))2).\displaystyle\leq 7\big{(}\frac{B}{\lambda}\big{)}^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}.

Combining the inequality above with (A.2), we get

λt28p(p1)(Bλ)(p3)/p𝐰D,t+1𝐰Di,t+122Δ𝐰t2+η(yt,f(𝐱t,𝐰D,t))(yt,f(𝐱t,𝐰Di,t))2,\displaystyle\frac{\lambda_{t}}{28}p(p-1)\big{(}\frac{B}{\lambda}\big{)}^{(p-3)/p}\|\mathbf{w}_{D,t+1}-\mathbf{w}_{D^{i},t+1}\|_{\ell_{2}}^{2}\leq\|\Delta\mathbf{w}_{t}\|_{2}+\eta\big{\|}\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\big{\|}_{2},

which immediately yields our desired result in Proposition 1.

Since 𝐳t\mathbf{z}_{t} and 𝐳t\mathbf{z}_{t}^{\prime} are two random samples from different datasets (DD and DiD^{i}) respectively, we need to consider one of the following two scenarios, which must occur at iteration tt.

(Scenario 1) At step tt, SGD picks a sample 𝐳t=𝐳t=(𝐱,y)\mathbf{z}_{t}=\mathbf{z}_{t}^{\prime}=(\mathbf{x},y) which is identical in ZZ and ZZ^{\prime}, occurring with probability (n1)/n(n-1)/n.

(Scenario 2) At step tt, SGD picks the only samples that ZZ and ZZ^{\prime} differs, e.g. 𝐳t𝐳t\mathbf{z}_{t}\neq\mathbf{z}_{t}^{\prime}, which occurs with probability 1/n1/n.

Consider the Scenario 11 in Lemma 5 below. In this case, we will show that (y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))2\|\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\|_{2} can be bounded by the terms Δ𝐰t2\|\Delta\mathbf{w}_{t}\|_{2} and 𝐠e2\mathbf{g}_{e}^{2}.

Lemma 5.

Under Assumption 1 for Scenario 1. There holds

(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))2(aσ2+a)jN𝐱ej𝐱j22Δ𝐰t2.\displaystyle\big{\|}\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\big{\|}_{2}\leq(a_{\sigma}^{2}+a_{\ell})\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}^{2}\|\Delta\mathbf{w}_{t}\|_{2}.
Proof.

The proof for Lemma 5 is provided in Appendix B.3. ∎

Unlike Scenario 1 given in Lemma 5, the following lemma gives a more rough bound of the gradient difference under Scenario 2.

Lemma 6.

Under Assumption 1 for Scenario 2. There holds

(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))22aaσsup𝐱jN𝐱ej𝐱j2.\displaystyle\big{\|}\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\big{\|}_{2}\leq 2a_{\ell}a_{\sigma}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}.
Proof.

The proof for Lemma 6 is also provided in Appendix B.4. ∎

Substituting the results derived in Lemma 5 and Lemma 6 to Proposition 1, and taking expectation over all possible sample sequences 𝐳,𝐳\mathbf{z},\mathbf{z}^{\prime} from DD and DiD^{i}, we have the following lemma.

Lemma 7.

Suppose that the loss and activation functions are Lipschitz-continuous and smooth functions (Assumption 1). Then a single layer GCN model, trained by the proposed algorithm (3.2)-(3.3) at step tt, satisfies

𝔼SGD[Δ𝐰T2]2aaσηCp,λ𝐠ent=1T(Cp,λ(1+(aσ2+a)η𝐠e2))t1.\displaystyle\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{T}\|_{2}\big{]}\leq\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1}.
Proof of Lemma 7..

From (A.2) in Proposition 1, this together with the derived results in Lemma 5 and Lemma 6 yields that, 𝔼SGD[Δ𝐰t+12]\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t+1}\|_{2}\big{]} can be upper bounded by

Cp,λ(𝔼SGD[Δ𝐰t2]+𝔼SGD[η(yt,f(𝐱t,𝐰D,t))(yt,f(𝐱t,𝐰Di,t))2])\displaystyle C_{p,\lambda}\Big{(}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+\mathbb{E}_{SGD}\big{[}\eta\|\nabla\ell(y_{t},f(\mathbf{x}_{t},\mathbf{w}_{D,t}))-\nabla\ell(y_{t}^{\prime},f(\mathbf{x}_{t}^{\prime},\mathbf{w}_{D^{i},t}))\|_{2}\big{]}\Big{)}
Cp,λ𝔼SGD[Δ𝐰t2]+2aaσCp,ληnsup𝐱jN𝐱ej𝐱j2\displaystyle\leq C_{p,\lambda}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+2a_{\ell}a_{\sigma}C_{p,\lambda}\frac{\eta}{n}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}
+(aσ2+a)Cp,λη(11n)jN𝐱ej𝐱j22𝔼SGD[Δ𝐰t2]\displaystyle\quad+(a_{\sigma}^{2}+a_{\ell})C_{p,\lambda}\eta\big{(}1-\frac{1}{n}\big{)}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}^{2}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}
=Cp,λ(1+(aσ2+a)η(11n)𝐠e2)𝔼SGD[Δ𝐰t2]+2aaσηCp,λ𝐠en.\displaystyle=C_{p,\lambda}\Big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\big{(}1-\frac{1}{n}\big{)}\mathbf{g}_{e}^{2}\Big{)}\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{t}\|_{2}\big{]}+\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}.

Finally, an application of the recursion for the above inequality leads to

𝔼SGD[Δ𝐰T2]2aaσηCp,λ𝐠ent=1T(Cp,λ(1+(aσ2+a)η𝐠e2))t1,\displaystyle\mathbb{E}_{SGD}\big{[}\|\Delta\mathbf{w}_{T}\|_{2}\big{]}\leq\frac{2a_{\ell}a_{\sigma}\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1},

where we use the fact that the parameter initialization is kept same for DD and DiD^{i}, e.g., 𝐰D,0=𝐰Di,0\mathbf{w}_{D,0}=\mathbf{w}_{D^{i},0}. Thus, the proof for Lemma 7 is completed. ∎

Appendix B Proof for Lemmas

B.1 Deferred Proof for Lemma 2.

Proof for Lemma 2..

Since the regularization term 𝐰pp\|\mathbf{w}\|_{\ell_{p}}^{p} is continuously differential and specially we get (|wj|p)=psign(wj)|wj|p1\partial(|w_{j}|^{p})=p\cdot\hbox{sign}(w_{j})\cdot|w_{j}|^{p-1} , where 𝐰=(w1,w2,,wd)\mathbf{w}=(w_{1},w_{2},...,w_{d}). Note that all the terms in (3.4) are separable on wjw_{j}’s, and using the first-order condition for (3.4) leads to

wjvj+λpsign(wj)|wj|p1=0,j=1,2,d.\displaystyle w_{j}^{*}-v_{j}+\lambda p\cdot\hbox{sign}(w_{j}^{*})\cdot|w_{j}^{*}|^{p-1}=0,\quad j=1,2...,d. (B.1)

If wj>0w^{*}_{j}>0, this follows from (B.1) that

wj+λp|wj|p1=vj,w_{j}^{*}+\lambda p\cdot|w_{j}^{*}|^{p-1}=v_{j},

implying that

0<wjmin{vj,(vj/(λp))1/(p1)}.0<w^{*}_{j}\leq\min\{v_{j},(v_{j}/(\lambda p))^{1/(p-1)}\}.

Otherwise, we obtain from (B.1) that

|wj|λp|wj|p1=vj,-|w_{j}^{*}|-\lambda p\cdot|w_{j}^{*}|^{p-1}=v_{j},

implying that

|wj|min{|vj|,(|vj|/(λp))1/(p1)}.|w_{j}^{*}|\leq\min\{|v_{j}|,(|v_{j}|/(\lambda p))^{1/(p-1)}\}.

Summing the previous two conclusions leads to our desired result. ∎

B.2 Deferred Proof for Lemma 4.

Proof for Lemma 4..

Let us introduce the notation:

RD(𝐰):=12𝐰𝐯D,it22,R_{D}(\mathbf{w}):=\frac{1}{2}\|\mathbf{w}-\mathbf{v}_{D,i_{t}}\|_{2}^{2},

and similarly

RDi(𝐰):=12𝐰𝐯Di,it22.R_{D^{i}}(\mathbf{w}):=\frac{1}{2}\|\mathbf{w}-\mathbf{v}_{D^{i},i_{t}}\|_{2}^{2}.

Recall that a convex function gg admits the following inequality:

g(𝐱+θ(𝐮𝐱))g(𝐱)θ(g(𝐮)g(𝐱)),𝐱,𝐮.g(\mathbf{x}+\theta(\mathbf{u}-\mathbf{x}))-g(\mathbf{x})\leq\theta(g(\mathbf{u})-g(\mathbf{x})),\quad\forall\,\mathbf{x},\mathbf{u}.

The convexity of RD(𝐰)R_{D}(\mathbf{w}) and RDi(𝐰)R_{D^{i}}(\mathbf{w}) immediately leads to

RDi(𝐰D,t+1+θΔ𝐰t+1)RDi(𝐰D,t+1)θ(RDi(𝐰Di,t+1)RDi(𝐰D,t+1)).\displaystyle R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})\leq\theta\big{(}R_{D^{i}}(\mathbf{w}_{D^{i},t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})\big{)}.

Then, switching the role of 𝐰D,t+1\mathbf{w}_{D,t+1} and 𝐰Di,t+1\mathbf{w}_{D^{i},t+1}, we get

RDi(𝐰Di,t+1θΔ𝐰t+1)RDi(𝐰Di,t+1)θ(RDi(𝐰D,t+1)RDi(𝐰Di,t+1)).\displaystyle R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\leq\theta\big{(}R_{D^{i}}(\mathbf{w}_{D,t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\big{)}.

Summing the two previous inequalities yields

RDi(𝐰D,t+1+θΔ𝐰t+1)RDi(𝐰D,t+1)+RDi(𝐰Di,t+1θΔ𝐰t+1)RDi(𝐰Di,t+1)0.\displaystyle R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1})+R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D^{i},t+1})\leq 0. (B.2)

Now, by the definitions of 𝐰D,t+1\mathbf{w}_{D,t+1} and 𝐰Di,t+1\mathbf{w}_{D^{i},t+1}, we have

RD(𝐰D,t+1)+λt𝐰D,t+1pp[RD(𝐰D,t+1+θΔ𝐰t+1)+λt𝐰D,t+1+θΔ𝐰t+1pp]0,\displaystyle R_{D}(\mathbf{w}_{D,t+1})+\lambda_{t}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\big{[}R_{D}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})+\lambda_{t}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{]}\leq 0,
RDi(𝐰Di,t+1)+λt𝐰Di,t+1pp[RDi(𝐰Di,t+1θΔ𝐰t+1)+λt𝐰Di,t+1θΔ𝐰t+1pp]0.\displaystyle R_{D^{i}}(\mathbf{w}_{D^{i},t+1})+\lambda_{t}\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\big{[}R_{D^{i}}(\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1})+\lambda_{t}\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{]}\leq 0.

Then, summing the two previous inequalities and using (B.2) and (3.5), we obtain

λt(𝐰D,t+1pp𝐰D,t+1+θΔ𝐰t+1pp+𝐰Di,t+1pp𝐰Di,t+1θΔ𝐰t+1pp)\displaystyle\lambda_{t}\big{(}\|\mathbf{w}_{D,t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}+\|\mathbf{w}_{D^{i},t+1}\|_{\ell_{p}}^{p}-\|\mathbf{w}_{D^{i},t+1}-\theta\Delta\mathbf{w}_{t+1}\|_{\ell_{p}}^{p}\big{)}
RD(𝐰D,t+1+θΔ𝐰t+1)RDi(𝐰D,t+1+θΔ𝐰t+1)+RDi(𝐰D,t+1)RD(𝐰D,t+1)\displaystyle\leq R_{D}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})-R_{D^{i}}(\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1})+R_{D^{i}}(\mathbf{w}_{D,t+1})-R_{D}(\mathbf{w}_{D,t+1})
(𝐰D,t+1+θΔ𝐰t+1𝐯D,it2+𝐰D,t+1𝐯Di,it2)𝐯D,it𝐯Di,it2\displaystyle\leq\big{(}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}-\mathbf{v}_{D,i_{t}}\|_{2}+\|\mathbf{w}_{D,t+1}-\mathbf{v}_{D^{i},i_{t}}\|_{2}\big{)}\|\mathbf{v}_{D,i_{t}}-\mathbf{v}_{D^{i},i_{t}}\|_{2}
(𝐰D,t+1+θΔ𝐰t+1𝐯D,it2+𝐰D,t+1𝐯Di,it2)\displaystyle\leq\big{(}\|\mathbf{w}_{D,t+1}+\theta\Delta\mathbf{w}_{t+1}-\mathbf{v}_{D,i_{t}}\|_{2}+\|\mathbf{w}_{D,t+1}-\mathbf{v}_{D^{i},i_{t}}\|_{2}\big{)}
×(Δ𝐰t2+η(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))2)\displaystyle\quad\times\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}
7(B/λ)1/p(Δ𝐰t2+η(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))2),\displaystyle\leq 7(B/\lambda)^{1/p}\big{(}\|\Delta\mathbf{w}_{t}\|_{2}+\eta\|\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\|_{2}\big{)}, (B.3)

where we use the well-known property that projection onto a convex set 𝒞\mathcal{C} is non-expansive, that is, for any two points 𝐱,𝐮\mathbf{x},\mathbf{u}, Π𝒞(𝐱)Π𝒞(𝐮)2𝐱𝐮2\|\Pi_{\mathcal{C}}(\mathbf{x})-\Pi_{\mathcal{C}}(\mathbf{u})\|_{2}\leq\|\mathbf{x}-\mathbf{u}\|_{2}. This completes the proof of Lemma 4. ∎

B.3 Deferred Proof for Lemma 5.

Proof for Lemma 5..

Recall that, the predictor function is defined as f(𝐱,𝐰)=σ(jN𝐱ej𝐱jT𝐰)f(\mathbf{x},\mathbf{w})=\sigma\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)}. The gradient of f(𝐱,𝐰)f(\mathbf{x},\mathbf{w}) can be computed by

f(𝐱,𝐰)=σ(jN𝐱ej𝐱jT𝐰)jN𝐱ej𝐱j.\nabla f(\mathbf{x},\mathbf{w})=\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}\big{)}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}.

Using the chain-rule of compositional functions, we have

(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))\displaystyle\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))
=(y,f(𝐱,𝐰D,t))f(𝐱,𝐰D,t)(y,f(𝐱,𝐰Di,t))f(𝐱,𝐰Di,t)\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\nabla f(\mathbf{x},\mathbf{w}_{D,t})-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\nabla f(\mathbf{x},\mathbf{w}_{D^{i},t})
=[(y,f(𝐱,𝐰D,t))σ(jN𝐱ej𝐱jT𝐰D,t)(y,f(𝐱,𝐰Di,t))σ(jN𝐱ej𝐱jT𝐰Di,t)]×(jN𝐱ej𝐱j).\displaystyle=\Big{[}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{]}\times\Big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\Big{)}. (B.4)

Under Assumption 1, an application of the triangle inequality leads to

|(y,f(𝐱,𝐰D,t))σ(jN𝐱ej𝐱jT𝐰D,t)(y,f(𝐱,𝐰Di,t))σ(jN𝐱ej𝐱jT𝐰Di,t)|\displaystyle\Big{|}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{|}
|[(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))]σ(jN𝐱ej𝐱jT𝐰D,t)|\displaystyle\leq\Big{|}\big{[}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\big{]}\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}\Big{|}
+|[σ(jN𝐱ej𝐱jT𝐰D,t)σ(jN𝐱ej𝐱jT𝐰Di,t)](y,f(𝐱,𝐰Di,t))|\displaystyle\quad+\Big{|}\Big{[}\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}-\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{]}\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D^{i},t}))\Big{|}
aσ|f(𝐱,𝐰D,t)f(𝐱,𝐰Di,t)|+a|jN𝐱ej𝐱jT(𝐰D,t𝐰Di,t)|\displaystyle\leq a_{\sigma}\big{|}f(\mathbf{x},\mathbf{w}_{D,t})-f(\mathbf{x},\mathbf{w}_{D^{i},t})\big{|}+a_{\ell}\Big{|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D,t}-\mathbf{w}_{D^{i},t})\Big{|}
(aσ2+a)|jN𝐱ej𝐱jT(𝐰D,t𝐰Di,t)|\displaystyle\leq(a_{\sigma}^{2}+a_{\ell})\Big{|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}(\mathbf{w}_{D,t}-\mathbf{w}_{D^{i},t})\Big{|}
(aσ2+a)jN𝐱ej𝐱j2Δ𝐰t2.\displaystyle\leq(a_{\sigma}^{2}+a_{\ell})\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}\|\Delta\mathbf{w}_{t}\|_{2}.

This together with (B.3) yields the desired result. ∎

B.4 Deferred Proof for Lemma 6.

Proof for Lemma 6..

Again using the chain-rule of compositional functions, we have

(y,f(𝐱,𝐰D,t))(y,f(𝐱,𝐰Di,t))\displaystyle\nabla\ell(y,f(\mathbf{x},\mathbf{w}_{D,t}))-\nabla\ell(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))
=(y,f(𝐱,𝐰D,t))f(𝐱,𝐰D,t)(y,f(𝐱,𝐰Di,t))f(𝐱,𝐰Di,t)\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\nabla f(\mathbf{x},\mathbf{w}_{D,t})-\ell^{\prime}(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\nabla f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t})
=(y,f(𝐱,𝐰D,t))σ(jN𝐱ej𝐱jT𝐰D,t)(jN𝐱ej𝐱j)(y,f(𝐱,𝐰Di,t))σ(jN𝐱ej(𝐱)jT𝐰Di,t)(jN𝐱ej𝐱j)\displaystyle=\ell^{\prime}(y,f(\mathbf{x},\mathbf{w}_{D,t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}^{T}\mathbf{w}_{D,t}\big{)}\Big{(}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\Big{)}-\ell^{\prime}(y^{\prime},f(\mathbf{x}^{\prime},\mathbf{w}_{D^{i},t}))\sigma^{\prime}\big{(}\sum_{j\in N_{\mathbf{x}^{\prime}}}e_{\cdot j}(\mathbf{x}^{\prime})_{j}^{T}\mathbf{w}_{D^{i},t}\big{)}\Big{(}\sum_{j\in N_{\mathbf{x}}^{\prime}}e_{\cdot j}\mathbf{x}^{\prime}_{j}\Big{)}
2aaσsup𝐱jN𝐱ej𝐱j2,\displaystyle\leq 2a_{\ell}a_{\sigma}\sup_{\mathbf{x}}\big{\|}\sum_{j\in N_{\mathbf{x}}}e_{\cdot j}\mathbf{x}_{j}\big{\|}_{2}, (B.5)

where we use the fact from Assumption 1 that the first order derivatives of \ell and σ\sigma are both bounded. This completes the proof of Lemma 6. ∎

B.5 Additional Lemma

Lemma 8.

Let G=(V,E)G=(V,E) be an undirected graph with a weighted adjacency matrix g(𝐋)g(\mathbf{L}) and λGmax\lambda^{max}_{G} be the maximum absolute eigenvalue of g(𝐋)g(\mathbf{L}). Let G𝐱G_{\mathbf{x}} be the ego-graph of a node 𝐱V\mathbf{x}\in V with corresponding maximum absolute eigenvalue λG𝐱max\lambda^{max}_{G_{\mathbf{x}}}. Then the following eigenvalue bound holds for all 𝐱\mathbf{x},

λG𝐱maxλGmax.\lambda^{max}_{G_{\mathbf{x}}}\leq\lambda^{max}_{G}.

Lemma 8 is proved in reference Verma and Zhang [2019], where they also showed that 𝐠e\mathbf{g}_{e} can be upper bounded in term of the largest absolute eigenvalue of g𝐱(𝐋)g_{\mathbf{x}}(\mathbf{L}). That is, 𝐠eλG𝐱max\mathbf{g}_{e}\leq\lambda^{max}_{G_{\mathbf{x}}} for all 𝐱\mathbf{x}. This follows from Lemma 8 that

𝐠eλGmax.\displaystyle\mathbf{g}_{e}\leq\lambda^{max}_{G}. (B.6)

Finally, plugging (B.6) and Lemma 7 into (A.1) yields the following result:

βna2aσ2λGmaxηCp,λ𝐠ent=1T(Cp,λ(1+(aσ2+a)η𝐠e2))t1.\displaystyle\beta_{n}\leq a_{\ell}^{2}a_{\sigma}^{2}\lambda^{max}_{G}\frac{\eta C_{p,\lambda}\mathbf{g}_{e}}{n}\sum_{t=1}^{T}\Big{(}C_{p,\lambda}\big{(}1+(a_{\sigma}^{2}+a_{\ell})\eta\mathbf{g}_{e}^{2}\big{)}\Big{)}^{t-1}. (B.7)

Appendix C Additional Experiments and Setup Details

C.1 General Setup

We conduct experiments on three citation network datasets: Citeseer, Cora, and Pubmed Sen et al. [2008]. In every datasets, each document represents as spare bag-of-words feature vector, and the relationship between documents consists of a list of citation links, which can be treated as undirected edges and help to construct the adjacency matrix. These documents can be divided into different classes and have the corresponding class label. The data statistics for the datasets used in Section 5 are summarized in Table 2.

Dataset Nodes Edges Classes Features
Citeseer 3,3273,327 4,7324,732 66 3,7033,703
Cora 2,7082,708 5,4295,429 77 1,4331,433
Pubmed 19,71719,717 44,33844,338 33 500500
Table 2: Dataset Statistics.

C.2 Additional Experimental Results

In this section, we investigate the stability and generalization of p\ell_{p}-regularized stochastic learning for GCN under different normalization steps. We mainly consider the implication of our results in following designing graph convolution filters:

  1. (I)

    Unnormalized Graph Filters: g(𝐋)=𝐀+𝐈g(\mathbf{L})=\mathbf{A}+\mathbf{I};

  2. (II)

    Normalized Graph Filters: g(𝐋)=𝐃1/2𝐀𝐃1/2+𝐈g(\mathbf{L})=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}+\mathbf{I};

  3. (III)

    Random Walk Graph Filters: g(𝐋)=𝐃1𝐀+𝐈g(\mathbf{L})=\mathbf{D}^{-1}\mathbf{A}+\mathbf{I};

  4. (IV)

    Augmented Normalized Graph Filters: g(𝐋)=(𝐃+𝐈)1/2(𝐀+𝐈)(𝐃+𝐈)1/2g(\mathbf{L})=(\mathbf{D}+\mathbf{I})^{-1/2}(\mathbf{A}+\mathbf{I})(\mathbf{D}+\mathbf{I})^{-1/2}.

For each experiment, we initialize the parameters of GCN models with the same random seeds and then train all models for a maximum of 200 epochs using the proposed Inexact Proximal SGD. We repeat the experiments ten times and report the average performance and the standard variance. We quantitatively measure the generalization gap defined as the absolute difference between the training and test errors and the parameter distance 𝐰^𝐰^2/(𝐰^2+𝐰^2)\sqrt{\nicefrac{{\|\widehat{\mathbf{w}}-\widehat{\mathbf{w}}^{\prime}\|^{2}}}{{\big{(}\|\widehat{\mathbf{w}}\|^{2}+\|\widehat{\mathbf{w}}^{\prime}\|^{2}\big{)}}}} between the parameters 𝐰^\widehat{\mathbf{w}} and 𝐰^\widehat{\mathbf{w}}^{\prime} of two models trained on two copies of the data differing in a random substitution.

Refer to caption
Figure 3: Normalized euclidean distance between parameters of two models trained under various pp on three datasets.
Refer to caption
Figure 4: Generalization Gaps under different normalization steps on three datasets.

The unnormalized graph convolution filters and Random Walk Graph Filters show a significantly higher generalization gap than the normalized ones. The results hold consistently across the three datasets under different pp. Hence, these empirical results are also consistent with our generalization error bound. We note that the generalization gap becomes constant after a certain number of iterations.