
Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks

Frederik Wenkel, Yimeng Min, Matthew Hirn, Michael Perlmutter, and Guy Wolf. The first two authors contributed equally; the last two authors jointly supervised the work; G. Wolf is the corresponding author. F. Wenkel and G. Wolf are with Mila (Quebec Artificial Intelligence Institute) and the Department of Mathematics and Statistics, University of Montreal, Montreal, QC H3T 1J4, Canada. Y. Min is with the Department of Computer Science, Cornell University, NY 14850, USA. M. Hirn is with the Department of Computational Mathematics, Science & Engineering, the Department of Mathematics, and the Center for Quantum Computing, Science & Engineering, Michigan State University, East Lansing, MI 48824, USA, and the Midwest Quantum Collaboratory. M. Perlmutter is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA.
Abstract

Geometric deep learning has made great strides towards generalizing the design of structure-aware neural networks from traditional domains to non-Euclidean ones, giving rise to graph neural networks (GNNs) that can be applied to graph-structured data arising in, e.g., social networks, biochemistry, and materials science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models are often constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, leading to oversmoothing. Moreover, to avoid severe oversmoothing, most popular GCN-style networks tend to be shallow, with narrow receptive fields, leading to underreaching. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via geometric scattering. We further introduce an attention framework that allows the model to locally attend over combined information from different filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.

Index Terms:
Geometric Deep Learning, Graph Neural Networks, Geometric Scattering, Oversmoothing, Underreaching.

1 Introduction

Deep learning is typically most effective when the structure of the data can be used to design the architecture of the relevant network. For example, the design of recurrent neural networks is informed by the sequential nature of time-series data. Similarly, the design of convolutional neural networks is based in part on the fact that the pixels of an image are arranged in a rectangular grid. The success of neural networks in these, as well as many other applications, has inspired the rise of geometric deep learning[1, 2], which aims to extend the success of deep learning to other forms of structured data and to develop intelligent methods for data sets that have a non-Euclidean structure.

A common approach in geometric deep learning is to model the data by a graph. In many applications, this is done by defining edges between data points that interact in a specific way, e.g., “friends” on a social network. In many other applications, one may construct a graph from a high-dimensional data set, either by defining an edge between each point and its $k$-nearest neighbors or by defining weighted edges via a similarity kernel. Inspired by the increasing ubiquity of graph-structured data, numerous recent works have shown graph neural networks (GNNs) to perform well in a variety of fields including biology, chemistry and social networks [3, 4, 5]. In these methods, the graph is often considered in conjunction with a set of node features, which contain “local” information about, e.g., each user of a social network.

One common family of tasks are graph-level tasks, where one seeks to learn a whole-graph representation for the purposes of, e.g., predicting properties of proteins [6, 7, 8]. Another common family, which has been the primary focus of graph convolutional networks (GCNs) [5], are node-level tasks such as node classification. There, the entire data set is modeled as one large graph and the network aims to produce a useful representation of each node using both the node features and the graph structure. This work is typically conducted in the semi-supervised setting where one only knows the labels of a small fraction of the nodes.

Many popular state-of-the-art GNNs essentially aim to promote similarity between adjacent nodes, which may be interpreted as a smoothing operation. While this is effective in certain settings, it can also cause a decrease in performance because of the oversmoothing problem [9], where nodes become increasingly indistinguishable from one another after each subsequent layer. In order to partially mitigate the oversmoothing problem, many popular GNNs only use two or three layers. While this does help avoid oversmoothing, it creates a new problem, underreaching, where the network is unable to incorporate long-range dependencies or global geometry.

Here, we propose to augment traditional GNN architectures with novel band-pass filters, in addition to conventional GCN-style filters that essentially perform low-pass filtering [10], in order to extract richer representations for each node. (Throughout this text, we use the term GCN to refer to the network introduced by Kipf and Welling in [5], and the term GNN to refer to graph neural networks, spectral, convolutional, or otherwise, in general.) This approach is based on the geometric scattering transform [11, 12, 13], whose construction is inspired by the Euclidean scattering transform introduced by Mallat in [14], and which utilizes iterative cascades of graph wavelets [15, 16] and pointwise nonlinear activation functions to produce useful graph data representations. Notably, these representations are able to capture both high-frequency information and long-range dependencies.

The main contribution of this work is two hybrid GNN frameworks that utilize both traditional GCN-style filters and also novel filters based on the scattering transform. This approach is based on the following simple idea: GCN-based networks are very useful, but as they aim to enforce similarity among nodes, they essentially focus on low-frequency information. Wavelets, on the other hand, are naturally equipped to capture high-frequency information. In the spatial domain, the GCN-style filters can be thought of as focusing on localized information where the wavelets are able to capture longer-range interactions. Therefore, in a hybrid network, the different channels capture different types of information. Such a network is therefore more powerful than a network that only uses one style of filter.

We also introduce complementary GNN modules that enhance the performance of such hybrid scattering models, including (i) the graph residual convolution, an adaptive low-pass filter that corrects high-frequency noise, and (ii) an attention framework that enables the aggregation of information from different filters individually at every node. We present theoretical results, based on a new notion of graph structural difference, that highlight the sensitivity of scattering filters to graph regularity patterns not captured by GCN filters due to underreaching. Extensive experiments demonstrate the ability of hybrid scattering networks, in (transductive) semi-supervised node classification, to (i) alleviate oversmoothing and (ii) generalize to complex (low-homophily) datasets. Moreover, we show that our framework also translates well to inductive graph-level tasks.

The remainder of this paper is organized as follows. We review related work on GNN models and geometric scattering in Sec. 2 and introduce important concepts that will be used throughout this work in Sec. 3. We then formulate the hybrid scattering network in Sec. 4, followed by a theoretical study of the benefits of such models in Sec. 5. In Sec. 6, we present empirical results before concluding in Sec. 7.

2 Related Work

Theoretical analyses [9, 10] of GCN and related models show that they may be viewed as Laplacian smoothing operations and, from a signal processing perspective, essentially perform low-pass filtering of the graph features. One approach towards addressing this problem is the graph attention network proposed in [17], which uses self-attention mechanisms to adaptively reweight local neighborhoods. In [18], the authors construct a low-rank approximation of the graph Laplacian that efficiently gathers multi-scale information and demonstrate the effectiveness of their method on citation networks and the QM8 quantum chemistry dataset. In [19], the authors take an approach similar to GCN, but use multiple powers of the adjacency matrix to learn higher-order neighborhood information. Finally, in [20] the authors use graph wavelets to extract higher-order neighborhood information.

In addition to the learned networks discussed above, several works [11, 12, 13] have introduced different variations of the graph scattering transform. These papers aim to extend the Euclidean scattering transform of Mallat [14] to graph-structured data and propose predesigned, wavelet-based networks. Extensive theoretical studies [13, 11, 21, 22] show that these networks have desirable stability, invariance, and conservation-of-energy properties. Their practical utility has been established in [12], which primarily focuses on graph classification, and in [23, 24, 25], which use the graph scattering transform to generate molecules. Building on these results, which use handcrafted formulations of the scattering transform, recent work [26] has proposed a framework for data-driven tuning of the traditionally handcrafted geometric scattering design that maintains the theoretical properties of traditional designs, while also showing strong empirical results in whole-graph settings.

3 Geometric Deep Learning Background

3.1 Graph Signal Processing

Let $G=(V,E,w)$ be a weighted graph, characterized by a set of nodes (also called vertices) $V\coloneqq\{v_1,\dots,v_n\}$, a set of undirected edges $E\subseteq V\times V$, and a function $w:E\to(0,\infty)$ assigning positive edge weights to the edges. Let $\mathbf{X}\in\mathbb{R}^{n\times d_X}$ be a node feature matrix. We shall interpret the $i$-th row of $\mathbf{X}$ as representing the features of the node $v_i$, and therefore, we shall denote these rows by either $\mathbf{X}[v_i]\in\mathbb{R}^{1\times d_X}$ or $\mathbf{X}_{v_i}\in\mathbb{R}^{1\times d_X}$. The columns of $\mathbf{X}$, on the other hand, will be denoted by $\mathbf{x}_i$. Each of these columns may be naturally identified with a graph signal, i.e., a function $x_i:V\rightarrow\mathbb{R}$ with $x_i(v_j)\coloneqq\mathbf{x}_i[j]$. In what follows, for simplicity, we will not distinguish between the vectors $\mathbf{x}_i$ and the functions $x_i$ and will refer to both as graph signals.

We define the weighted adjacency matrix $\mathbf{W}\in\mathbb{R}^{n\times n}$ of $G$ by $\mathbf{W}[v_i,v_j]\coloneqq w(v_i,v_j)$ if $\{v_i,v_j\}\in E$, and set all other entries to zero. We further define the degree matrix $\mathbf{D}\in\mathbb{R}^{n\times n}$ as the diagonal matrix $\mathbf{D}\coloneqq\operatorname{diag}(\deg(v_1),\dots,\deg(v_n))$, where $\deg(v_i)\coloneqq\sum_{j=1}^n\mathbf{W}[v_i,v_j]$ is the degree of the node $v_i$. In the following, we will also use the shorthand $d_{v_i}$ to denote the degree of $v_i$. We consider the combinatorial graph Laplacian matrix $\mathbf{L}\coloneqq\mathbf{D}-\mathbf{W}$ and the symmetric normalized Laplacian given by

\[
\boldsymbol{\mathcal{L}} \coloneqq \mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2} = \mathbf{I}_n - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}.
\]

It is well known that this matrix is symmetric, positive semi-definite, and admits an orthonormal basis of eigenvectors $\mathbf{q}_1,\dots,\mathbf{q}_n$ such that $\boldsymbol{\mathcal{L}}\mathbf{q}_i=\lambda_i\mathbf{q}_i$. Therefore, we may write

\[
\boldsymbol{\mathcal{L}} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T = \sum_{i=1}^{n}\lambda_i\,\mathbf{q}_i\mathbf{q}_i^T,
\]

where $\boldsymbol{\Lambda}\coloneqq\operatorname{diag}(\lambda_1,\dots,\lambda_n)$ and $\mathbf{Q}$ is the orthogonal matrix whose $i$-th column is $\mathbf{q}_i$.

We will use the eigendecomposition of $\boldsymbol{\mathcal{L}}$ to define the graph Fourier transform, with the eigenvectors $\mathbf{q}_1,\dots,\mathbf{q}_n$ interpreted as Fourier modes. The notion of oscillation on irregular domains such as graphs is delicate, but it can be reframed in terms of the increasing variation of the modes, with the eigenvalues $0=\lambda_1\leq\lambda_2\leq\dots\leq\lambda_n\leq 2$ interpreted as (squared) frequencies. (This interpretation originates from motivating the graph Fourier transform via the combinatorial graph Laplacian $\mathbf{L}=\sum_{i=1}^n\tilde\lambda_i\tilde{\mathbf{q}}_i\tilde{\mathbf{q}}_i^T$, where $\tilde\lambda_i=\tilde{\mathbf{q}}_i^T\mathbf{L}\tilde{\mathbf{q}}_i=\tfrac{1}{2}\sum_{j,k=1}^n\mathbf{W}[v_j,v_k]\big(\tilde{\mathbf{q}}_i[j]-\tilde{\mathbf{q}}_i[k]\big)^2$ measures the variation of $\tilde{\mathbf{q}}_i$.) The Fourier transform of a graph signal $\mathbf{x}\in\mathbb{R}^n$ is defined by $\hat{\mathbf{x}}[i]\coloneqq\langle\mathbf{x},\mathbf{q}_i\rangle$ for $1\leq i\leq n$, and the inverse Fourier transform may be computed as $\mathbf{x}=\sum_{i=1}^n\hat{\mathbf{x}}[i]\mathbf{q}_i$. It will frequently be convenient to write these equations in matrix form as $\hat{\mathbf{x}}=\mathbf{Q}^T\mathbf{x}$ and $\mathbf{x}=\mathbf{Q}\hat{\mathbf{x}}$.

We recall that in the Euclidean setting, the convolution of two signals in the spatial domain corresponds to the pointwise multiplication of their Fourier transforms. Therefore, we may define the convolution of a signal $\mathbf{x}$ with a filter $\mathbf{g}$ by the rule that $\mathbf{g}\star\mathbf{x}$ is the unique vector such that $(\widehat{\mathbf{g}\star\mathbf{x}})[i]=\hat{\mathbf{g}}[i]\hat{\mathbf{x}}[i]$ for $1\leq i\leq n$. Applying the inverse Fourier transform, one may verify that

\[
\mathbf{g}\star\mathbf{x} = \sum_{i=1}^{n}\hat{\mathbf{g}}[i]\hat{\mathbf{x}}[i]\,\mathbf{q}_i = \sum_{i=1}^{n}\hat{\mathbf{g}}[i]\,\mathbf{q}_i\mathbf{q}_i^T\mathbf{x} = \mathbf{Q}\widehat{\mathbf{G}}\mathbf{Q}^T\mathbf{x}, \tag{1}
\]

where $\widehat{\mathbf{G}}\coloneqq\operatorname{diag}(\hat{\mathbf{g}})=\operatorname{diag}(\hat{\mathbf{g}}[1],\dots,\hat{\mathbf{g}}[n])$. Hence, convolutional graph filters can be parameterized by the Fourier coefficients in $\widehat{\mathbf{G}}$.
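To make the spectral filtering recipe of Eq. 1 concrete, the following minimal NumPy sketch (not the authors' implementation; the small ring graph and the choice $\hat{\mathbf{g}}[i]=2-\lambda_i$ are our own illustrative assumptions) builds the normalized Laplacian of a toy graph, computes the graph Fourier transform of a signal, and applies a filter as $\mathbf{Q}\widehat{\mathbf{G}}\mathbf{Q}^T\mathbf{x}$.

```python
import numpy as np

# Toy example: an unweighted ring graph on six nodes (our own assumption).
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt   # symmetric normalized Laplacian

# Eigendecomposition: columns of Q are the Fourier modes, lam the (squared) frequencies.
lam, Q = np.linalg.eigh(L_sym)

x = np.random.randn(n)              # a graph signal
x_hat = Q.T @ x                     # graph Fourier transform
assert np.allclose(Q @ x_hat, x)    # inverse transform recovers x

# Spectral filtering (Eq. 1): pointwise multiplication in the Fourier domain.
g_hat = 2.0 - lam                   # e.g., a GCN-style low-pass response 2 - lambda_i
y = Q @ np.diag(g_hat) @ Q.T @ x
```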

3.2 Spectral Graph Neural Network Constructions

A graph filter is a function $F:\mathbb{R}^{n\times d_X}\rightarrow\mathbb{R}^{n\times d_Y}$ that transforms a node feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d_X}$ into a new feature matrix $\mathbf{Y}\in\mathbb{R}^{n\times d_Y}$. GNNs typically feature several layers, each of which produces a new set of features by filtering the output of the previous layer. We will usually let $\mathbf{X}^0$ denote the initial node feature matrix, which is the input to the network, and let $\mathbf{X}^{\ell}$ denote the node feature matrix after the $\ell$-th layer.

In light of Eq. 1, a natural way to construct learned graph filters would be to directly learn the Fourier coefficients in $\hat{\mathbf{g}}$. Indeed, this was the approach used in the pioneering work of Bruna et al. [27]. However, this approach has several drawbacks. First, it results in $n$ learnable parameters in each convolutional filter. Therefore, a network using such filters would not scale well to large data sets due to the computational cost. At the same time, such filters are not necessarily well-localized in space and are prone to overfitting [28]. Moreover, networks of the form introduced in [27] typically cannot be generalized to different graphs [1]. However, recent work [29] has shown that this latter issue can be overcome by formulating the Fourier coefficients $\hat{\mathbf{g}}[i]$ as smooth functions of the Laplacian eigenvalues $\lambda_i$, $1\leq i\leq n$. In particular, this will be the case for the filters used in the networks considered in this work.

A common approach (used, e.g., in [28, 5, 30, 31, 18]) to formulate such filters is to use polynomials of the Laplacian eigenvalues, setting $\hat{\mathbf{g}}[i]\coloneqq\sum_{j=0}^{k}\theta_j\lambda_i^j$ (or equivalently $\widehat{\mathbf{G}}=\sum_{j=0}^{k}\theta_j\boldsymbol{\Lambda}^j$) for some $k\ll n$. It can be verified [28] that this approach yields convolutional filters that are $k$-localized in space and that can be written as $\mathbf{g}_{\boldsymbol{\theta}}\star\mathbf{x}=\sum_{j=0}^{k}\theta_j\boldsymbol{\mathcal{L}}^j\mathbf{x}$. This reduces the number of trainable parameters in each filter from $n$ to $k+1$ and allows one to perform convolution without explicitly computing the spectral decomposition of the Laplacian, which is expensive for large graphs.

One particularly noteworthy network that uses this method is [28], which writes the filters in terms of the Chebyshev polynomials defined by $T_0(x)=1$, $T_1(x)=x$, and $T_j(x)\coloneqq 2xT_{j-1}(x)-T_{j-2}(x)$. The authors first renormalize the eigenvalue matrix, $\tilde{\boldsymbol{\Lambda}}\coloneqq 2\boldsymbol{\Lambda}/\lambda_n-\mathbf{I}_n$, and then define $\widehat{\mathbf{G}}\coloneqq\sum_{j=0}^{k}\theta_j T_j(\tilde{\boldsymbol{\Lambda}})$. Letting $\tilde{\boldsymbol{\mathcal{L}}}\coloneqq 2\boldsymbol{\mathcal{L}}/\lambda_n-\mathbf{I}_n$ yields

\[
\mathbf{g}_{\boldsymbol{\theta}}\star\mathbf{x} = \sum_{j=0}^{k}\theta_j\,T_j(\tilde{\boldsymbol{\mathcal{L}}})\,\mathbf{x}. \tag{2}
\]
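As an illustration of Eq. 2, here is a minimal NumPy sketch of Chebyshev filtering (the function name `chebyshev_filter` and the coefficient vector `theta` are our own; this is not the implementation of [28]). It applies the recursion $T_j(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}=2\tilde{\boldsymbol{\mathcal{L}}}\,T_{j-1}(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}-T_{j-2}(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}$, so no eigendecomposition is needed.

```python
import numpy as np

def chebyshev_filter(L_sym, x, theta, lam_max=2.0):
    """Apply g_theta * x = sum_j theta_j T_j(L_tilde) x (Eq. 2) via the
    Chebyshev recursion, without an explicit eigendecomposition."""
    n = L_sym.shape[0]
    L_tilde = 2.0 * L_sym / lam_max - np.eye(n)   # rescale the spectrum to [-1, 1]
    t_prev, t_curr = x, L_tilde @ x               # T_0(L)x = x, T_1(L)x = L_tilde x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for j in range(2, len(theta)):
        t_next = 2.0 * L_tilde @ t_curr - t_prev  # T_j = 2x T_{j-1} - T_{j-2}
        out = out + theta[j] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```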

3.3 Graph Convolutional Networks

One of the most widely used GNNs is the Graph Convolutional Network (GCN) [5]. This network is derived from the Chebyshev construction [28] mentioned above by setting $k=1$ in Eq. 2 and approximating $\lambda_n\approx 2$, which yields

\[
\mathbf{g}_{\theta_0,\theta_1}\star\mathbf{x} \approx \theta_0\mathbf{x} + \theta_1(\boldsymbol{\mathcal{L}}-\mathbf{I}_n)\mathbf{x} = \theta_0\mathbf{x} - \theta_1\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\mathbf{x}.
\]

To further reduce the number of trainable parameters, the authors then set $\theta\coloneqq\theta_0=-\theta_1$. The resulting convolutional filter has the form

\[
\mathbf{g}_{\theta}\star\mathbf{x} = \theta\left(\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)\mathbf{x}. \tag{3}
\]

One may verify that $\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}=2\mathbf{I}_n-\boldsymbol{\mathcal{L}}$, and therefore Eq. 3 essentially corresponds to setting $\hat{\mathbf{g}}[i]\coloneqq\theta(2-\lambda_i)$ in Eq. 1. The eigenvalues of $\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ take values in $[0,2]$. Thus, to avoid vanishing or exploding gradients, the authors use a renormalization trick

\[
\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} \;\rightarrow\; \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}, \tag{4}
\]

where $\tilde{\mathbf{W}}\coloneqq\mathbf{I}_n+\mathbf{W}$ and $\tilde{\mathbf{D}}$ is the diagonal matrix with $\tilde{\mathbf{D}}[v_i,v_i]\coloneqq\sum_{j=1}^{n}\tilde{\mathbf{W}}[v_i,v_j]$ for $i\in[n]$. Setting $\mathbf{A}\coloneqq\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}$ and using multiple channels, we obtain a layer-wise propagation rule of the form $\mathbf{x}_j^{\ell}=\sigma\big(\sum_{i=1}^{N_{\ell-1}}\theta_{ij}^{\ell}\,\mathbf{A}\mathbf{x}_i^{\ell-1}\big)$, where $N_{\ell}$ is the number of channels used in the $\ell$-th layer and $\sigma(\cdot)$ is an elementwise nonlinearity. In matrix form, we write

\[
\mathbf{X}^{\ell} = F_{\text{gcn}}\left(\mathbf{X}^{\ell-1}\right) = \sigma\left(\mathbf{A}\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}\right). \tag{5}
\]

We interpret the matrix $\mathbf{A}$ as computing a localized average of each channel $\mathbf{x}_i^{\ell-1}$ around each node, and the matrix $\boldsymbol{\Theta}^{\ell}$ as sharing information across channels. This filter can also be written at the node level as

\[
\mathbf{X}^{\ell}[v] = \sigma\left(\sum_{w\in\mathcal{N}_{\underline{v}}}\frac{1}{\sqrt{(d_v+1)(d_w+1)}}\,\mathbf{X}^{\ell-1}[w]\,\boldsymbol{\Theta}^{\ell}\right),
\]

where $\mathcal{N}_v$ denotes the one-step neighborhood of node $v$ and $\mathcal{N}_{\underline{v}}\coloneqq\mathcal{N}_v\cup\{v\}$. This process can be split into three steps:

\[
\mathbf{X}_a^{\ell}[w] = \mathbf{X}^{\ell-1}[w]\,\boldsymbol{\Theta}^{\ell} \quad\text{for all } w\in\mathcal{N}_{\underline{v}}, \tag{6a}
\]
\[
\mathbf{X}_b^{\ell}[v] = \sum_{w\in\mathcal{N}_{\underline{v}}}\frac{1}{\sqrt{(d_v+1)(d_w+1)}}\,\mathbf{X}_a^{\ell}[w], \tag{6b}
\]
\[
\mathbf{X}_c^{\ell}[v] = \sigma\left(\mathbf{X}_b^{\ell}[v]\right), \tag{6c}
\]

which we refer to as the transformation step (Eq. 6a), the aggregation step (Eq. 6b) and the activation step (Eq. 6c).
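The three steps above can be condensed into a few lines. The following sketch (a simplified, dense NumPy illustration rather than the reference GCN implementation; the choice of ReLU as $\sigma$ is our own default) applies the renormalization trick of Eq. 4 and the matrix-form update of Eq. 5.

```python
import numpy as np

def gcn_layer(W, X, Theta, sigma=lambda z: np.maximum(z, 0.0)):
    """One GCN layer X^l = sigma(A X^{l-1} Theta) with the renormalization
    trick A = D_tilde^{-1/2} (I + W) D_tilde^{-1/2} (Eqs. 4-5)."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)                   # add self-loops
    d_tilde = W_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A = D_inv_sqrt @ W_tilde @ D_inv_sqrt     # renormalized propagation matrix
    return sigma(A @ X @ Theta)               # aggregate, transform, activate
```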

As discussed earlier, the GCN filter described above may be viewed as a low-pass filter that suppresses high-frequency information. For simplicity, we focus on the convolution in Eq. 3 before the renormalization. This convolution essentially corresponds to pointwise Fourier multiplication by $\hat{\mathbf{g}}[i]=\theta(2-\lambda_i)$, which is strictly decreasing in $\lambda_i$. Therefore, repeated applications of this filter effectively zero out the higher frequencies. This is consistent with the oversmoothing problem discussed in [9].

3.4 Graph Attention Networks

Another popular network that is widely used for node classification tasks is the graph attention network (GAT) [17], which uses an attention mechanism to guide and adjust the aggregation of features from adjacent nodes. First, the node features $\mathbf{X}^{\ell-1}$ are linearly transformed to $\bar{\mathbf{X}}^{\ell}\coloneqq\mathbf{X}^{\ell-1}\boldsymbol{\Theta}$ using a learned weight matrix $\boldsymbol{\Theta}\in\mathbb{R}^{d\times d'}$. Then, the aggregation coefficients are learned via

\[
\alpha_{v\leftarrow u} = \frac{\exp\left(\operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}[v]\,\|\,\bar{\mathbf{X}}^{\ell}[u]\right]\mathbf{a}\right)\right)}{\sum_{w\in\mathcal{N}_{\underline{v}}}\exp\left(\operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}[v]\,\|\,\bar{\mathbf{X}}^{\ell}[w]\right]\mathbf{a}\right)\right)},
\]

where $\mathbf{a}\in\mathbb{R}^{2d'}$ is a shared attention vector and $\|$ denotes horizontal concatenation. The output feature corresponding to a single attention head is given by $\mathbf{X}^{\ell}[v]=\sigma\big(\sum_{u\in\mathcal{N}_v}\alpha_{v\leftarrow u}\bar{\mathbf{X}}^{\ell}[u]\big)$. To increase the expressivity of this network, the authors then use a multi-headed attention mechanism, with $\Gamma$ heads, to generate the concatenated features

\[
\mathbf{X}^{\ell}[v] = \big\|_{\gamma=1}^{\Gamma}\,\sigma\left(\sum_{u\in\mathcal{N}_{\underline{v}}}\alpha_{v\leftarrow u}^{\gamma}\,\bar{\mathbf{X}}_{\gamma}^{\ell}[u]\right). \tag{7}
\]
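For concreteness, the following sketch implements a single GAT-style attention head as described above, using dense NumPy operations (a toy illustration rather than the reference implementation; the ReLU output nonlinearity and the LeakyReLU slope of 0.2 are our own assumptions).

```python
import numpy as np

def gat_head(W, X, Theta, a, leaky_slope=0.2):
    """One attention head: coefficients alpha_{v<-u} over N_v plus a self-loop,
    followed by the attention-weighted aggregation of transformed features."""
    n = W.shape[0]
    X_bar = X @ Theta                              # shared linear transform
    adj = (W > 0) | np.eye(n, dtype=bool)          # neighbors plus self-loop
    H = np.full((n, n), -np.inf)                   # unnormalized attention logits
    for v in range(n):
        for u in range(n):
            if adj[v, u]:
                z = np.concatenate([X_bar[v], X_bar[u]]) @ a
                H[v, u] = z if z > 0 else leaky_slope * z   # LeakyReLU
    alpha = np.exp(H - H.max(axis=1, keepdims=True))        # softmax over neighbors
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.maximum(alpha @ X_bar, 0.0)                    # sigma = ReLU (assumed)
```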

3.5 Challenges in Geometric Deep Learning

Many GNN models, including GCN [5] and GAT [17], are subject to the so-called oversmoothing problem [9], caused by aggregation steps (such as Eq. 6b) that essentially consist of localized averaging operations. As discussed in Sec. 3.3 and also [10], from a signal processing point of view, this corresponds to a low-pass filtering of graph signals. Moreover, as discussed in [32], these networks are also subject to underreaching. Most GNNs (including GCN and GAT) can only relate information from nodes within a distance equal to the number of GNN layers, and because of the oversmoothing problem, they typically use a small number of layers in practice. Therefore, the oversmoothing and underreaching problems combine to significantly limit the ability of GNNs to capture long-range dependencies. In Sec. 4, we will introduce a hybrid network, which aims to address these challenges by using both GCN-style channels and channels based on the geometric scattering transform discussed below.

3.6 Geometric Scattering

In this section, we review the geometric scattering transform constructed in [12] for graph classification and show how it may be adapted for node-level tasks. As we shall see, this node-level geometric scattering will address the challenges discussed above in Sec. 3.5, by using band-pass filters that capture high-frequency information and have wider receptive fields.

The geometric scattering transform uses wavelets based upon raising the lazy random walk matrix

\[
\mathbf{P} \coloneqq \frac{1}{2}\big(\mathbf{I}_n + \mathbf{W}\mathbf{D}^{-1}\big),
\]

to dyadic powers $2^j$, which can be interpreted as differing degrees of resolution. The right eigenvectors and eigenvalues of $\mathbf{P}$ are $\mathbf{u}_i\coloneqq\mathbf{D}^{1/2}\mathbf{q}_i$ and $\omega_i\coloneqq 1-\lambda_i/2$, $1\leq i\leq n$, respectively. Entrywise, we note that

\[
(\mathbf{P}\mathbf{X})[v] = \frac{1}{2}\mathbf{X}[v] + \frac{1}{2}\sum_{w\in\mathcal{N}_v} d_w^{-1}\,\mathbf{X}[w]. \tag{8}
\]

Thus, $\mathbf{P}$ may be viewed as a localized averaging operator analogous to those used in, e.g., GCN, and the powers $\mathbf{P}^{2^j}$ may be viewed as low-pass filters that suppress high frequencies. In order to better retain this high-frequency information, we define multiscale graph diffusion wavelets by subtracting these low-pass filters at different scales [16]. Specifically, for $k\in\mathbb{N}_0$, we define a wavelet $\boldsymbol{\Psi}_k\in\mathbb{R}^{n\times n}$ at scale $2^k$ by

\[
\boldsymbol{\Psi}_0 \coloneqq \mathbf{I}_n - \mathbf{P}, \qquad \boldsymbol{\Psi}_k \coloneqq \mathbf{P}^{2^{k-1}} - \mathbf{P}^{2^{k}}, \quad k\geq 1. \tag{9}
\]

We may interpret each $\boldsymbol{\Psi}_k$ as capturing information at a different frequency band. From a spatial perspective, we may view $\boldsymbol{\Psi}_k$ as encoding how a $2^k$-step neighborhood differs from a smaller $2^{k-1}$-step one. Such wavelets are usually organized in a filter bank

\[
\{\boldsymbol{\Psi}_k, \boldsymbol{\Phi}_K\}_{0\leq k\leq K}, \tag{10}
\]

along with a low-pass filter $\boldsymbol{\Phi}_K\coloneqq\mathbf{P}^{2^{K}}$. Proposition 2.2 of [22] (restated here as Proposition 1) shows that this filter bank is a self-adjoint, non-expansive frame on a suitably weighted inner-product space.
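A minimal sketch of the filter bank in Eqs. 9-10 is given below (dense NumPy, suitable only for small graphs; the function name `wavelet_filter_bank` is our own). It computes the powers $\mathbf{P}^{2^k}$ by repeated squaring and forms the wavelets by subtraction.

```python
import numpy as np

def wavelet_filter_bank(W, K):
    """Build the lazy-random-walk diffusion wavelets Psi_0, ..., Psi_K and the
    low-pass filter Phi_K of Eqs. 9-10 as dense matrices."""
    n = W.shape[0]
    P = 0.5 * (np.eye(n) + W @ np.diag(1.0 / W.sum(axis=0)))   # P = (I + W D^{-1}) / 2
    psis = [np.eye(n) - P]                                     # Psi_0 = I - P
    P_pow = P                                                  # holds P^{2^{k-1}}
    for k in range(1, K + 1):
        P_next = P_pow @ P_pow                                 # P^{2^k} by squaring
        psis.append(P_pow - P_next)                            # Psi_k = P^{2^{k-1}} - P^{2^k}
        P_pow = P_next
    return psis, P_pow                                         # P_pow is now Phi_K = P^{2^K}
```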

Figure 1: Illustration of two-layer geometric scattering at the node level ($\mathbf{U}(\mathbf{x})=\{\mathbf{U}_p\mathbf{x} : p\in\mathbb{N}_0^m,\, m=0,1,2\}$) and at the graph level ($\mathbf{S}(\mathbf{x})=\{\mathbf{S}_{p,q}\mathbf{x} : q\in\mathbb{N},\, p\in\mathbb{N}_0^m,\, m=0,1,2\}$), extracted according to the wavelet cascade in Eqs. 9-12.

The geometric scattering transform is a multi-layered architecture that iteratively applies wavelet convolutions and nonlinear activation functions in an alternating cascade, as illustrated in Fig. 1. It is parameterized by paths $p\coloneqq(k_1,\dots,k_m)$. Formally, we define

\[
\mathbf{U}_p\mathbf{x} \coloneqq \boldsymbol{\Psi}_{k_m}\circ\sigma\circ\boldsymbol{\Psi}_{k_{m-1}}\circ\cdots\circ\sigma\circ\boldsymbol{\Psi}_{k_2}\circ\sigma\circ\boldsymbol{\Psi}_{k_1}\mathbf{x}, \tag{11}
\]

where $\sigma$ is a nonlinear activation function. (In a slight deviation from previous work, here $\mathbf{U}_p$ does not include the outermost nonlinearity in the cascade.) We note that in other work focusing on graph classification, the scattering features are frequently aggregated into $q$-th order moments,

\[
\mathbf{S}_{p,q}\mathbf{x} \coloneqq \sum_{i=1}^{n}\left|(\mathbf{U}_p\mathbf{x})[v_i]\right|^q. \tag{12}
\]

We also note that the nonlinearity $\sigma$ might vary in each step of the cascade. However, we will suppress this possible dependence to avoid cumbersome notation. We also note that in our theoretical results, if we assume, for example, that $\sigma$ is strictly increasing, this assumption is intended to apply to all nonlinearities used in the cascade. In our experiments, we use the absolute value, i.e., $\sigma(\mathbf{x}[v_i])=|\mathbf{x}[v_i]|$.

The original formulations of geometric scattering were fully designed networks without any learned convolutions between channels. Here, we will incorporate learning by defining the following scattering propagation rule similar to the one used by GCN:

\[
\mathbf{X}^{\ell} = F_{p\text{-scat}}\left(\mathbf{X}^{\ell-1}\right) = \sigma\left(\mathbf{U}_p\big(\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}\big)\right). \tag{13}
\]

Analogously to GCN, we note that for $v\in V$, the scattering propagation rule can be decomposed into three steps:

\[
\mathbf{X}_{a'}^{\ell}[v] = \mathbf{X}^{\ell-1}[v]\,\boldsymbol{\Theta}^{\ell}, \tag{14a}
\]
\[
\mathbf{X}_{b'}^{\ell}[v] = \left(\mathbf{U}_p(\mathbf{X}_{a'}^{\ell})\right)[v], \tag{14b}
\]
\[
\mathbf{X}_{c'}^{\ell}[v] = \sigma\left(\mathbf{X}_{b'}^{\ell}[v]\right)^q. \tag{14c}
\]

Importantly, we note that the scattering transform addresses the underreaching problem, as the wavelets $\boldsymbol{\Psi}_k=\mathbf{P}^{2^{k-1}}-\mathbf{P}^{2^{k}}$ that are leveraged in the aggregation step (Eq. 14b) have larger receptive fields than most traditional GNNs. However, scattering does not result in oversmoothing because the subtraction results in band-pass filters rather than low-pass filters. In this manner, the scattering transform addresses the challenges discussed in the previous subsection.
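The scattering cascade of Eq. 11 and the propagation rule of Eq. 13 can then be sketched as follows, reusing the hypothetical `wavelet_filter_bank` helper from the sketch above; as in our experiments, the absolute value is used as the nonlinearity.

```python
import numpy as np

def scattering_features(psis, X, path, sigma=np.abs):
    """U_p X for a path p = (k_1, ..., k_m) (Eq. 11): alternate wavelet
    filtering and the pointwise nonlinearity, omitting the outermost sigma."""
    out = psis[path[0]] @ X
    for k in path[1:]:
        out = psis[k] @ sigma(out)
    return out

def scattering_layer(psis, X, Theta, path, sigma=np.abs):
    """Band-pass propagation rule of Eq. 13: X^l = sigma(U_p(X^{l-1} Theta^l))."""
    return sigma(scattering_features(psis, X @ Theta, path))
```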

4 Hybrid Scattering Networks

Here, we introduce two hybrid networks that combine aspects of GCN and the geometric scattering transform discussed in the previous section. Our networks will use both low-pass and band-pass filters in different channels to capture different types of information. As a result, our hybrid networks will have greater expressive power than either traditional GCNs, which only use low-pass filters, or a pure scattering network, which only uses band-pass filters.

In our low-pass channels, we use modified GCN filters, which are similar to those used in Eq. 5 but use higher powers of $\mathbf{A}$ and include bias terms. Specifically, we use a channel update rule of the form

\[
\mathbf{X}_{\text{low},i}^{\ell} \coloneqq \sigma\left(\mathbf{A}^{r_i}\mathbf{X}^{\ell-1}\boldsymbol{\Theta}_{\text{low},i}^{\ell} + \mathbf{B}_{\text{low},i}^{\ell}\right) \tag{15}
\]

for $1\leq i\leq C_{\text{low}}$. We note, in particular, that the use of higher powers of $\mathbf{A}$ enables a wider receptive field (of radius $r_i$) without increasing the number of trainable parameters (unlike in GCN).

Similarly, in our band-pass channels, we use a version of Eq. 13, with an added bias term, and our update rule is

\[
\mathbf{X}_{\text{band},i}^{\ell} \coloneqq \sigma\left(\mathbf{U}_{p_i}\big(\mathbf{X}^{\ell-1}\boldsymbol{\Theta}_{\text{band},i}^{\ell}\big) + \mathbf{B}_{\text{band},i}^{\ell}\right) \tag{16}
\]

for $1\leq i\leq C_{\text{band}}$. Here, similarly to Eq. 11, $p_i$ is a path that determines the cascade of wavelets used in the $i$-th channel.

Aggregation module. Each hybrid layer uses $C_{\text{low}}$ low-pass and $C_{\text{band}}$ band-pass channels, described in Eq. 15 and Eq. 16, to transform the node features $\mathbf{X}^{\ell-1}$. The resulting representations $\{\mathbf{X}_{\text{low},i}^{\ell}\}_{i=1}^{C_{\text{low}}}$ and $\{\mathbf{X}_{\text{band},i}^{\ell}\}_{i=1}^{C_{\text{band}}}$ are aggregated into a new $d_{\ell}$-dimensional representation $\mathbf{X}^{\ell}$ via an aggregation module, such as those discussed in Sections 4.1 and 4.2:

\[
\mathbf{X}^{\ell} \coloneqq \operatorname{AGG}\left(\{\mathbf{X}_{\text{low},i}^{\ell}\}_{i=1}^{C_{\text{low}}},\, \{\mathbf{X}_{\text{band},i}^{\ell}\}_{i=1}^{C_{\text{band}}}\right). \tag{17}
\]

Graph Residual Convolution. After aggregating the outputs of the low-pass and band-pass channels in Eq. 17, we apply the graph residual convolution, which acts as a low-pass filter and aims to eliminate any high-frequency noise introduced by the scattering channels. This noise can arise, e.g., if there are different label rates in different graph substructures. In this case, the scattering features may learn the difference between labeled and unlabeled nodes and thereby produce high-frequency noise.

This filter uses a modified diffusion matrix given by

\[
\mathbf{A}_{\text{res}}(\alpha) = \frac{1}{\alpha+1}\left(\mathbf{I}_n + \alpha\,\mathbf{W}\mathbf{D}^{-1}\right),
\]

where the hyperparameter $\alpha$ determines the magnitude of the low-pass filtering. Choosing $\alpha=0$ yields the identity (no filtering), while $\alpha\rightarrow\infty$ results in the random walk matrix $\mathbf{R}\coloneqq\mathbf{W}\mathbf{D}^{-1}$. Thus, $\mathbf{A}_{\text{res}}$ can be interpreted as lying between a completely lazy random walk that never moves and a non-resting walk $\mathbf{R}$ that moves at every time step.

The full residual convolution update rule is given by

\[
\mathbf{X}^{\ell+1} \coloneqq \mathbf{A}_{\text{res}}(\alpha)\,\mathbf{X}^{\ell}\,\boldsymbol{\Theta}_{\text{res}} + \mathbf{B}_{\text{res}}.
\]

The multiplication with $\boldsymbol{\Theta}_{\text{res}}$ corresponds to a fully connected layer applied to the output of Eq. 17 (after filtering by $\mathbf{A}_{\text{res}}$), with each neuron learning a linear combination of the signals output by the aggregation module.
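A minimal sketch of the graph residual convolution (dense NumPy; the function name and argument layout are our own) is given below.

```python
import numpy as np

def graph_residual_conv(W, X, Theta_res, B_res, alpha):
    """Graph residual convolution: X^{l+1} = A_res(alpha) X^l Theta_res + B_res,
    with A_res(alpha) = (I + alpha W D^{-1}) / (alpha + 1)."""
    n = W.shape[0]
    R = W @ np.diag(1.0 / W.sum(axis=0))            # random walk matrix W D^{-1}
    A_res = (np.eye(n) + alpha * R) / (alpha + 1.0)
    return A_res @ X @ Theta_res + B_res
```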

4.1 Scattering GCN

The Scattering GCN (Sc-GCN) network was first introduced in [33]. Here, the aggregation module concatenates the filter responses horizontally, yielding wide node representations $\mathbf{X}^{\ell}$ of the form

\[
\left[\mathbf{X}_{\text{low},1}^{\ell} \,\|\, \cdots \,\|\, \mathbf{X}_{\text{low},C_{\text{low}}}^{\ell} \,\|\, \mathbf{X}_{\text{band},1}^{\ell} \,\|\, \cdots \,\|\, \mathbf{X}_{\text{band},C_{\text{band}}}^{\ell}\right]. \tag{18}
\]

Sc-GCN then learns relationships between the channels via the graph residual convolution.

Figure 2: (a) & (b) comparison between GCN and our Sc-GCN: we add band-pass channels to collect different frequency components; (c) graph residual convolution layer; (d) Sc-GCN combines five network channels, followed by a graph residual convolution.

Configuration. The primary goal of Sc-GCN is to alleviate oversmoothing in popular semi-supervised node-level tasks on, e.g., citation networks. As regularity patterns in such datasets are often dominated by low-frequency information, such as inter-cluster node similarities, we choose our parameters to focus on low-frequency information. We use three low-pass filters, with receptive fields of size (or radius) $r_k=1,2,3$, and two band-pass filters. We use $\sigma(\cdot)\coloneqq|\cdot|$ as our nonlinearity in all steps except the outermost one. Inspired by the aggregation step in classical geometric scattering [12], for the outermost nonlinearity we additionally apply the $q$-th power at the node level, i.e., $\sigma(\cdot)\coloneqq|\cdot|^q$.

The paths $p$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.

4.2 Geometric Scattering Attention Network

An important observation about Sc-GCN is that the model decides globally how to combine information from the different channels. The network first concatenates all of the features from the low-pass and band-pass channels in Eq. 18 and then combines these features via multiplication with the weight matrix $\boldsymbol{\Theta}_{\text{res}}$. However, for complex tasks or datasets, important regularity patterns may vary significantly across different graph regions. In such settings, a model should ideally attend locally over the aggregation and adaptively combine filter responses at different nodes.

This observation inspired the design of the Geometric Scattering Attention Network (GSAN) [34]. Drawing inspiration from  [35], GSAN uses an aggregation module based on a multi-head node-level attention framework. However, the attention mechanism used in GSAN differs from [35] by attending over the combination of the different filter responses rather than over the combination of node features from neighboring nodes. We will first focus on the processing performed independently by each attention head and then discuss a multi-head configuration.

Figure 3: Illustration of the proposed scattering attention layer.

In a slight deviation from Sc-GCN, the weight matrices in Eq. 15 and Eq. 16 are shared across the filters of each attention head. Therefore, both aggregation steps (Eq. 6b and Eq. 14b) take the same input, denoted by $\bar{\mathbf{X}}^{\ell}\coloneqq\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}$. Next, we define $\bar{\mathbf{X}}_{\text{low},i}^{\ell}\coloneqq\mathbf{A}^{r_i}\bar{\mathbf{X}}^{\ell}$ and $\bar{\mathbf{X}}_{\text{band},i}^{\ell}\coloneqq|\mathbf{U}_{p_i}(\bar{\mathbf{X}}^{\ell})|$ to be the outputs of the aggregation steps Eq. 6b and Eq. 14b, with the bias terms set to zero. We then compute attention score vectors $\mathbf{e}_{\text{low},i}^{\ell},\mathbf{e}_{\text{band},i}^{\ell}\in\mathbb{R}^n$ that will be used to determine the importance of each filter response at every single node. We calculate

\[
\mathbf{e}_{\text{low},i}^{\ell} = \operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}\,\|\,\bar{\mathbf{X}}_{\text{low},i}^{\ell}\right]\mathbf{a}\right),
\]

with $\mathbf{e}_{\text{band},i}^{\ell}$ defined analogously, and with $\mathbf{a}\in\mathbb{R}^{2d_{\ell}}$ being a learned attention vector that is shared across all filters of the attention head. These attention scores are then normalized across all filters using the softmax function. Specifically, we define

\[
\boldsymbol{\alpha}_{\text{low},j}^{\ell} \coloneqq \frac{\exp(\mathbf{e}_{\text{low},j}^{\ell})}{\sum_{i=1}^{C_{\text{low}}}\exp(\mathbf{e}_{\text{low},i}^{\ell}) + \sum_{i=1}^{C_{\text{band}}}\exp(\mathbf{e}_{\text{band},i}^{\ell})},
\]

where the exponential function is applied elementwise, and define $\boldsymbol{\alpha}_{\text{band},j}^{\ell}$ analogously. Finally, for every node, the filter responses are summed together, weighted by the corresponding (node-to-filter) attention scores. We also normalize by the number of filters $C\coloneqq C_{\text{low}}+C_{\text{band}}$, which gives

\[
\mathbf{X}^{\ell} = C^{-1}\,\tilde{\sigma}\left(\sum_{j=1}^{C_{\text{low}}}\boldsymbol{\alpha}_{\text{low},j}^{\ell}\odot\bar{\mathbf{X}}_{\text{low},j}^{\ell} + \sum_{j=1}^{C_{\text{band}}}\boldsymbol{\alpha}_{\text{band},j}^{\ell}\odot\bar{\mathbf{X}}_{\text{band},j}^{\ell}\right),
\]

where $\boldsymbol{\alpha}\odot\mathbf{X}\coloneqq\operatorname{diag}(\boldsymbol{\alpha})\mathbf{X}$ denotes the Hadamard (elementwise) product of $\boldsymbol{\alpha}$ with each column of $\mathbf{X}$. We further use $\tilde{\sigma}(\cdot)=\operatorname{ReLU}(\cdot)$ in the equation above.
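The following sketch summarizes one GSAN attention head (dense NumPy; it reuses the hypothetical `scattering_features` helper from Sec. 3.6, and the function name and argument layout are our own), computing the shared transform, the low-pass and band-pass filter responses, and the node-wise softmax over filters.

```python
import numpy as np

def gsan_attention_head(A, psis, X, Theta, a, radii, paths):
    """One attention head: shared transform X_bar = X Theta, low-pass responses
    A^{r_i} X_bar, band-pass responses |U_{p_i}(X_bar)|, then a per-node softmax
    over the C = C_low + C_band filter responses."""
    X_bar = X @ Theta
    responses = [np.linalg.matrix_power(A, r) @ X_bar for r in radii]
    responses += [np.abs(scattering_features(psis, X_bar, p)) for p in paths]
    # Node-wise attention scores e_i = LeakyReLU([X_bar || response_i] a).
    scores = []
    for resp in responses:
        z = np.concatenate([X_bar, resp], axis=1) @ a
        scores.append(np.where(z > 0, z, 0.2 * z))
    S = np.stack(scores, axis=1)                        # shape (n, C)
    att = np.exp(S - S.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)          # softmax across the C filters
    C = len(responses)
    out = sum(att[:, i:i + 1] * responses[i] for i in range(C))
    return np.maximum(out, 0.0) / C                     # ReLU, then normalize by C
```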

Multi-head Attention. Similar to other applications of attention mechanisms [35], we use multi-head attention to stabilize the training, thus yielding as output of the aggregation module (by a slight abuse of notation)

\[
\mathbf{X}^{\ell} = \big\|_{\gamma=1}^{\Gamma}\,\mathbf{X}_{\gamma}^{\ell}\left(\boldsymbol{\Theta}_{\gamma}^{\ell},\,\boldsymbol{\alpha}_{\gamma}^{\ell}\right), \tag{19}
\]

concatenating $\Gamma$ attention heads. As explained above, each attention head has individually trained parameters $\boldsymbol{\Theta}_{\gamma}^{\ell}$ and $\boldsymbol{\alpha}_{\gamma}^{\ell}$ and outputs a filter response $\mathbf{X}_{\gamma}^{\ell}$. Similar to Sc-GCN, this concatenation results in wide node representations that are further refined by the graph residual convolution.

Configuration. For GSAN, we set $C_{\text{low}}=C_{\text{band}}=3$, giving the model balanced access to both low-pass and band-pass information. The aggregation process of the attention layer is shown in Fig. 3, where $\mathbf{U}_1,\mathbf{U}_2,\mathbf{U}_3$ represent three first-order scattering transformations with $\mathbf{U}_1\mathbf{x}\coloneqq\boldsymbol{\Psi}_1\mathbf{x}$, $\mathbf{U}_2\mathbf{x}\coloneqq\boldsymbol{\Psi}_2\mathbf{x}$ and $\mathbf{U}_3\mathbf{x}\coloneqq\boldsymbol{\Psi}_3\mathbf{x}$. The number of attention heads $\Gamma$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.

5 Theory of Hybrid Scattering Networks

In this section, we compare the descriptive power of the scattering transform to a class of GNNs defined below, which we refer to as aggregate-combine GNNs (AC-GNNs). We note that in practice, AC-GNNs are typically implemented with relatively narrow receptive fields in order to avoid oversmoothing. However, this leads to the underreaching problem, as discussed in Sections 1 and 3.5. The scattering transform, on the other hand, is able to utilize a large receptive field while avoiding oversmoothing. Therefore, it is able to simultaneously overcome both the oversmoothing and underreaching problems.

The family of AC-GNNs includes many popular GNN architectures such as GCN [5], GIN [36] and GraphSAGE [4]. However, importantly, we note that it does not include the scattering transform (because AC-GNNs only use local – one-step neighborhood – aggregations). We also note that this definition of an AC-GNN is quite similar to analogous notions used in, e.g., [32, 36].

Definition 1 (Aggregate-Combine GNN).

An aggregate-combine GNN (AC-GNN) is a network in which the update rule for each layer is defined at the node level according to

\[
F_{\text{ac}}(\mathbf{X})_v \coloneqq \sigma\left(\operatorname{com}\left(f(\mathbf{X}_v, d_v),\ \operatorname{agg}_{w\in\mathcal{N}_v} g(\mathbf{X}_w, d_v, d_w)\right)\right), \tag{20}
\]

for arbitrary functions $f:\mathbb{R}^{1\times d}\times\mathbb{N}\rightarrow\mathbb{R}^{1\times d'}$, $g:\mathbb{R}^{1\times d}\times\mathbb{N}^2\rightarrow\mathbb{R}^{1\times d'}$, and $\operatorname{com}:\mathbb{R}^{1\times d'}\times\mathbb{R}^{1\times d'}\rightarrow\mathbb{R}^{1\times d'}$, with $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ an (optional) activation function applied elementwise, and $\operatorname{agg}$ any function mapping a multi-set of vectors in $\mathbb{R}^{1\times d'}$ (i.e., $\{\!\!\{\mathbf{a}_i\in\mathbb{R}^{1\times d'}\}\!\!\}_{i\in\mathcal{I}}$) to a vector in $\mathbb{R}^{1\times d'}$.

As discussed in Section 3.3, the aggregations used in many popular AC-GNNs effectively act as localized averaging operators and focus primarily on low-frequency information. Therefore, deep AC-GNNs have the undesirable property of oversmoothing the input features if too many layers are used. As a result, most AC-GNNs typically only use two or three layers in order to avoid the oversmoothing problem. To understand this at an intuitive level, we follow the lead of [37] and consider a simplified version of GCN in which $\sigma$ is the identity, the weight matrix is given by $\boldsymbol{\Theta}=\frac{1}{2}\mathbf{I}_n$, and the renormalization trick is not applied. In this case, the filter used in GCN is given by $\frac{1}{2}\left(\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)$, and so we have $\hat{\mathbf{g}}[i]=1-\lambda_i/2$ in Eq. 1. This implies that if $\mathbf{x}$ is any vector which is orthogonal to the lead eigenvector, the representation of $\mathbf{x}$ after $\ell$ layers satisfies

\[
\left\|\left(\frac{1}{2}\left(\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)\right)^{\ell}\mathbf{x}\right\|_2 \leq \left(1-\frac{\lambda_2}{2}\right)^{\ell}\|\mathbf{x}\|_2.
\]

Therefore, the output of a deep (simplified) GCN will converge exponentially fast to the projection of $\mathbf{x}$ onto the bottom eigenspace, which is why GCNs with many layers typically achieve poor performance.
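The decay bound above is easy to verify numerically. The short sketch below (a toy ring graph of our own choosing) applies the simplified GCN filter repeatedly to a signal orthogonal to the lead eigenvector and compares its norm to $(1-\lambda_2/2)^\ell$.

```python
import numpy as np

# Toy demonstration of the exponential decay bound: repeatedly applying the
# simplified GCN filter (I + D^{-1/2} W D^{-1/2}) / 2 to a unit-norm signal
# orthogonal to the lead eigenvector shrinks its norm like (1 - lambda_2/2)^l.
n = 8
W = np.zeros((n, n))
for i in range(n):                          # unweighted ring graph (toy assumption)
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
M = np.eye(n) - 0.5 * L_sym                 # simplified GCN filter, eigenvalues 1 - lambda_i/2

lam, Q = np.linalg.eigh(L_sym)
x = np.random.randn(n)
x -= (x @ Q[:, 0]) * Q[:, 0]                # remove the component along the lead eigenvector
x /= np.linalg.norm(x)

for ell in [1, 5, 10, 20]:
    norm = np.linalg.norm(np.linalg.matrix_power(M, ell) @ x)
    bound = (1.0 - lam[1] / 2.0) ** ell
    print(ell, norm, bound)                 # norm stays below the bound
```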

By contrast, the use of band-pass filters allows the network to incorporate a larger receptive field without essentially projecting the input features onto the bottom eigenspace. Consider, for example, the following result from [22].

Proposition 1 (Proposition 2.2 of [22]).

The wavelet filter bank $\mathcal{W}=\{\boldsymbol{\Psi}_k,\boldsymbol{\Phi}_K\}_{k=0}^{K}$ introduced in Eq. 10 is a non-expansive frame with respect to the weighted norm defined by $\|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}}\coloneqq\sum_{i=1}^{n}|\mathbf{x}[i]|^2 d_{v_i}^{-1}$, whose lower frame bound is a universal constant independent of $K$ and the graph topology. That is, there exists a universal constant $c_1>0$ such that

\[
c_1\|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} \;\leq\; \sum_{k=0}^{K}\|\boldsymbol{\Psi}_k\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} + \|\boldsymbol{\Phi}_K\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} \;\leq\; \|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}}.
\]

The fact that the lower bound $c_1$ does not depend on $K$ is important because the receptive field of an $L$-layer geometric scattering transform is $2^K L$. Therefore, if we choose $K$ to be sufficiently large, the scattering transform as formulated in [22] is able to incorporate long-range interactions in the network while still preserving a substantial portion of the input signal energy. Our construction differs slightly from [22] in that our wavelet filter bank includes only the $\boldsymbol{\Psi}_k$, but not $\boldsymbol{\Phi}_K$. However, one may derive a result similar to Proposition 1, where the lower bound $c_1$ depends on $\lambda_2$ (but still does not depend on $K$). We refer the reader to the proof of Proposition 4.1 of [38] for details. (While [38] does not consider weighted norms, this technicality can be readily handled by imitating the proof of Proposition 2.2 of [22].) Unlike the scattering transform, most AC-GNNs use networks with small receptive fields in order to avoid oversmoothing. While this does help avoid the problem of oversmoothing, it unfortunately creates the problem of underreaching. In the remainder of this section, we will focus on demonstrating that this underreaching diminishes the power of the network to discriminate between different nodes in certain situations.

We will now begin our analysis of node discriminability, i.e., the ability of a network to produce different representations of two nodes $v$ and $w$ in its hidden layers, in the context of the underreaching problem. (Formal definitions of the concepts used below are provided in Appendix A.1.) We will let $\mathcal{N}_{\underline{v}}^{K}$ denote the $K$-step node neighborhood of a node $v$ (including $v$ itself), and for $V_S\subset V$ we will let $G(V_S)$ denote the corresponding induced subgraph. We will say two induced subgraphs $G(V_S)$ and $G(V_{S'})$ are isomorphic if they have identical geometric structure and write $G(V_S)\cong^{\phi}G(V_{S'})$ to indicate that $\phi$ is an isometry (geometry-preserving map) between them. In the definition below, we introduce a class of graph-intrinsic node features that encode graph topology information. We will use these features as the input for GNN models in order to produce geometrically informed node representations.

Definition 2 (Intrinsic Node Features).

A nonzero node feature matrix $\mathbf{X}$ is $K$-intrinsic if for any $v,w\in V$ such that $G\big(\mathcal{N}_{\underline{v}}^{K}\big)$ is isomorphic to $G\big(\mathcal{N}_{\underline{w}}^{K}\big)$, we have $\mathbf{X}[v]=\mathbf{X}[w]$.

These intrinsic node features encode important $K$-local geometric information in the sense that if $\mathbf{X}[v]\neq\mathbf{X}[w]$, then the $K$-step neighborhoods of $v$ and $w$ must have different geometric structure. To further understand this definition, we observe that the degree vector, or any (elementwise) function of it, is a one-intrinsic node feature matrix. Setting $\mathbf{X}[v]$ to the average node degree of the nodes in $\mathcal{N}_{\underline{v}}^{K-1}$ yields $K$-intrinsic node features. As a slightly more complicated example, features with $\mathbf{X}[v]$ equal to the number of triangles contained in $\mathcal{N}_{\underline{v}}^{K}$ are also $K$-intrinsic. Fig. 4 illustrates how such node features can help distinguish nodes from different graph structures.

Figure 4 (three panels, (a)–(c)): Intrinsic node features for two graphs (top/bottom). We compare nodes from the isomorphic horizontal path that is part of both graphs. Color coding indicates different node feature values and differs in (c). Node features are one-intrinsic (degrees) in (a) and (b), and two-intrinsic (average degree in one-step neighborhoods) in (c). The gray area represents the subgraph used for the feature assignment at the indicated node.

As a concrete example, consider the task of predicting traffic on a network of Wikipedia articles, such as the Chameleon dataset [39], with nodes and edges corresponding to articles and hyperlinks. Here, intrinsic node features provide insights into the topological context of each node. A high degree suggests a widely cited article, which is likely to have a lot of traffic, while counting kk-cliques or considering average degrees from nearby nodes can shed light on the density of the local graph region.
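To make these constructions concrete, the following minimal sketch computes the intrinsic features discussed above with networkx; the graph, the choice of K, and the function names are illustrative and are not part of our implementation. It covers the degree (one-intrinsic), the average degree over the (K−1)-step neighborhood (K-intrinsic), and the number of triangles inside the induced K-step neighborhood (K-intrinsic).

```python
import networkx as nx
import numpy as np

def k_step_neighborhood(G, v, K):
    # Nodes within K steps of v, including v itself (the "underlined" neighborhood).
    return set(nx.single_source_shortest_path_length(G, v, cutoff=K))

def one_intrinsic_degree(G):
    # Degrees only depend on the one-step neighborhood, hence are one-intrinsic.
    return {v: G.degree(v) for v in G}

def k_intrinsic_avg_degree(G, K):
    # Average degree over the (K-1)-step neighborhood yields a K-intrinsic feature.
    return {v: np.mean([G.degree(u) for u in k_step_neighborhood(G, v, K - 1)]) for v in G}

def k_intrinsic_triangles(G, K):
    # Number of triangles in the induced K-step neighborhood is also K-intrinsic.
    return {v: sum(nx.triangles(G.subgraph(k_step_neighborhood(G, v, K))).values()) // 3
            for v in G}

# Toy usage on a small synthetic graph (illustrative only).
G = nx.barbell_graph(4, 3)
X1, X2, X3 = one_intrinsic_degree(G), k_intrinsic_avg_degree(G, 2), k_intrinsic_triangles(G, 2)
```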

The following theorem characterizes situations where AC-GNNs cannot discriminate nodes based on the information represented in intrinsic node features. It can be seen as a formal formulation of the underreaching problem [32]. We note that an analogous result can also be established for graph attention networks, e.g., GAT [35] and GATv2 [40].

Theorem 1.

Let ϕ:VV\phi:V\rightarrow V, vVv\in V, and K,LK,L\in\mathbb{N} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}. Let 𝐗0\operatorname{\boldsymbol{X}}^{0} be any KK-intrinsic node feature matrix and let 𝐗L\operatorname{\boldsymbol{X}}^{L} be the output of an LL-layer AC-GNN, i.e., FacLFac1F_{\text{ac}}^{L}\circ\dots\circ F_{\text{ac}}^{1}, with each layer as in Eq. 20. Then, 𝐗L[v]=𝐗L[ϕ(v)]\operatorname{\boldsymbol{X}}^{L}[v]=\operatorname{\boldsymbol{X}}^{L}[\phi(v)], and so vv and ϕ(v)\phi(v) cannot be discriminated based on the filtered node features 𝐗L\operatorname{\boldsymbol{X}}^{L} from an LL-layer AC-GNN.

Theorem 1 is proved by using an induction argument to show that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L-\ell} and all 0L0\leq\ell\leq L. Full details can be found in Appendix A.2.

Next, we introduce a notion of structural difference that allows us to analyze situations where networks with wide receptive fields such as scattering networks have strictly greater discriminatory power than networks with smaller receptive fields.

Definition 3 (Structural Difference).

For two ϕ\phi-isomorphic induced subgraphs G(VS)ϕG(VS)G(V_{S})\cong^{\phi}G(V_{S^{\prime}}), a structural difference relative to ϕ\phi and a node feature matrix 𝐗\operatorname{\boldsymbol{X}} is manifested at uVSu\in V_{S} if 𝐗[u]𝐗[ϕ(u)]\operatorname{\boldsymbol{X}}[u]\neq\operatorname{\boldsymbol{X}}[\phi(u)] or dudϕ(u)d_{u}\neq d_{\phi(u)}. Note that dudϕ(u)d_{u}\neq d_{\phi(u)} is only relevant for nodes in the boundary S\partial S, because dw=dϕ(w)d_{w}=d_{\phi(w)} for all interior nodes wint(S)w\in\operatorname{int}(S), as the degree is one-intrinsic. For uSu\in\partial S, we further assume dϕ(u)𝐗[u]du𝐗[ϕ(u)]d_{\phi(u)}\operatorname{\boldsymbol{X}}[u]\neq d_{u}\operatorname{\boldsymbol{X}}[\phi(u)].

Notably, if the KK-step neighborhoods of two nodes vv and ϕ(v)\phi(v) are ϕ\phi-isomorphic and 𝐗\operatorname{\boldsymbol{X}} is any KK-intrinsic node feature matrix, then no structural difference relative to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} can be manifested at vv. Theorem 2 stated below shows that a scattering network will be able to produce different representations of two nodes vv and ϕ(v)\phi(v) if there are structural differences in their surrounding neighborhoods, assuming (among a few other mild assumptions) that a certain pathological case is avoided. The following notation and definition characterize this pathological situation that arises, roughly speaking, when two sums iai\sum_{i}a_{i} and ibi\sum_{i}b_{i} coincide, even though aia_{i} and bib_{i} do not coincide. We note that although this condition seems complicated, it is highly unlikely to occur in practice.

Notation 1.

Let ϕ:VV\phi:V\rightarrow V, vVv\in V, DD\in\mathbb{N}, such that G(𝒩v¯D)ϕG(𝒩ϕ(v)¯D)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{D}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{D}\right). Assume that there exists at least one node in 𝒩v¯D\operatorname{\mathcal{N}}_{\underline{v}}^{D} where a structural difference is manifested rel. to ϕ\phi and node features 𝐗\operatorname{\boldsymbol{X}}. We define

Vdiff{u𝒩v¯D: struc. difference rel. to ϕ and 𝐗 at u}.V_{\text{diff}}\coloneqq\Big{\{}u\in\operatorname{\mathcal{N}}_{\underline{v}}^{D}:\text{ struc. difference rel. to $\phi$ and $\operatorname{\boldsymbol{X}}$ at }u\Big{\}}.

We also fix the following notation:

  1. (i)

    Let dmin{d(u,v):uVdiff}d\coloneqq\min\{d(u,v):u\in V_{\text{diff}}\} and define the node set Vdiffd{uVdiff:d(u,v)=d}V_{\text{diff}}^{d}\coloneqq\{u\in V_{\text{diff}}:d(u,v)=d\}. Note that these are exactly the nodes in 𝒩vd\operatorname{\mathcal{N}}_{v}^{d} where a structural difference is manifested relative to ϕ\phi and 𝐗0\operatorname{\boldsymbol{X}}^{0}.

  2. (ii)

    Let τ\tau\in\mathbb{N} be the number of paths of minimal length between vv and nodes in VdiffdV_{\text{diff}}^{d}, and denote these by p(i)(u0i,,udi=v)p^{(i)}\coloneqq\left(u_{0}^{i},\dots,u_{d}^{i}=v\right) with u0iVdiffdu_{0}^{i}\in V_{\text{diff}}^{d}, 1iτ1\leq i\leq\tau.

  3. (iii)

    Let Uji=1τ{uji}U_{j}\coloneqq\cup_{i=1}^{\tau}\{u_{j}^{i}\}, and define the generalized path from VdiffdV_{\text{diff}}^{d} to vv as P(U0,U1,,Ud)P\coloneqq(U_{0},U_{1},\dots,U_{d}).

  4. (iv)

    For uU0u\in U_{0}, define δu0𝐗0[u]𝐗0[ϕ(u)]\delta_{u}^{0}\coloneqq\operatorname{\boldsymbol{X}}^{0}[u]-\operatorname{\boldsymbol{X}}^{0}[\phi(u)].
    For uUju\in U_{j}, 1jd1\leq j\leq d and 𝐘j𝐏j𝐗0\operatorname{\boldsymbol{Y}}^{j}\coloneqq\operatorname{\boldsymbol{P}}^{j}\operatorname{\boldsymbol{X}}^{0}, define

    δuj12w𝒩uUj1dw1𝐘wj1dϕ(w)1𝐘ϕ(w)j1\textstyle\delta_{u}^{j}\coloneqq\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j-1}_{w}-d_{\phi(w)}^{-1}\operatorname{\boldsymbol{Y}}^{j-1}_{\phi(w)}

    and set

Δj{uUj:w𝒩uUj1such thatδwj10}.\Delta_{j}\coloneqq\{u\in U_{j}:\,\exists\,w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}\,\text{such that}\,\delta_{w}^{j-1}\neq 0\}.
Definition 4 (No Coincidental Correspondence).

Let 𝒩v¯d\operatorname{\mathcal{N}}_{\underline{v}}^{d} and Δj\Delta_{j} be as in Notation 1. We say that int(𝒩v¯d)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right) exhibits no coincidental correspondence (no-cc) if each non-empty Δj\Delta_{j}, 1jd1\leq j\leq d, contains at least one uu such that δuj0\delta_{u}^{j}\neq 0.

Theorem 2.

As in Theorem 1, let ϕ:VV\phi:V\rightarrow V, vVv\in V, and K,LK,L\in\operatorname{\mathbb{N}} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}, and consider any K-intrinsic node feature matrix 𝐗\operatorname{\boldsymbol{X}}. Further assume that there exists at least one structural difference rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} in 𝒩v¯K+L\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}, and let 1dK+L1\leq d\leq K+L be the smallest positive integer such that a structural difference is manifested in 𝒩v¯d\operatorname{\mathcal{N}}_{\underline{v}}^{d}. If the nonlinearity σ\sigma is strictly monotonic and int(𝒩v¯d)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right) exhibits no coincidental correspondence rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}}, then one can define a scattering configuration (p,𝚯)(p,\operatorname{\boldsymbol{\Theta}}) such that scattering features Fp-scat(𝐗)F_{p\text{-scat}}(\operatorname{\boldsymbol{X}}) defined as in Eq. 13 discriminate vv and ϕ(v)\phi(v).

Theorem 2 provides a large class of situations where scattering filters are able to discriminate between two nodes even if networks with smaller receptive fields are not. In particular, even if the K+LK+L-step neighborhoods of vv and ϕ(v)\phi(v) are isomorphic, it is possible for there to exist a node uu in this K+LK+L-step neighborhood such that a KK-intrinsic node feature takes different values at uu and ϕ(u)\phi(u). In this case, Theorem 2 shows that scattering will produce differing representations of vv and ϕ(v)\phi(v), while Theorem 1 shows that an LL-layer AC-GNN such as GCN will not. In the remarks below, we discuss minor variations of Theorem 2. In Appendix A.5, we also present a modified version of Theorem 2, where the assumption of no coincidental correspondence is replaced by an assumption on the (generalized) path between the structural differences and the investigated node.
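To illustrate this gap concretely, the sketch below contrasts a toy two-layer aggregate-combine update (mean over neighbors, identity combine, no learned weights) with a single scattering channel on the centers of two paths of different lengths. It assumes the lazy random walk matrix P = ½(I + AD⁻¹) (cf. Eq. 8) and the wavelet Ψ₂ = P² − P⁴, and uses squared degrees (an elementwise function of the degree, hence one-intrinsic) as input; the graphs and the toy layer are purely illustrative and are not the models used in our experiments.

```python
import numpy as np
import networkx as nx

def lazy_walk(G):
    # P = 1/2 (I + A D^{-1}), cf. Eq. 8.
    A = nx.to_numpy_array(G)
    return 0.5 * (np.eye(len(G)) + A @ np.diag(1.0 / A.sum(axis=0)))

def toy_ac_gnn(G, X, L=2):
    # Toy aggregate-combine layers: mean over neighbors, averaged with the node itself.
    A = nx.to_numpy_array(G)
    mean_agg = A / A.sum(axis=1, keepdims=True)
    for _ in range(L):
        X = 0.5 * (X + mean_agg @ X)
    return X

def wavelet(P, k):
    # Psi_k = P^{2^{k-1}} - P^{2^k}.
    return np.linalg.matrix_power(P, 2 ** (k - 1)) - np.linalg.matrix_power(P, 2 ** k)

# Two paths whose centers have isomorphic 3-step (K + L = 1 + 2) neighborhoods;
# only the shorter path has its degree-1 endpoints within 4 steps of the center.
G1, G2, v1, v2 = nx.path_graph(9), nx.path_graph(13), 4, 6
# Squared degrees: one-intrinsic input features. (Raw degrees satisfy P d = d,
# so the wavelet response to them would vanish identically; squaring avoids this.)
X1 = np.array([[G1.degree(v) ** 2] for v in G1], dtype=float)
X2 = np.array([[G2.degree(v) ** 2] for v in G2], dtype=float)

print(toy_ac_gnn(G1, X1)[v1], toy_ac_gnn(G2, X2)[v2])   # identical outputs
print((wavelet(lazy_walk(G1), 2) @ X1)[v1],
      (wavelet(lazy_walk(G2), 2) @ X2)[v2])              # differ (nonzero vs. zero)
```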

Remark 1.

The absolute value operation is not monotonic. Therefore, Theorem 2 cannot be directly applied in this case. However, the above result and all subsequent results can be extended to the case σ=||\sigma=|\cdot| as long as certain pathological cases are avoided. Namely, Theorem 2 will remain valid as long as the features at uu are assumed not to be the negatives of the features at ϕ(u)\phi(u), i.e., 𝐗[u]𝐗[ϕ(u)]\operatorname{\boldsymbol{X}}[u]\neq-\operatorname{\boldsymbol{X}}[\phi(u)] for nodes uint(𝒩v¯d)u\in\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right), and similarly we must modify Definition 4 to avoid, for example, the case where w𝒩uUj1dw1𝐗j1[w]=w𝒩uUj1dϕ(w)1𝐗j1[ϕ(w)]\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{w}^{-1}\operatorname{\boldsymbol{X}}^{j-1}[w]=-\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{\phi(w)}^{-1}\operatorname{\boldsymbol{X}}^{j-1}[\phi(w)]. We note that while this condition is complex, it is rarely violated in practice.

Remark 2.

A result analogous to Theorem 2 is also valid for any permutation of the three steps Eq. 14a-14c (even with added bias terms as in Eq. 16). In particular, it applies to the update rule 𝐗=σ(𝐔p(𝐗1)𝚯+𝐁).\operatorname{\boldsymbol{X}}^{\ell}=\sigma\left(\operatorname{\boldsymbol{U}}_{p}\big{(}\operatorname{\boldsymbol{X}}^{\ell-1}\big{)}\operatorname{\boldsymbol{\Theta}}^{\ell}+\operatorname{\boldsymbol{B}}^{\ell}\right).

We will sketch the proof of Theorem 2 here and provide full details in the appendix. The key to the proof will be applying Lemma 1 stated below.

Lemma 1.

Let ϕ:VV\phi:V\rightarrow V, v~V\tilde{v}\in V, and D~\tilde{D}\in\operatorname{\mathbb{N}} such that G(𝒩v¯~D~)ϕG(𝒩ϕ(v~)¯D~)G\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{D}}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(\tilde{v})}}^{\tilde{D}}\right). Assume there exists at least one node in 𝒩v¯~D~\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{D}} where a structural difference is manifested relative to ϕ\phi and to 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and let 1d~D~1\leq\tilde{d}\leq\tilde{D} be the smallest positive integer such that a structural difference is manifested in 𝒩v¯~d~\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}. Assume that int(𝒩v¯~d~)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}\right) exhibits no coincidental correspondence relative to ϕ\phi and 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and let P~(U~0,U~1,,U~d~)\tilde{P}\coloneqq\left(\tilde{U}_{0},\tilde{U}_{1},\dots,\tilde{U}_{\tilde{d}}\right) be the generalized path as defined in Notation 1. Then, for every 0jd~0\leq j\leq\tilde{d}, no structural difference is manifested in 𝒩v¯~d~jU~j\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-j}\setminus\tilde{U}_{j}, relative to ϕ\phi and the filtered node features 𝐘j𝐏j𝐗~\operatorname{\boldsymbol{Y}}^{j}\coloneqq\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and there exists at least one node uU~ju\in\tilde{U}_{j} where a structural difference is manifested and δuj0\delta^{j}_{u}\neq 0.

Below, we provide a sketch of the main ideas of the proof of Lemma 1. A complete proof is provided in Appendix A.3.

Proof Sketch.

We use induction to show that at every step the matrix 𝐏\operatorname{\boldsymbol{P}} propagates the information about the structural difference one node layer closer towards v~\tilde{v}.

By Notation 1(i), we have U~0=Vdiffd~\widetilde{U}_{0}=V_{\text{diff}}^{\tilde{d}}. Therefore, the case j=0j=0 is immediate. In the inductive step, it suffices to show

𝐘j+1[u]𝐘j+1[ϕ(u)]\operatorname{\boldsymbol{Y}}^{j+1}[u]\neq\operatorname{\boldsymbol{Y}}^{j+1}[\phi(u)]

and δuj+10\delta_{u}^{j+1}\neq 0 for at least one uU~j+1u\in\tilde{U}_{j+1} and

𝐘j+1[w]=𝐘j+1[ϕ(w)]\operatorname{\boldsymbol{Y}}^{j+1}[w]=\operatorname{\boldsymbol{Y}}^{j+1}[\phi(w)]

for all w𝒩v¯~d~(j+1)U~j+1w\in\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1} under the assumption that these results are already valid for 𝐘j\operatorname{\boldsymbol{Y}}^{j}. This can be established by writing 𝐏j+1𝐗~=𝐏𝐏j𝐗~=𝐏𝐘j\operatorname{\boldsymbol{P}}^{j+1}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{Y}}^{j} and using the inductive hypothesis together with the definition of 𝐏\operatorname{\boldsymbol{P}}.∎

We now use Lemma 1 to sketch the proof of Theorem 2. The complete proof is provided in Appendix A.4.

Proof Sketch for Theorem 2.

We need to show that we can choose the parameters in a way that guarantees Fp-scat(𝐗)[v]Fp-scat(𝐗)[ϕ(v)]F_{p\text{-scat}}(\operatorname{\boldsymbol{X}})[v]\neq F_{p\text{-scat}}(\operatorname{\boldsymbol{X}})[\phi(v)]. For simplicity, we set 𝚯=𝐈n\operatorname{\boldsymbol{\Theta}}=\operatorname{\boldsymbol{I}}_{n}. In this case, since σ\sigma is strictly monotonic, and therefore injective, it suffices to show that we can construct pp with

𝐔p(𝐗)[v]𝐔p(𝐗)[ϕ(v)].\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[v]\neq\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[\phi(v)]. (21)

Using binary expansion, we may choose k1,,km0k_{1},\ldots,k_{m}\in\mathbb{N}_{0}, ki<ki+1k_{i}<k_{i+1}, such that d=2k1++2kmd=2^{k_{1}}+\ldots+2^{k_{m}} and set p(k1,,km)p\coloneqq(k_{1},\ldots,k_{m}). Given pp, we define truncated paths p:i(k1,,ki)p_{:i}\coloneqq(k_{1},\ldots,k_{i}) and let tij=1i2kjt_{i}\coloneqq\sum_{j=1}^{i}2^{k_{j}} for 1im.1\leq i\leq m. We also let p:0p_{:0} be the empty path of length 0 and set t0=0.t_{0}=0.

Recalling the generalized path P=(U0,,Ud)P=(U_{0},\ldots,U_{d}) defined in Notation 1, we will use induction to show that for at least one node in UtiU_{t_{i}} a structural difference is manifested, while no structural difference manifests in 𝒩v¯dtiUti\mathcal{N}_{\underline{v}}^{d-t_{i}}\setminus U_{t_{i}} rel. to ϕ\phi and 𝐙i𝐔p:i𝐗\operatorname{\boldsymbol{Z}}^{i}\coloneqq\operatorname{\boldsymbol{U}}_{p_{:i}}\operatorname{\boldsymbol{X}} for 0im0\leq i\leq m. Since tm=dt_{m}=d and p:m=pp_{:m}=p, this will imply Eq. 21 and thus prove the theorem. As with Lemma 1, the base case i=0i=0 follows from Notation 1(i). In the inductive step, it suffices to show that

𝐙i+1[u]𝐙i+1[ϕ(u)],\operatorname{\boldsymbol{Z}}^{i+1}[u]\neq\operatorname{\boldsymbol{Z}}^{i+1}[\phi(u)], (22)

for at least one uUti+1u\in U_{t_{i+1}} and

𝐙i+1[w]=𝐙i+1[ϕ(w)],\operatorname{\boldsymbol{Z}}^{i+1}[w]=\operatorname{\boldsymbol{Z}}^{i+1}[\phi(w)], (23)

for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}, under the assumption that these results are true for ii. Since 𝐙i+1=𝚿ki+1σ(𝐙i)\operatorname{\boldsymbol{Z}}^{i+1}=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\big{(}\operatorname{\boldsymbol{Z}}^{i}\big{)} and σ\sigma is increasing and therefore injective, it suffices to show that these results are preserved under multiplication by a wavelet matrix 𝚿ki+1=𝐏2ki+11𝐏2ki+1\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}-\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}. This can be verified using Lemma 1 (with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}) in place of 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}) and the inductive hypothesis. ∎
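As a small illustration of the path construction used in the proof sketch above (a hypothetical helper, not part of our released code), the dyadic scales can be read off from the binary expansion of d:

```python
def scattering_path(d):
    """Return p = (k_1, ..., k_m) with k_1 < ... < k_m and sum(2**k_i) == d."""
    return tuple(k for k in range(d.bit_length()) if (d >> k) & 1)

# Example: d = 11 = 2^0 + 2^1 + 2^3 gives p = (0, 1, 3), with truncated paths
# p_{:1} = (0,), p_{:2} = (0, 1), p_{:3} = (0, 1, 3) and partial sums t_i = 1, 3, 11,
# so that U_p (as in Eq. 13) applies Psi_0, then Psi_1, then Psi_3.
assert scattering_path(11) == (0, 1, 3)
```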

Remark 3.

In the proof of Theorem 2, for the sake of concreteness, we chose 𝚯=𝐈n\operatorname{\boldsymbol{\Theta}}=\operatorname{\boldsymbol{I}}_{n} and pp such that |p|=i=1m2ki=d|p|=\sum_{i=1}^{m}2^{k_{i}}=d. However, inspecting the proof, we see that our analysis may also be extended to paths with |p|d|p|\geq d and also that the choice of 𝚯\boldsymbol{\Theta} does not matter as long as 𝚯\operatorname{\boldsymbol{\Theta}} is invertible. In particular, if the entries of 𝚯\operatorname{\boldsymbol{\Theta}} are generated i.i.d. at random from a continuous probability distribution, the conclusions of the theorem will hold with probability one [41]. We also note that Theorems 1 and 2 may be extended to the case where the update rule features a bias term 𝐁\operatorname{\boldsymbol{B}} as in Eqs. 15 and 16, since 𝐗𝐗+𝐁\operatorname{\boldsymbol{X}}\rightarrow\operatorname{\boldsymbol{X}}+\operatorname{\boldsymbol{B}} is injective.

Results analogous to Theorem 2 can likely also be proven for deep AC-GNNs with dd subsequent layers. The proof would closely follow the idea of Lemma 1 and would likely be slightly simpler than the proof of Theorem 2. Indeed, inspecting the proof of Theorem 2, we see that part of the reason we need both Eq. 22 and Eq. 23 is that the definition of 𝚿k\operatorname{\boldsymbol{\Psi}}_{k} involves subtraction. Therefore, establishing Eq. 23 could be skipped for an AC-GNN. However, we would need considerably stronger assumptions when working with a dd-layer AC-GNN FacdFac1F_{\text{ac}}^{d}\circ\dots\circ F_{\text{ac}}^{1}. Apart from an assumption analogous to no coincidental correspondence, we would additionally need injectivity of all functions σi,comi,gi\sigma_{i},\operatorname{com}_{i},g_{i} that characterize Faci,1idF_{\text{ac}}^{i},1\leq i\leq d, following Definition 1. Moreover, scattering filters are a more practical choice for two major reasons. Firstly, scattering filters nicely balance the trade-off between oversmoothing and underreaching and are able to utilize a broad receptive field while still preserving information contained in higher frequencies, as explained in Proposition 1 and the subsequent discussion. Secondly, scattering filters exhibit wide receptive fields with significantly fewer trained parameters.

6 Empirical Results

We now present our empirical results, starting with semi-supervised node classification. We first show that our Sc-GCN model [33] is able to overcome the oversmoothing and underreaching problems and exhibits superior performance to both a pure GCN network and a pure scattering network. We then show that we can further improve performance, particularly on complex datasets, by using GSAN [34], which incorporates an attention mechanism. Finally, we also show that our models perform well on graph-level tasks.

6.1 Semi-Supervised Node Classification

Scattering GCN. We compare the performance of Sc-GCN to both the original GCN [5] as well as the closely related ChebNet [28] and a network which uses Gaussian random fields to propagate label information [42]. Additionally, we compare against [13], which uses a pure-scattering approach. Further, we also compare Sc-GCN to two methods designed, in part, to address the oversmoothing problem. In [9], this problem is directly addressed by using partially absorbing random walks [43] to slow the mixing of node features in regions of the graph with high connectivity. The GAT graph attention network from [17], on the other hand, addresses this problem indirectly via an attention mechanism which trains an adaptive node-wise weighting of the smoothing operation. Lastly, we compare to an SVM classifier which ignores the graph structure.

In our experiments, we used four popular datasets summarized in Tab. I. For full details on these datasets, please see, e.g., [44] for Citeseer, Cora, and Pubmed and [45] for DBLP. As discussed in [9], the oversmoothing problem is most acute in highly connected networks where information mixes rapidly and nodes quickly become indistinguishable. Therefore, we included networks with different sizes and connectivity structures and order the datasets by connectivity structure (with the most connected on the right). Consistent with our expectation, Tab. II and Fig. 5 show that the advantages of our network are most significant for the networks with high connectivity.

We implemented all of these methods based upon the code made available by the original authors, and, for fair comparison, we tuned and evaluated each of these methods using the standard splits provided for benchmark datasets. Whenever possible, we made sure that our results are consistent with previously reported accuracies. We tuned our method, both the hyperparameters and number of scattering and GCN channels, via a grid search (over the same set for all datasets), and we used the same cross-validation procedure for our method and the competing methods. In Appendix B, we provide further details and also perform an ablation study evaluating the importance of each component in our proposed architecture.

TABLE I: Dataset characteristics: number of nodes, edges, and features; mean ±\pm std. of node degrees; ratio of #edges to #nodes.
Dataset Nodes Edges Features Degrees Edges/Nodes
Citeseer 3,327 4,732 3,703 3.77 ± 3.38 1.42
Cora 2,708 5,429 1,433 4.90 ± 5.22 2.00
Pubmed 19,717 44,338 500 5.50 ± 7.43 2.25
DBLP 17,716 52,867 1,639 6.97 ± 9.35 2.98
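The degree statistics and edge-to-node ratios in Tab. I can be recomputed from the raw graphs along the following lines (a networkx sketch; the dataset loaders and any benchmark-specific preprocessing, e.g., symmetrization, are omitted and may affect the exact numbers):

```python
import numpy as np
import networkx as nx

def connectivity_stats(G):
    # Summary statistics as reported in Tab. I.
    degrees = np.array([d for _, d in G.degree()])
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "degree": f"{degrees.mean():.2f} ± {degrees.std():.2f}",
        "edges_per_node": round(G.number_of_edges() / G.number_of_nodes(), 2),
    }
```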
TABLE II: Classification accuracy (top two marked in bold; best one underlined) of Scattering GCN on four benchmark datasets (ordered by increasing connectivity) compared to four other GNNs [17, 9, 5, 28], a non-GNN approach [42] based on belief propagation, a pure graph scattering baseline [13], and a nongeometric baseline only using node features with linear SVM.
Model Citeseer Cora Pubmed DBLP
Sc-GCN (ours) 71.7 84.2 79.4 81.5
GAT [17] 72.5 83.0 79.0 66.1
Partially absorbing [9] 71.2 81.7 79.2 56.9
GCN [5] 70.3 81.5 79.0 72.0
Chebyshev [28] 69.8 78.1 74.4 57.3
Label Propagation [42] 58.2 77.3 71.0 53.0
Graph scattering [13] 67.5 81.9 69.8 69.4
Node features (SVM) 61.1 58.0 49.9 48.2

We report test classification accuracy in Tab. II, which shows that our approach outperforms other methods on three out of the four considered datasets. Only on Citeseer are we slightly outperformed by GAT. However, this is the dataset with the weakest connectivity structure (see Tab. I), while at the same time having the most informative node features (achieving 61.1% accuracy via linear SVM without considering graph information). In contrast, on DBLP, which exhibits the richest connectivity structure and least informative features (only 48.2% SVM accuracy), we significantly outperform GAT (by over 15%). Notably, GAT itself significantly outperforms all other methods (by 6.8% or more) apart from the graph scattering baseline [13].

Figure 5 (columns: (a) Citeseer, (b) Cora, (c) Pubmed, (d) DBLP): Impact of training set size (top) and training time (bottom) on classification accuracy and error (correspondingly); training size measured relative to the original training size of each dataset; training time and validation error plotted in logarithmic scale; runtime measured for all methods on the same hardware, using original implementations accompanying their publications.

The task of semi-supervised learning is motivated by the fact that in real-world applications one typically only has labels for a small fraction of the nodes. Therefore, we next analyze the performance of our network with limited amounts of training data. Fig. 5 (top) plots the classification accuracy of Sc-GCN, as well as that of the other networks (except the baselines), as the amount of training data varies from 20% to 100% of its original amount. Across all networks, we see that Sc-GCN exhibits the best performance when the training data is sufficiently limited. Moreover, we see that it outperforms GAT, which is the next best-performing network in terms of overall accuracy (see Tab. II), on all datasets whenever the training data is reduced to 60% of its original size or less.

In Fig. 5 (bottom), we plot the evolution of the validation error during the training process. We note that Sc-GCN is able to reach a low validation error significantly faster than GAT, which is the next most competitive method in terms of overall accuracy. Several other methods do achieve low validation errors faster on several datasets, but their final accuracy is much lower than that of Sc-GCN. Moreover, on Pubmed, which is the largest dataset, Sc-GCN achieves a low validation error at about the same rate as GCN and at a significantly faster rate than all other methods.

TABLE III: Dataset characteristics & comparison of node classification test accuracy. Datasets are ordered by increasing homophily.
Dataset Classes Nodes Edges Homophily GCN GAT Sc-GCN GSAN
Texas 5 183 295 0.11 59.5 58.4 60.3 60.5
Chameleon 5 2,277 31,421 0.23 28.2 42.9 51.2 61.2
CoraFull 70 19,793 63,421 0.57 62.2 51.9 62.5 64.5
Wiki-CS 10 11,701 216,123 0.65 77.2 77.7 78.1 78.6
Citeseer 6 3,327 4,676 0.74 70.3 72.5 71.7 71.3
Pubmed 3 19,717 44,327 0.80 79.0 79.0 79.4 79.8
Cora 7 2,708 5,276 0.81 81.5 83.0 84.2 84.0
DBLP 4 17,716 52,867 0.83 72.0 66.1 81.5 84.3

Geometric Scattering Attention Network. Having established the practical utility of the hybrid network Sc-GCN above, we next show that performance may be further improved by using GSAN, which builds upon Sc-GCN by incorporating an attention mechanism. We evaluate performance again for semi-supervised node classification and compare to two popular models (namely GCN [5] and GAT [17]), as well as the original Sc-GCN. We extend our experiments to eight benchmark datasets that we order according to increasing homophily, as shown in Tab. III. The homophily of a graph is quantified by the fraction of edges that connect nodes that share the same label. High homophily indicates a relatively smooth label distribution, while low homophily suggests a more complex labeling pattern. Texas and Chameleon are low-homophily datasets where nodes correspond to webpages and edges to links between them, with classes corresponding to webpage topic or monthly traffic (discretized into five levels), respectively [46]. Wiki-CS consists of nodes that represent computer science articles with the edges representing the hyperlinks [47]. CoraFull is the larger version of the Cora dataset [48]. The remaining four datasets (i.e., Cora, Citeseer, Pubmed, DBLP) are the citation networks that we already studied when evaluating Sc-GCN above. Notably, compared to the added datasets, the latter four exhibit relatively high homophily.

We split the datasets into training, validation, and test sets. The hyperparameters (number of attention heads Γ\Gamma, residual parameter α\alpha, and channel widths) are tuned via grid search using the validation set.

The results in Tab. III show that both Sc-GCN and GSAN outperform GCN and GAT on seven out of eight datasets. The advantages of GSAN over Sc-GCN are particularly notable on the medium-sized and large datasets Chameleon and CoraFull, which exhibit relatively low homophily. The most striking result is for Chameleon. Here, Sc-GCN clearly outperforms GCN and GAT (by at least 8.2%), and GSAN improves performance considerably further (by an additional 10%).

Analyzing the node-wise attention weights (see Sec. 4.2) allows us to understand the importance of different channels for different datasets. We consider here the ratio between attention assigned to band-pass and low-pass channels (over all attention heads). For that, we first sum up attention over low-pass and band-pass channels, respectively, i.e., 𝜶lowγi𝜶low,iγ\operatorname{\boldsymbol{\alpha}}_{\text{low}}\coloneqq\sum_{\gamma}\sum_{i}\boldsymbol{\alpha}_{\text{low},i}^{\gamma} and 𝜶bandγi𝜶band,iγ\operatorname{\boldsymbol{\alpha}}_{\text{band}}\coloneqq\sum_{\gamma}\sum_{i}\boldsymbol{\alpha}_{\text{band},i}^{\gamma}. Then, for each node vv, we calculate the ratio ζv𝜶band[v]/𝜶low[v]\zeta_{v}\coloneqq\operatorname{\boldsymbol{\alpha}}_{\text{band}}[v]/\operatorname{\boldsymbol{\alpha}}_{\text{low}}[v]. In Fig. 6, we present the distributions of {ζv:vV}\{\zeta_{v}:v\in V\} for four datasets.
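The ratios ζ_v are obtained directly from the learned attention weights; the sketch below assumes they are stored as arrays of shape (heads, channels, nodes), which is an illustrative layout rather than the exact one used in our implementation.

```python
import numpy as np

def attention_ratios(alpha_low, alpha_band):
    # alpha_low:  (heads, low-pass channels, n) node-wise attention weights
    # alpha_band: (heads, band-pass channels, n) node-wise attention weights
    a_low = alpha_low.sum(axis=(0, 1))    # total low-pass attention per node
    a_band = alpha_band.sum(axis=(0, 1))  # total band-pass attention per node
    return a_band / a_low                 # zeta_v for every node v

# Histogramming attention_ratios(...) over all nodes yields the curves shown in Fig. 6.
```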

Figure 6: Distribution of attention ratios per node between band-pass (scattering) and low-pass (GCN) channels across all heads for DBLP, Chameleon, Citeseer, and WikiCS.

Examining the distributions, we observe four different dataset profiles. Citeseer and WikiCS exhibit minimal spread of attention ratios. In Citeseer, most of the attention is given to the low-pass channels, while WikiCS shows very balanced channel usage. In contrast, DBLP and Chameleon exhibit large spreads. Although the majority of nodes in both datasets value low-pass filters more, some nodes pay up to five times as much attention to band-pass channels as to low-pass ones (and vice versa). Interestingly, the distribution for Chameleon has two peaks, suggesting two node populations that require different band information, which we interpret to be a driving factor for the improvement of GSAN on Chameleon (see Tab. III).

6.2 Graph Classification and Regression

Graph Classification. We use COLLAB and IMDB-BINARY as benchmark datasets. These are ego networks extracted from scientific collaborations and movie collaborations, respectively [49]. COLLAB contains 5,000 graphs with three classes and IMDB-BINARY contains 1,000 graphs with two classes. We use an 80-10-10 split between training, validation and test sets. The datasets contain graph structures but no associated graph signals. In our work, we compute the eccentricity (for connected graphs), degree, and clustering coefficient of each vertex, and use these as input signals to our network.

We use a one-layer GSAN with 8 heads and 16 hidden units for COLLAB and a one-layer GSAN with 4 heads and 16 hidden units for IMDB-BINARY, followed by one graph residual convolutional layer. We then apply a Set2Set module [50] followed by a multi-layer perceptron (MLP). Set2Set is a graph pooling architecture in which an LSTM-based [51] module is applied iteratively to the graph signal for kk processing steps. During each processing step, node-wise feature signals of variable graph size are integrated using an LSTM-based attention mechanism. In our model, the GSAN-based layer outputs a graph signal 𝐗n×d\operatorname{\boldsymbol{X}}\in\operatorname{\mathbb{R}}^{n\times d} and the Set2Set layer outputs a graph-level feature vector Set2Set(𝐗)2dSet2Set(\operatorname{\boldsymbol{X}})\in\operatorname{\mathbb{R}}^{2d} for each graph. After the Set2Set layer, we apply an MLP layer 2d\operatorname{\mathbb{R}}^{2d}\rightarrow\operatorname{\mathbb{R}} on the graph-level signal for the classification task. We use a Set2Set layer with 3 processing steps and a two-layer MLP for graph pooling. We also provide a pure-scattering baseline (Sc-only), based on the approach proposed in [13] together with an SVM classifier. Our results in Tab. IV show that both proposed hybrid scattering networks improve over the compared baselines.
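The following sketch illustrates the graph-level readout described above; the GSAN and residual-convolution layers are abstracted as any module that produces node features of width d, Set2Set is taken from PyTorch Geometric, and the widths and class counts are illustrative rather than the exact configuration used in our experiments.

```python
from torch import nn
from torch_geometric.nn import Set2Set

class GraphClassificationReadout(nn.Module):
    """Node features -> Set2Set pooling (3 processing steps) -> two-layer MLP."""
    def __init__(self, d, num_classes, processing_steps=3):
        super().__init__()
        self.pool = Set2Set(d, processing_steps=processing_steps)  # outputs 2*d per graph
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, num_classes))

    def forward(self, x, batch):
        # x: (n, d) node representations from the GSAN + residual convolution layers;
        # batch: (n,) graph membership of each node in the mini-batch.
        return self.mlp(self.pool(x, batch))
```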

TABLE IV: Comparison of graph classification test accuracy (higher is better).
Dataset GCN GAT Sc-only Sc-GCN GSAN
COLLAB 0.592 0.523 0.640 0.690 0.704
IMDB-BINARY 0.710 0.632 0.710 0.740 0.760
TABLE V: Comparison of graph regression test error (lower is better).
Dataset GCN GAT Sc-only Sc-GCN GSAN
ZINC 0.469 0.463 0.51 0.452 0.430
Lipophilicity 1.05 0.950 1.19 1.03 1.02

Graph Regression. We use ZINC and Lipophilicity as benchmark datasets [52, 53]. For the ZINC dataset, the target is the penalized water-octanol partition coefficient of molecules. For the Lipophilicity dataset, the task is predicting the experimental octanol/water distribution coefficient of different molecules. ZINC contains 1,000 graphs and Lipophilicity contains 4,200 graphs. Graphs in ZINC have 75 input features and graphs in Lipophilicity have 9. We use an 80-10-10 split between training, validation and test sets.

We again use a GSAN-based approach, this time consisting of two geometric scattering attention layers followed by one graph residual convolution layer. For ZINC, we use a 32-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 128 hidden units as the second layer. For Lipophilicity, we use a 4-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 96 hidden units as the second layer. The aggregation module applies an MLP to the output node representations, producing a scalar per node, followed by average pooling over each molecule. Again, we provide a scattering-only baseline (Sc-only), using a modified version of GSAN with all low-pass channels removed, so that the network utilizes only scattering channels. Our results are presented in Tab. V, demonstrating the utility of our GSAN approach for this task as well.
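For completeness, a sketch of this regression readout (again assuming node features of width d and using PyTorch Geometric's global_mean_pool; the exact widths are illustrative):

```python
from torch import nn
from torch_geometric.nn import global_mean_pool

class GraphRegressionReadout(nn.Module):
    """Node-wise MLP -> one scalar per node -> mean pooling over each molecule."""
    def __init__(self, d):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x, batch):
        # x: (n, d) node representations; batch: (n,) molecule index of each node.
        return global_mean_pool(self.node_mlp(x), batch).squeeze(-1)
```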

7 Conclusion

Here, we studied GNN approaches for semi-supervised node classification and investigated some of the major limitations of today’s architectures. Many popular models are known to essentially rely on the low-pass filtering of graph signals. Based on this observation, we proposed a novel hybrid GNN framework that can leverage higher-frequency information not captured by traditional GCN models. Our construction is based on geometric scattering, a concept that was previously used mainly for graph classification, which we adapted to the node level. Our theoretical study suggests that scattering filters nicely balance the trade-off between oversmoothing and underreaching. In particular, we theoretically established that the resulting scattering filters are more sensitive to the graph topology than a large class of GNN architectures, including GCN, when used in conjunction with node features that encode graph structure information. Empirically, we first evaluated our Sc-GCN model and demonstrated its efficacy in alleviating oversmoothing. We then evaluated our GSAN model, which further improves performance, particularly in more complex (low-homophily) settings, via an attention framework. We also provided evidence that the proposed hybrid scattering networks perform well on graph-level tasks, both in classification and regression. In future work, one might further explore the potential of features that carry graph structure information empirically, and analyze more datasets in the context of graph-level tasks.

Acknowledgments

The authors would like to thank Dongmian Zou for fruitful discussions. This work was partially funded by Fin-ML CREATE graduate studies scholarship for PhD [F.W.]; IVADO (Institut de valorisation des données) grant PRF-2019-3583139727, FRQNT (Fonds de recherche du Québec - Nature et technologies) grant 299376, Canada CIFAR AI Chair [G.W.]; NSF (National Science Foundation) grant DMS-1845856 [M.H.]; and NIH (National Institutes of Health) grant NIGMS-R01GM135929 [M.H.,G.W.]. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

References

  • [1] M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
  • [2] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv:2104.13478, 2021.
  • [3] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning.   PMLR, 2017, pp. 1263–1272.
  • [4] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
  • [5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2016.
  • [6] A. M. Fout, “Protein interface prediction using graph convolutional networks,” Ph.D. dissertation, Colorado St. Univ., 2017.
  • [7] N. De Cao and T. Kipf, “Molgan: An implicit generative model for small molecular graphs,” arXiv:1805.11973, 2018.
  • [8] B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor, “Spectral multigraph networks for discovering and fusing relationships in molecules,” arXiv:1811.09595, 2018.
  • [9] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).   Association for the Advancement of Artificial Intelligence, 2018, pp. 3538–3545.
  • [10] H. Nt and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv:1905.09550, 2019.
  • [11] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” in International Conference on Learning Representations, 2019.
  • [12] F. Gao, G. Wolf, and M. Hirn, “Geometric scattering for graph data analysis,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2122–2131.
  • [13] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Applied and Computational Harmonic Analysis, vol. 49, no. 3, pp. 1046–1074, 2020.
  • [14] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
  • [15] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
  • [16] R. R. Coifman and M. Maggioni, “Diffusion wavelets,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 53–94, 2006.
  • [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [18] R. Liao, Z. Zhao, R. Urtasun, and R. Zemel, “Lanczosnet: Multi-scale deep graph convolutional networks,” in the 7th International Conference on Learning Representations (ICLR), 2019.
  • [19] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan, “Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing,” in international conference on machine learning.   PMLR, 2019, pp. 21–29.
  • [20] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” arXiv:1904.07785, 2019.
  • [21] F. Gama, A. Ribeiro, and J. Bruna, “Stability of graph scattering transforms,” in NeurIPS, 2019, pp. 8038–8048.
  • [22] M. Perlmutter, F. Gao, G. Wolf, and M. Hirn, “Understanding graph neural networks with asymmetric geometric scattering transforms,” arXiv:1911.06253, 2019.
  • [23] D. Zou and G. Lerman, “Encoding robust representation for graph generation,” in 2019 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2019, pp. 1–9.
  • [24] E. Castro, A. Benz, A. Tong, G. Wolf, and S. Krishnaswamy, “Uncovering the folding landscape of rna secondary structure using deep graph embeddings,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 4519–4528.
  • [25] D. Bhaskar, J. D. Grady, M. A. Perlmutter, and S. Krishnaswamy, “Molecular graph generation via geometric scattering,” arXiv:2110.06241, 2021.
  • [26] A. Tong, F. Wenkel, K. MacDonald, S. Krishnaswamy, and G. Wolf, “Data-driven learning of geometric scattering networks,” arXiv:2010.02415, 2020.
  • [27] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • [28] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 3844–3852.
  • [29] R. Levie, W. Huang, L. Bucci, M. Bronstein, and G. Kutyniok, “Transferability of spectral graph convolutional neural networks,” Journal of Machine Learning Research, vol. 22, no. 272, pp. 1–59, 2021.
  • [30] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using Lanczos method,” 2015, arXiv:1509.04537.
  • [31] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.
  • [32] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva, “The logical expressiveness of graph neural networks,” in International Conference on Learning Representations, 2020.
  • [33] Y. Min, F. Wenkel, and G. Wolf, “Scattering gcn: Overcoming oversmoothness in graph convolutional networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [34] ——, “Geometric scattering attention networks,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 8518–8522.
  • [35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in the 6th International Conference on Learning Representations (ICLR), 2018.
  • [36] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2018.
  • [37] H. Wang and J. Leskovec, “Unifying graph convolutional neural networks and label propagation,” arXiv:2002.06755, 2020.
  • [38] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” arXiv:1806.08829, 2018.
  • [39] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” Journal of Complex Networks, vol. 9, no. 2, p. cnab014, 2021.
  • [40] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?” in International Conference on Learning Representations, 2021.
  • [41] M. Rudelson, “Invertibility of random matrices: Norm of the inverse,” Annals of Mathematics, 2008.
  • [42] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp. 912–919.
  • [43] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S. fu Chang, “Learning with partially absorbing random walks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 3077–3085.
  • [44] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in International conference on machine learning.   PMLR, 2016, pp. 40–48.
  • [45] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” Network, vol. 11, no. 9, p. 12, 2016.
  • [46] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” arXiv:2002.05287, 2020.
  • [47] P. Mernyei and C. Cangea, “Wiki-cs: A wikipedia-based benchmark for graph neural networks,” arXiv:2007.02901, 2020.
  • [48] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in International Conference on Learning Representations, 2018, pp. 1–13.
  • [49] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [50] O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” arXiv:1511.06391, 2015.
  • [51] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [52] J. J. Irwin and B. K. Shoichet, “Zinc- a free database of commercially available compounds for virtual screening,” Journal of chemical information and modeling, vol. 45, no. 1, pp. 177–182, 2005.
  • [53] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science, vol. 9, no. 2, pp. 513–530, 2018.
  • [54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
[Uncaptioned image] Frederik Wenkel received the B.Sc. degree in Mathematics and the M.Sc. degree in Mathematics at Technical University of Munich, in 2019. He is currently a Ph.D. candidate in Applied Mathematics at Université de Montréal (UdeM) and Mila (the Quebec AI institute), working on geometric deep learning. In particular, he is interested in graph neural networks and their applications in domains such as social networks, bio-chemistry and finance.
[Uncaptioned image] Yimeng Min is a Ph.D. student in Computer Science at Cornell University. He received his B.Sc. degree in Physics from Nanjing University and M.A.Sc. degree in Electrical and Computer Engineering from U. of Toronto. His research area is Artificial Intelligence with a focus on graph neural networks and constraint reasoning.
[Uncaptioned image] Matthew Hirn is an Associate Professor in the Dept. of CMSE and the Dept. of Mathematics at Michigan State University. He is the leader of the ComplEx Data Analysis Research (CEDAR) team, which develops new tools in computational harmonic analysis, machine learning, and data science for the analysis of complex, high dimensional data. Hirn received his B.A. in Mathematics from Cornell University and his Ph.D. in Mathematics from the University of Maryland, College Park. Before arriving at MSU, he held postdoctoral appointments in the Applied Math Program at Yale University and in the Department of Computer Science at École Normale Supérieure, Paris. He is the recipient of the Alfred P. Sloan Fellowship (2016), the DARPA Young Faculty Award (2016), the DARPA Director’s Fellowship (2018), and the NSF CAREER award (2019), and was designated a Kavli Fellow by the National Academy of Sciences (2017).
[Uncaptioned image] Michael Perlmutter is a Hedrick Assistant Adjunct Professor in the Dept. of Mathematics at the UCLA. He has also held postdoctoral positions in the Dept. of CMSE at Michigan State University and in the Dept. of Statistics and Operations Research at the University of North Carolina at Chapel Hill. He earned his Ph.D. in Mathematics from Purdue University in 2016. His research uses the methods of applied probability and harmonic analysis to develop and analyze methods for data sets with geometric structure.
[Uncaptioned image] Guy Wolf is an associate professor in the Dept. of Mathematics and Statistics (DMS) at the Université de Montréal (UdeM), a core academic member of Mila (the Quebec AI institute), and holds a Canada CIFAR AI Chair. He is also affiliated with the CRM center of mathematical sciences and the IVADO institute of data valorization. He holds an M.Sc. and a Ph.D. in computer science from Tel Aviv University. Previously, he was a postdoctoral researcher (2013-2015) in the Dept. of Computer Science at École Normale Supérieure in Paris (France), and a Gibbs Assistant Professor (2015-2018) in the Applied Math. Program at Yale University. His research focuses on manifold learning and geometric deep learning for exploratory data analysis, including methods for dimensionality reduction, visualization, denoising, data augmentation, and coarse graining. Also, he is particularly interested in biomedical data exploration applications of such methods, e.g., in single cell genomics/proteomics and neuroscience.

Appendix A

A.1 Supplemental Definitions for Sec. 5

Here, we provide additional definitions and notations.

Definition 5 (Node Discriminability).

Let 𝐗\operatorname{\boldsymbol{X}} be a node feature matrix, and let 𝐘=F(𝐗)\operatorname{\boldsymbol{Y}}=F(\operatorname{\boldsymbol{X}}) for some graph filter F:n×dXn×dYF:\mathbb{R}^{n\times d_{X}}\rightarrow\mathbb{R}^{n\times d_{Y}}. We say that FF discriminates two nodes v,wVv,w\in V based on 𝐗\operatorname{\boldsymbol{X}} if 𝐘[v]𝐘[w]\operatorname{\boldsymbol{Y}}[v]\neq\operatorname{\boldsymbol{Y}}[w].

Definition 6 (K-step Neighborhood).

For K0K\in\mathbb{N}_{0} (we use the conventions {1,2,3,}\operatorname{\mathbb{N}}\coloneqq\{1,2,3,\dots\}, 0{0}\operatorname{\mathbb{N}}_{0}\coloneqq\operatorname{\mathbb{N}}\cup\{0\} and k{n:nk}\operatorname{\mathbb{N}}_{\geq k}\coloneqq\{n\in\operatorname{\mathbb{N}}:n\geq k\} for kk\in\operatorname{\mathbb{N}}), we define the KK-step neighborhood of vv as 𝒩vK{uV:1d(u,v)K}\operatorname{\mathcal{N}}_{v}^{K}\coloneqq\{u\in V:1\leq d(u,v)\leq K\} (in particular, 𝒩v0=\operatorname{\mathcal{N}}_{v}^{0}=\emptyset and 𝒩v1=𝒩v\operatorname{\mathcal{N}}_{v}^{1}=\operatorname{\mathcal{N}}_{v}). We further write 𝒩v¯K𝒩vK{v}\operatorname{\mathcal{N}}_{\underline{v}}^{K}\coloneqq\operatorname{\mathcal{N}}_{v}^{K}\cup\{v\} (in particular, 𝒩v¯0={v}\operatorname{\mathcal{N}}_{\underline{v}}^{0}=\{v\}).

Definition 7 (Induced Subgraph).

Given a set VSVV_{S}\subseteq V, we refer to a subgraph S=(VS,ES)S=(V_{S},E_{S}) of GG as an induced subgraph, if ESEE_{S}\subset E is the set of all edges that connect nodes from VSV_{S}, i.e., ES=E(VS){eE:e={v,w}VS×VS}E_{S}=E(V_{S})\coloneqq\{e\in E:e=\{v,w\}\in V_{S}\times V_{S}\}. In this case, we will write SG(VS)S\coloneqq G(V_{S}).

Definition 8 (Graph Isomorphism).

A graph isomorphism between graphs G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}) is a bijection ϕ:VV\phi:V\rightarrow V^{\prime} such that {v,w}E\{v,w\}\in E if and only if {ϕ(v),ϕ(w)}E\{\phi(v),\phi(w)\}\in E^{\prime}. We say that GG and GG^{\prime} are ϕ\phi-isomorphic and write GϕGG\cong^{\phi}G^{\prime}. We also define ϕ(A){ϕ(w):wA}\phi(A)\coloneqq\{\phi(w):w\in A\} for AVA\subset V.

Definition 9 (Boundary and Interior of Induced Subgraph).

Let S=(VS,ES)S=(V_{S},E_{S}) be an induced subgraph of G=(V,E)G=(V,E). We define the boundary of SS as S{wVS:d(w,VVS)=1}.\partial S\coloneqq\{w\in V_{S}:d(w,V\setminus V_{S})=1\}. Further, we define the interior of SS as int(S)VSSVS\operatorname{int}(S)\coloneqq V_{S}\setminus\partial S\subset V_{S}. When convenient, we will also use the notation VSS\partial V_{S}\coloneqq\partial S and int(VS)int(S)\operatorname{int}(V_{S})\coloneqq\operatorname{int}(S).

Definition 10 (Spatial Support of Graph Filter).

We say that a graph filter FF has spatial support L0L\in\operatorname{\mathbb{N}}_{0} if the value of F(𝐗)[v]F(\operatorname{\boldsymbol{X}})[v] depends only on values of 𝐗\operatorname{\boldsymbol{X}} at nodes within LL steps of vv, i.e., if for any vVv\in V and any node feature matrices 𝐗1\operatorname{\boldsymbol{X}}_{1} and 𝐗2\operatorname{\boldsymbol{X}}_{2}, we have F(𝐗1)[v]=F(𝐗2)[v]F(\operatorname{\boldsymbol{X}}_{1})[v]=F(\operatorname{\boldsymbol{X}}_{2})[v] whenever 𝐗1[w]=𝐗2[w]\operatorname{\boldsymbol{X}}_{1}[w]=\operatorname{\boldsymbol{X}}_{2}[w] for all w𝒩v¯Lw\in\mathcal{N}_{\underline{v}}^{L}.
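The objects in Definitions 6, 7, and 9 can be computed directly; the sketch below uses networkx with illustrative function names.

```python
import networkx as nx

def k_step_neighborhood(G, v, K, include_v=True):
    # N_v^K, optionally together with v itself (the underlined variant).
    nodes = set(nx.single_source_shortest_path_length(G, v, cutoff=K))
    return nodes if include_v else nodes - {v}

def induced_subgraph(G, V_S):
    # G(V_S): all edges of G whose endpoints both lie in V_S.
    return G.subgraph(V_S)

def boundary(G, V_S):
    # Nodes of V_S with at least one neighbor outside V_S.
    return {w for w in V_S if any(u not in V_S for u in G.neighbors(w))}

def interior(G, V_S):
    return set(V_S) - boundary(G, V_S)
```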

A.2 Proof of Theorem 1

We use induction to show that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L-\ell} and all 0L0\leq\ell\leq L.

=0\ell=0:   Let u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L}. By definition, we have d(u,v)L,d(u,v)\leq L, and so the triangle inequality implies that 𝒩u¯K𝒩v¯K+L.\operatorname{\mathcal{N}}_{\underline{u}}^{K}\subseteq\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}. This implies that G(𝒩u¯K)ϕG(𝒩ϕ(u)¯K)G\left(\operatorname{\mathcal{N}}_{\underline{u}}^{K}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(u)}}^{K}\right) and therefore the assumption that 𝐗0\operatorname{\boldsymbol{X}}^{0} is KK-intrinsic implies that 𝐗u0=𝐗ϕ(u)0\operatorname{\boldsymbol{X}}^{0}_{u}=\operatorname{\boldsymbol{X}}^{0}_{\phi(u)}.

+1\ell\rightarrow\ell+1:   Now assume that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\mathcal{N}_{\underline{v}}^{L-\ell}, and let w𝒩v¯L(+1)w\in\mathcal{N}_{\underline{v}}^{L-(\ell+1)}. Since 𝒩v¯L(+1)𝒩v¯L\mathcal{N}_{\underline{v}}^{L-(\ell+1)}\subseteq\mathcal{N}_{\underline{v}}^{L-\ell}, we have 𝐗w=𝐗ϕ(w)\operatorname{\boldsymbol{X}}^{\ell}_{w}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(w)}. Moreover, since the degree is 1-intrinsic, we have dw=dϕ(w),d_{w}=d_{\phi(w)}, and therefore f(𝐗w,dw)=f(𝐗ϕ(w),dϕ(w))f(\operatorname{\boldsymbol{X}}^{\ell}_{w},d_{w})=f(\operatorname{\boldsymbol{X}}^{\ell}_{\phi(w)},d_{\phi(w)}). Since 𝒩w¯𝒩v¯L,\mathcal{N}_{\underline{w}}\subseteq\mathcal{N}_{\underline{v}}^{L}, we have G(𝒩w¯)ϕG(𝒩ϕ(w)¯)G(\mathcal{N}_{\underline{w}})\cong^{\phi}G(\mathcal{N}_{\underline{\phi(w)}}). Therefore, since agg is a function defined on a multi-set, the result will follow from Eq. 20 once we show that g(𝐗u,dw,du)=g(𝐗ϕ(u),dϕ(w),dϕ(u))g(\operatorname{\boldsymbol{X}}_{u}^{\ell},d_{w},d_{u})=g(\operatorname{\boldsymbol{X}}_{\phi(u)}^{\ell},d_{\phi(w)},d_{\phi(u)}) for all u𝒩wu\in\mathcal{N}_{w}. Towards this end, we note that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}_{u}^{\ell}=\operatorname{\boldsymbol{X}}_{\phi(u)}^{\ell} for all u𝒩wu\in\mathcal{N}_{w} by the inductive assumption and the fact that 𝒩w¯𝒩v¯L\mathcal{N}_{\underline{w}}\subseteq\mathcal{N}_{\underline{v}}^{L-\ell}. Similarly, since the degree is 1-intrinsic, we have dw=dϕ(w)d_{w}=d_{\phi(w)} and du=dϕ(u)d_{u}=d_{\phi(u)}, and thus the proof is complete. ∎

A.3 Proof of Lemma 1

We will proceed by induction over jj.

j=0j=0:   Using the notation established in Notation 1, we have that U~0=Vdiffd~\widetilde{U}_{0}=V_{\text{diff}}^{\tilde{d}}, and as noted in (i), this is exactly the set of nodes in 𝒩v~d~\mathcal{N}_{\tilde{v}}^{\tilde{d}} where a structural difference manifests rel. to ϕ\phi and 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}. Also note that δu00\delta_{u}^{0}\neq 0 for uU~0u\in\tilde{U}_{0} according to (iv) of the notation. Thus, the lemma holds for j=0j=0.

jj+1j\rightarrow j+1:   Assume Lemma 1 to hold for some 0j<d~0\leq j<\tilde{d}. Thus, by the definition of structural difference, and as dw=dϕ(w)d_{w}=d_{\phi(w)} for wint(𝒩v¯~d~)w\in\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}\right), our claim is equivalent to showing

𝐘j+1[u]𝐘j+1[ϕ(u)],\operatorname{\boldsymbol{Y}}^{j+1}[u]\neq\operatorname{\boldsymbol{Y}}^{j+1}[\phi(u)], (24)

and δuj+10\delta_{u}^{j+1}\neq 0 for at least one uU~j+1u\in\tilde{U}_{j+1} and

𝐘j+1[w]=𝐘j+1[ϕ(w)],\operatorname{\boldsymbol{Y}}^{j+1}[w]=\operatorname{\boldsymbol{Y}}^{j+1}[\phi(w)], (25)

for all w𝒩v¯~d~(j+1)U~j+1w\in\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1}.

Since 𝐏j+1𝐗~=𝐏𝐏j𝐗~=𝐏𝐘j,\operatorname{\boldsymbol{P}}^{j+1}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{Y}}^{j}, we may use Eq. 8 to see that Eq. 24 is equivalent to

12𝐘j[u]+12w𝒩udw1𝐘j[w]\displaystyle\frac{1}{2}\operatorname{\boldsymbol{Y}}^{j}[u]+\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{u}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w]
\displaystyle\neq 12𝐘j[ϕ(u)]+12w𝒩ϕ(u)dw1𝐘j[w]\displaystyle\frac{1}{2}\operatorname{\boldsymbol{Y}}^{j}[\phi(u)]+\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{\phi(u)}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w] (26)

for at least one uU~j+1𝒩v¯~d~ju\in\widetilde{U}_{j+1}\subseteq\mathcal{N}_{\tilde{\underline{v}}}^{\tilde{d}-j}. One may check that U~j+1U~j=\widetilde{U}_{j+1}\cap\widetilde{U}_{j}=\emptyset. Therefore, inductively applying Eq. 25 (with jj in place of j+1j+1) yields that 𝐘j[u]=𝐘j[ϕ(u)]\operatorname{\boldsymbol{Y}}^{j}[u]=\operatorname{\boldsymbol{Y}}^{j}[\phi(u)]. Additionally, the inductive assumption implies that 𝐘j[w]=𝐘j[ϕ(w)]\operatorname{\boldsymbol{Y}}^{j}[w]=\operatorname{\boldsymbol{Y}}^{j}[\phi(w)] for all w𝒩uU~jw\in\operatorname{\mathcal{N}}_{u}\setminus\tilde{U}_{j}. Therefore, Eq. 26 is equivalent to

w𝒩uU~jdw1𝐘j[w]\displaystyle\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap\tilde{U}_{j}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w] w𝒩ϕ(u)ϕ(U~j)dw1𝐘j[w].\displaystyle\neq\sum_{w\in\operatorname{\mathcal{N}}_{\phi(u)}\cap\phi(\tilde{U}_{j})}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w]. (27)

This indeed holds for at least one uU~j+1u\in\tilde{U}_{j+1} as Eq. 27 is equivalent to

$0\neq\sum_{w\in\mathcal{N}_{u}\cap\tilde{U}_{j}}\left(d_{w}^{-1}\boldsymbol{Y}^{j}[w]-d_{\phi(w)}^{-1}\boldsymbol{Y}^{j}[\phi(w)]\right)=\delta_{u}^{j+1},$ (28)

and, because $\operatorname{int}\left(\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}}\right)$ exhibits no coincidental correspondence (no-cc) while $\Delta_{j+1}\neq\emptyset$ according to the inductive hypothesis, $\delta_{u}^{j+1}$ is indeed nonzero for at least one $u\in\tilde{U}_{j+1}$. This completes the proof of Eq. 24.

Similarly Eq. 25 is equivalent to showing

$\frac{1}{2}\boldsymbol{Y}^{j}[w]+\frac{1}{2}\sum_{\tilde{w}\in\mathcal{N}_{w}}d_{\tilde{w}}^{-1}\boldsymbol{Y}^{j}[\tilde{w}]=\frac{1}{2}\boldsymbol{Y}^{j}[\phi(w)]+\frac{1}{2}\sum_{\tilde{w}\in\mathcal{N}_{\phi(w)}}d_{\tilde{w}}^{-1}\boldsymbol{Y}^{j}[\tilde{w}],$ (29)

for all $w\in\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1}$. We first note that $\mathcal{N}_{\underline{w}}\subset\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}-j}$, and that $w\not\in\tilde{U}_{j}$, since otherwise we would have $d(\tilde{v},w)=\tilde{d}-j>\tilde{d}-(j+1)$. Therefore, the inductive hypothesis implies $\boldsymbol{Y}^{j}[w]=\boldsymbol{Y}^{j}[\phi(w)]$. Moreover, no $\tilde{w}\in\mathcal{N}_{w}$ can be an element of $\tilde{U}_{j}$, since otherwise $w$ would be an element of $\tilde{U}_{j+1}$, which is a contradiction. Therefore, the inductive assumption implies $\boldsymbol{Y}^{j}[\tilde{w}]=\boldsymbol{Y}^{j}[\phi(\tilde{w})]$ for all $\tilde{w}\in\mathcal{N}_{w}$. We have already noted that $d_{w}=d_{\phi(w)}$ for all $w\in\operatorname{int}\left(\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}}\right)$. Thus, Eq. 29 holds and the proof is complete. ∎

A.4 Proof of Theorem 2

We need to show that we can choose the parameters in a way that guarantees $F_{p\text{-sct}}(\boldsymbol{X})[v]\neq F_{p\text{-sct}}(\boldsymbol{X})[\phi(v)]$. For simplicity, we set $\boldsymbol{\Theta}=\boldsymbol{I}_{n}$. In this case, since $\sigma$ is strictly monotonic, and therefore injective, it suffices to show that we can construct $p$ such that

𝐔p(𝐗)[v]𝐔p(𝐗)[ϕ(v)].\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[v]\neq\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[\phi(v)]. (30)

Using binary expansion, we may choose $k_{1},\ldots,k_{m}\in\mathbb{N}_{0}$, $k_{i}<k_{i+1}$, such that $d=2^{k_{1}}+\ldots+2^{k_{m}}$. We will show that Eq. 30 holds for $p\coloneqq(k_{1},\ldots,k_{m})$. For $1\leq i\leq m$, let $p_{:i}\coloneqq(k_{1},\ldots,k_{i})$ and let $t_{i}=\sum_{j=1}^{i}2^{k_{j}}$, where $p_{:0}$ denotes the empty path of length 0 and $t_{0}=0$.
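To make this choice of scales concrete, the following minimal Python sketch reads the scales $k_{1}<\ldots<k_{m}$ off the binary expansion of $d$; the helper name dyadic_path is ours and is introduced only for illustration.

def dyadic_path(d):
    # Scales k_1 < ... < k_m with d = 2^{k_1} + ... + 2^{k_m},
    # read off from the binary expansion of d.
    assert d >= 1
    return [k for k in range(d.bit_length()) if (d >> k) & 1]

# Example: d = 11 = 2^0 + 2^1 + 2^3, so p = (0, 1, 3).
print(dyadic_path(11))  # [0, 1, 3]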

Recall the generalized path $P=(U_{0},\ldots,U_{d})$ defined in Notation 1 (iii). We will use induction to show that, for $0\leq i\leq m$, there exists at least one node in $U_{t_{i}}$ where a structural difference is manifested, while no structural difference manifests in $\mathcal{N}_{\underline{v}}^{d-t_{i}}\setminus U_{t_{i}}$, rel. to $\phi$ and $\boldsymbol{Z}^{i}\coloneqq\boldsymbol{U}_{p_{:i}}\boldsymbol{X}$. Since $t_{m}=d$ and $p_{:m}=p$, this claim will imply Eq. 30 and thus prove the theorem.

i=0i=0:   Analogous to the proof of Lemma 1, using the notation established in Notation 1 (ii), we have that U0=VdiffdU_{0}=V_{\text{diff}}^{d}, which are exactly the nodes in 𝒩vd\mathcal{N}_{v}^{d} where a structural difference manifests. Thus the claim holds for i=0i=0.

ii+1i\rightarrow i+1:   We now assume the result holds for ii. Since the degree is one-intrinsic and ti+11t_{i+1}\geq 1, we have that dv~=dϕ(v~)d_{\tilde{v}}=d_{\phi(\tilde{v})} for all v~𝒩v¯dti+1\tilde{v}\in\mathcal{N}_{\underline{v}}^{d-t_{i+1}}. Therefore, the claim is equivalent to showing

𝐙i+1[u]𝐙i+1[ϕ(u)],\operatorname{\boldsymbol{Z}}^{i+1}[u]\neq\operatorname{\boldsymbol{Z}}^{i+1}[\phi(u)], (31)

for at least one uUti+1u\in U_{t_{i+1}} and

𝐙i+1[w]=𝐙i+1[ϕ(w)],\operatorname{\boldsymbol{Z}}^{i+1}[w]=\operatorname{\boldsymbol{Z}}^{i+1}[\phi(w)], (32)

for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}.

By the inductive assumption, Eq. 31 and Eq. 32 hold with ii in place of i+1i+1 and since σ\sigma is injective, they continue to hold when 𝐙i\operatorname{\boldsymbol{Z}}^{i} is replaced with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}). Therefore, we may apply Lemma 1 with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}) in place of 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and j=2ki+1j=2^{k_{i+1}}. By the inductive assumption, we have that d~\tilde{d} as defined in Lemma 1 is given by d~=dti\tilde{d}=d-t_{i}. Therefore, d~j=dti2ki+1=dti+1\tilde{d}-j=d-t_{i}-2^{k_{i+1}}=d-t_{i+1} and U~j=Uti+1\tilde{U}_{j}=U_{t_{i+1}}. Therefore, there exists at least one uUti+1u\in U_{t_{i+1}} where a structural difference is manifested, while for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}, no structural difference is manifested rel. to ϕ\phi and 𝐏2ki+1σ(𝐙i)\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right), i.e.,

𝐏2ki+1σ(𝐙i)[u]𝐏2ki+1σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]\neq\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)] (33)

and

𝐏2ki+1σ(𝐙i)[w]=𝐏2ki+1σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)]. (34)

Next, we again apply Lemma 1 with $\sigma(\boldsymbol{Z}^{i})$ in place of $\tilde{\boldsymbol{X}}$, but this time with $j=2^{k_{i+1}-1}$. We now observe that $\tilde{d}-j=d-t_{i}-2^{k_{i+1}-1}>d-t_{i+1}$ and thus $u,w\in\mathcal{N}_{\underline{v}}^{d-t_{i+1}}\subset\mathcal{N}_{\underline{v}}^{\tilde{d}-j}\setminus\tilde{U}_{j}$. Therefore, no structural difference is manifested at either $u$ or $w$ rel. to $\phi$ and $\boldsymbol{P}^{2^{k_{i+1}-1}}\sigma\left(\boldsymbol{Z}^{i}\right)$, i.e.,

𝐏2ki+11σ(𝐙i)[u]=𝐏2ki+11σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)] (35)

and

𝐏2ki+11σ(𝐙i)[w]=𝐏2ki+11σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)]. (36)

Together, Eqs. 33-34 and Eqs. 35-36 imply, recalling that $\boldsymbol{\Psi}_{k_{i+1}}=\boldsymbol{P}^{2^{k_{i+1}-1}}-\boldsymbol{P}^{2^{k_{i+1}}}$,

𝚿ki+1σ(𝐙i)[u]𝚿ki+1σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]\neq\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)]

and

𝚿ki+1σ(𝐙i)[w]=𝚿ki+1σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)].

Eq. 31 and 32 now follow as 𝐙i+1=𝚿ki+1σ(𝐙i)\operatorname{\boldsymbol{Z}}^{i+1}=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right). ∎
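To make the construction used in this proof concrete, the following Python sketch implements the lazy random walk of Eq. 8, diffusion wavelets consistent with the decomposition used in Eqs. 33-36, and the recursion $\boldsymbol{Z}^{i+1}=\boldsymbol{\Psi}_{k_{i+1}}\sigma(\boldsymbol{Z}^{i})$ from above. It is an illustrative sketch only: the function names are ours, $\boldsymbol{\Psi}_{0}=\boldsymbol{I}-\boldsymbol{P}$ is an assumed convention, np.tanh merely stands in for a strictly monotonic $\sigma$, and the exact definition of $\boldsymbol{U}_{p}$ is the one given in Eq. 13 of the main text.

import numpy as np

def lazy_walk(A):
    # A: dense symmetric adjacency matrix (NumPy array).
    # Lazy random-walk matrix P = (I + A D^{-1}) / 2, matching Eq. 8:
    # (P y)[u] = y[u]/2 + (1/2) sum_{w in N(u)} y[w] / d_w.
    deg = A.sum(axis=0)
    return 0.5 * (np.eye(A.shape[0]) + A / deg)

def wavelet(P, k):
    # Diffusion wavelets: Psi_0 = I - P (assumed convention) and
    # Psi_k = P^{2^{k-1}} - P^{2^k} for k >= 1, as used in Eqs. 33-36.
    if k == 0:
        return np.eye(P.shape[0]) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scatter_path(A, X, p, sigma=np.tanh):
    # Recursion Z^{i+1} = Psi_{k_{i+1}} sigma(Z^i) with Z^0 = X, as in the proof;
    # np.tanh stands in for a strictly monotonic nonlinearity sigma.
    P = lazy_walk(A)
    Z = np.asarray(X, dtype=float)
    for k in p:
        Z = wavelet(P, k) @ sigma(Z)
    return Z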

A.5 Modified Variant of Theorem 2

Theorem 3.

Similar to Theorem 2, let ϕ:VV\phi:V\rightarrow V, vVv\in V and K,LK,L\in\operatorname{\mathbb{N}} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}, and consider any K-intrinsic node feature matrix 𝐗\operatorname{\boldsymbol{X}}. Suppose there exist nodes U𝒩vK+LU\subset\operatorname{\mathcal{N}}_{v}^{K+L} where a structural difference rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} is manifested. If the nonlinearity σ\sigma is strictly monotonic and there exists a unique uUu\in U with minimal distance from vv and a unique shortest path between uu and vv, we can define a scattering configuration (p,𝚯)(p,\operatorname{\boldsymbol{\Theta}}) such that scattering features Fp-scat(𝐗)F_{p\text{-scat}}(\operatorname{\boldsymbol{X}}) defined as in Eq. 13 discriminate vv and ϕ(v)\phi(v).

In Theorem 3, we replace the assumption of no coincidental correspondence (Definition 4) used in Theorem 2 by the assumption that there is a unique node $u$ at minimal distance from $v$ at which a structural difference is manifested, together with a unique shortest path between $v$ and $u$.

Proof of Theorem 3.

The only difference between Theorem 3 and Theorem 2 is that in Theorem 2 we assume that there is no coincidental correspondence, whereas here we assume that there is a unique $u\in U$ with minimal distance from $v$ and that there is a unique shortest path between $v$ and $u$. Inspecting the proof of Theorem 2, we see that the only place where we used the assumption of no coincidental correspondence was when we invoked Lemma 1. Therefore, it suffices to show that the conclusion of Lemma 1 holds under our revised assumptions. Moreover, inspecting the proof of Lemma 1, we see that the only place where we used the assumption of no coincidental correspondence was to establish Eq. 28. Thus, it suffices to show that Eq. 28 still holds.

Let P(U0,U1,,Ud)P\coloneqq(U_{0},U_{1},\dots,U_{d}) be the generalized path from uu to vv as defined in Notation 1, and note that our assumption of a unique shortest path implies that each UjU_{j} consists of a single node uju_{j}, i.e., Uj{uj}U_{j}\coloneqq\{u_{j}\}. Since the assumption of no coincidental correspondence is not used in the base case, we can use the inductive hypothesis that uju_{j} is the unique element of 𝒩v¯dj\mathcal{N}_{\underline{v}}^{d-j} where a structural difference manifests with respect to 𝐘j\operatorname{\boldsymbol{Y}}^{j}. We are then left to show that this implies that Eq. 28 holds. But Eq. 28 simplifies to 0duj1𝐘j[uj]dϕ(uj)1𝐘j[ϕ(uj)]0\neq d_{u_{j}}^{-1}\operatorname{\boldsymbol{Y}}^{j}[u_{j}]-d_{\phi(u_{j})}^{-1}\operatorname{\boldsymbol{Y}}^{j}[\phi(u_{j})], which is true according to the inductive hypothesis. ∎

Appendix B

B.1 Technical Details

TABLE VI: Chosen hyperparameters for our Sc-GCN model on the examined datasets.
                      Scattering config.                Channel widths
Dataset               α      q    U_{p_1}   U_{p_2}     A^1   A^2   A^3   U_{p_1}   U_{p_2}
Citeseer              0.50   4    Ψ_2       Ψ_2|Ψ_3|    10    10    10    9         30
Cora                  0.35   4    Ψ_1       Ψ_3         10    10    10    11        6
Pubmed                1.00   4    Ψ_1       Ψ_2         10    10    10    13        14
DBLP (1st layer)      1.00   4    Ψ_1       Ψ_2         10    10    10    30        30
DBLP (2nd layer)      0.10   1    Ψ_1       Ψ_2         40    20    20    20        20
TABLE VII: Chosen hyperparameters for our GSAN model on the examined datasets.
Dataset       α     Number of heads   Channel width
Texas 1.5 4 128
Chameleon 0.2 64 32
CoraFull 0.5 20 128
Wiki-CS 0.3 20 20
Citeseer 0.1 8 64
Pubmed 0.1 50 64
Cora 0.1 50 64
DBLP 0.2 16 128

As with most neural networks, when implementing Sc-GCN and GSAN, one must make several architectural choices and tune several hyperparameters. In our experiments with Sc-GCN, we used either one or two hybrid layers, each comprising three GCN channels and two scattering channels, followed by a single residual convolution layer. This fairly simple setup kept network tuning manageable and was sufficient to obtain strong results that outperformed numerous competing methods; it is likely that performance could be improved further with wider or deeper networks. For Cora, Citeseer and Pubmed we used only one hybrid layer, as preliminary experiments indicated that adding a second one was not cost-effective (in light of the additional complexity created by the need to tune more hyperparameters). For DBLP, two layers were used due to a significant increase in performance. We note that even with a single hybrid layer our model achieves 73.1% test accuracy (compared to 81.5% with two layers) and still significantly outperforms GAT (66.1%) and the other methods (below 60%).
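For concreteness, the following PyTorch sketch shows the overall structure of one such hybrid layer: three low-pass (GCN-style) channels and two band-pass (scattering) channels whose outputs are concatenated node-wise. It is a schematic sketch rather than the released implementation; the propagation matrices in ops (powers of the normalized adjacency and the scattering operators U_{p_1}, U_{p_2}) are assumed to be precomputed dense tensors, and the per-channel nonlinearities shown here (ReLU and |·|^q) are placeholders for the definitions given in the main text.

import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    # Schematic hybrid layer: ops = [A_1, A_2, A_3, U_p1, U_p2] are precomputed
    # n x n propagation matrices (torch tensors); channel widths follow Tab. VI.
    def __init__(self, ops, in_dim, widths, q=4):
        super().__init__()
        assert len(ops) == len(widths)
        self.ops = ops
        self.q = q  # exponent applied on the scattering (band-pass) channels
        self.lins = nn.ModuleList(nn.Linear(in_dim, w) for w in widths)

    def forward(self, X):
        outs = []
        for i, (op, lin) in enumerate(zip(self.ops, self.lins)):
            H = op @ lin(X)
            # the last two channels are the scattering channels
            H = torch.abs(H) ** self.q if i >= len(self.ops) - 2 else torch.relu(H)
            outs.append(H)
        return torch.cat(outs, dim=-1)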

To illustrate the utility of our model on large and challenging datasets, we now compare against GCN and GAT on two OGB node classification benchmarks. Our results are shown in Tab. VIII. GSAN achieves 73.9% on ogbn-proteins using 3,832 parameters and 72.3% on ogbn-arxiv using 70,760 parameters. In comparison, GCN achieves 72.5% on ogbn-proteins using 110,120 parameters and 71.7% on ogbn-arxiv using 96,880 parameters, while GAT achieves 73.6% on ogbn-proteins using 1,238,554 parameters and 71.2% on ogbn-arxiv using 950,620 parameters. This suggests that our model can be considerably more parameter-efficient on large datasets; on ogbn-proteins, in particular, we use only about 3.5% of GCN's parameter count. Furthermore, when we increase the width of GSAN, it achieves 76.2% on ogbn-proteins using 31,856 parameters. Thus, our network is able to achieve better performance with fewer learned parameters.

TABLE VIII: Comparison of node classification test accuracy on Open Graph Benchmark datasets proteins and arxiv (higher is better).
Dataset #nodes #edges GCN GAT Sc-GCN GSAN
proteins 1.3e5 4.0e7 0.725 0.736 0.742 0.739
arxiv 1.7e5 1.1e6 0.717 0.712 0.719 0.723

Validation & testing procedure: All tests were done using train-validation-test splits of the datasets. We used the validation accuracy to tune the hyperparameters and report the test accuracy in the comparison tables. To ensure fair comparison, we used the same splits for all methods. On Citeseer, Cora and Pubmed, we used the same settings as in [5] and followed the standard practices of other works that use these datasets. For DBLP, as far as we know, no common standard is established in the literature; here, and for the other node-classification datasets without established splits, we used a ratio of 5:1:1 between train, validation, and test sets. For the graph classification and regression datasets, we used a ratio of 8:1:1.
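For illustration, a split with these ratios can be generated as in the following sketch; the helper below is ours, and the standard Citeseer/Cora/Pubmed splits from [5] are used as provided rather than regenerated.

import numpy as np

def split_indices(n, ratio=(5, 1, 1), seed=0):
    # Random train/validation/test split; ratio=(5, 1, 1) for node classification,
    # (8, 1, 1) for graph classification and regression.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = n * ratio[0] // sum(ratio)
    n_val = n * ratio[1] // sum(ratio)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]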

Hyperparameter tuning: We tuned the hyperparameters for each dataset using a grid search and selected the setting that yielded the highest accuracy on the validation set.

In Sc-GCN, we used grid search to tune the parameter α used in the residual convolution, the exponent q used in the nonlinearity, the scattering channel configurations p_1, p_2 (i.e., the scales used in these two channels), and the widths of the channels in the network, i.e., the hidden dimensions. The results of this tuning process are presented in Tab. VI. For DBLP, the hybrid-layer parameters are shared between the two layers in order to simplify the tuning process; this tuning was generally less exhaustive than for the other three datasets, since our method significantly outperformed all other methods even with limited tuning.
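The tuning loop amounts to an exhaustive search of the following form; the grid values below are illustrative and train_and_evaluate is a hypothetical helper that trains Sc-GCN with the given configuration and returns its validation accuracy.

from itertools import product

grid = {
    "alpha": [0.1, 0.35, 0.5, 1.0],   # residual-convolution parameter
    "q": [1, 4],                      # exponent of the scattering nonlinearity
    "scat_width": [10, 20, 30],       # width of the scattering channels
}

best_acc, best_cfg = -1.0, None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    acc = train_and_evaluate(**cfg)   # hypothetical: train, return validation accuracy
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg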

TABLE IX: Classification accuracies on ogbn-proteins dataset with different α\alpha.
α             0.01          0.1           0.5           1.0           3.0           5.0
Accuracy (%)  72.32 ± 0.55  72.03 ± 0.81  72.45 ± 1.30  73.93 ± 0.19  72.23 ± 0.45  71.45 ± 1.41

For GSAN, the hyperparameter selection is less intricate, which, in addition to improved performance, is one of the main advantages over Sc-GCN. In particular, we use the same scattering configurations and channel widths across all channels in a given layer and do not tune the scattering configurations: we always set m=1 and use the wavelets Ψ_1, Ψ_2, and Ψ_3 in the three scattering channels. Therefore, the hyperparameters reduce to α, a (universal) channel width, and the number of attention heads. These parameters may differ across layers when the architecture uses more than one. The results of this tuning process are presented in Tab. VII.

Hardware & software environment: All experiments were run on the same HPC cluster with an Intel i7-6850K CPU and an NVIDIA TITAN X (Pascal) GPU. Both Sc-GCN and GSAN were implemented using PyTorch [54]. Implementations of all other methods were taken directly from the code accompanying their publications.

B.2 Ablation Study

The two primary novelties in Sc-GCN and GSAN are the scattering channels (i.e., Ψ_1, Ψ_2, and Ψ_3) and the residual convolution (controlled by the hyperparameter α). To explore their contributions and the dependence of the networks' performance on the hyperparameters, Tab. IX shows classification results on the ogbn-proteins dataset for α = 0.01, 0.1, 0.5, 1.0, 3.0, 5.0 over multiple scattering channel configurations. For simplicity, we focus on ogbn-proteins, but note that similar results are observed on the other datasets.

TABLE X: Impact of removing each individual channel from the optimal configuration on ogbn-proteins, while classifying using the remaining channels. Full GSAN accuracy is 73.93%.
Removed channel   A       A^2     A^3     Ψ_3     Ψ_2     Ψ_1
Accuracy (%)      71.22   71.23   72.16   71.47   72.98   70.75

To examine the importance of the parameter α used in the residual graph convolution layer, we first observe that setting α = 0 effectively eliminates this layer (since then A_res = I_n), while increasing α strengthens the filtering effect, approaching random-walk filtering as α → ∞. We evaluate the effect of this parameter on classification accuracy. Our results, displayed in Tab. IX, indicate that increasing α to non-negligible nonzero values improves classification performance, which we interpret as the benefit of removing high-frequency noise. However, when α increases further (in particular, at α = 5 here), the smoothing provided by this layer degrades performance to a level close to that of the traditional GCN [5].
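The limiting behavior described above is reproduced, for instance, by a residual propagation matrix of the form A_res = (I_n + α A D^{-1}) / (1 + α); this particular normalization is an assumption used here only for illustration (the exact definition of A_res is given in the main text), but any matrix with these limits interpolates between the identity at α = 0 and random-walk filtering as α → ∞.

import numpy as np

def residual_matrix(A, alpha):
    # Illustrative residual-convolution matrix (assumed form, see text above):
    # alpha = 0 gives the identity, alpha -> inf approaches the random walk A D^{-1}.
    deg = A.sum(axis=0)
    n = A.shape[0]
    return (np.eye(n) + alpha * (A / deg)) / (1.0 + alpha)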

Finally, to examine the relative importance of the low-pass and band-pass information, we provide an ablation study of the impact each channel. We focus on the ogbn-proteins dataset, set α=1.0\alpha=1.0 and use the best channel configuration, which achieved 73.93% accuracy when all channels were used. Then, we remove each of the band-pass or low-pass channels individually from the network, and reevaluate network performance with the remaining four channels. Our results are displayed in Tab. X. They indicate that the information captured by both the band-pass and the low-pass channels is important and that eliminating either decreases performance. In particular, we note that eliminating 𝚿1\boldsymbol{\Psi}_{1} has a major impact on the accuracy.
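The ablation itself is a simple leave-one-out loop of the following form; train_gsan is a hypothetical stand-in for training and evaluating GSAN on ogbn-proteins with a given list of channels.

channels = ["A", "A^2", "A^3", "Psi_1", "Psi_2", "Psi_3"]

full_acc = train_gsan(channels, alpha=1.0)      # all channels (73.93% reported above)
for removed in channels:
    remaining = [c for c in channels if c != removed]
    acc = train_gsan(remaining, alpha=1.0)      # retrain without one channel
    print(f"without {removed}: {acc:.2f} (full model: {full_acc:.2f})")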

B.3 Implementation

Python code accompanying this work is available on github.com/dms-net/scatteringGCN and github.com/dms-net/Attention-based-Scattering.