
Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks

Frederik Wenkel, Yimeng Min, Matthew Hirn, Michael Perlmutter, and Guy Wolf. The first two authors contributed equally; the last two authors jointly supervised the work; G. Wolf is the corresponding author. F. Wenkel and G. Wolf are with Mila (Quebec Artificial Intelligence Institute) and the Department of Mathematics and Statistics, University of Montreal, Montreal, QC H3T 1J4, Canada. Y. Min is with the Department of Computer Science, Cornell University, NY 14850, USA. M. Hirn is with the Department of Computational Mathematics, Science & Engineering, the Department of Mathematics, and the Center for Quantum Computing, Science & Engineering, Michigan State University, East Lansing, MI 48824, USA, and the Midwest Quantum Collaboratory. M. Perlmutter is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA.
Abstract

Geometric deep learning has made great strides towards generalizing the design of structure-aware neural networks from traditional domains to non-Euclidean ones, giving rise to graph neural networks (GNNs) that can be applied to graph-structured data arising in, e.g., social networks, biochemistry, and materials science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models are often constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, leading to oversmoothing. Moreover, to avoid severe oversmoothing, most popular GCN-style networks tend to be shallow, with narrow receptive fields, leading to underreaching. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via geometric scattering. We further introduce an attention framework that allows the model to locally attend over combined information from different filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.

Index Terms:
Geometric Deep Learning, Graph Neural Networks, Geometric Scattering, Oversmoothing, Underreaching.

1 Introduction

Deep learning is typically most effective when the structure of the data can be used to design the architecture of the relevant network. For example, the design of recurrent neural networks is informed by the sequential nature of time-series data. Similarly, the design of convolutional neural networks is based in part on the fact that the pixels of an image are arranged in a rectangular grid. The success of neural networks in these, as well as many other applications, has inspired the rise of geometric deep learning[1, 2], which aims to extend the success of deep learning to other forms of structured data and to develop intelligent methods for data sets that have a non-Euclidean structure.

A common approach in geometric deep learning is to model the data by a graph. In many applications, this is done by defining edges between data points that interact in a specific way, e.g., “friends” on a social network. In many other applications, one may construct a graph from a high-dimensional data set, either by defining an edge between each point and its $k$-nearest neighbors or by defining weighted edges via a similarity kernel. Inspired by the increasing ubiquity of graph-structured data, numerous recent works have shown graph neural networks (GNNs) to perform well in a variety of fields including biology, chemistry and social networks [3, 4, 5]. In these methods, the graph is often considered in conjunction with a set of node features, which contain “local” information about, e.g., each user of a social network.

One common family of tasks are graph-level tasks, where one seeks to learn a whole-graph representation for the purposes of, e.g., predicting properties of proteins [6, 7, 8]. Another common family, which has been the primary focus of graph convolutional networks (GCNs) [5], are node-level tasks such as node classification. There, the entire data set is modeled as one large graph and the network aims to produce a useful representation of each node using both the node features and the graph structure. This work is typically conducted in the semi-supervised setting where one only knows the labels of a small fraction of the nodes.

Many popular state-of-the-art GNNs essentially aim to promote similarity between adjacent nodes, which may be interpreted as a smoothing operation. While this is effective in certain settings, it can also cause a decrease in performance because of the oversmoothing problem [9], where nodes become increasingly indistinguishable from one another after each subsequent layer. In order to partially mitigate the oversmoothing problem, many popular GNNs only use two or three layers. While this does help avoid oversmoothing, it creates a new problem, underreaching, where the network is unable to incorporate long-range dependencies or global geometry.

Here, we propose to augment traditional GNN architectures with novel band-pass filters, in addition to conventional GCN-style filters that essentially perform low-pass filtering [10], in order to extract richer representations for each node. (Throughout this text, we use the term GCN to refer to the network introduced by Kipf and Welling in [5], and the term GNN to refer to graph neural networks, spectral, convolutional, or otherwise, in general.) This approach is based on the geometric scattering transform [11, 12, 13], whose construction is inspired by the Euclidean scattering transform introduced by Mallat in [14], and which utilizes iterative cascades of graph wavelets [15, 16] and pointwise nonlinear activation functions to produce useful graph data representations. Notably, these representations are able to capture both high-frequency information and long-range dependencies.

The main contribution of this work is two hybrid GNN frameworks that utilize both traditional GCN-style filters and also novel filters based on the scattering transform. This approach is based on the following simple idea: GCN-based networks are very useful, but as they aim to enforce similarity among nodes, they essentially focus on low-frequency information. Wavelets, on the other hand, are naturally equipped to capture high-frequency information. In the spatial domain, the GCN-style filters can be thought of as focusing on localized information where the wavelets are able to capture longer-range interactions. Therefore, in a hybrid network, the different channels capture different types of information. Such a network is therefore more powerful than a network that only uses one style of filter.

We also introduce complementary GNN modules that enhance the performance of such hybrid scattering models, including (i) the graph residual convolution, an adaptive low-pass filter that corrects high-frequency noise, and (ii) an attention framework that enables the aggregation of information from different filters individually at every node. We present theoretical results, based on a new notion of graph structural difference, that highlight the sensitivity of scattering filters to graph regularity patterns not captured by GCN filters due to underreaching. Extensive experiments demonstrate the ability of hybrid scattering networks, in (transductive) semi-supervised node classification, to (i) alleviate oversmoothing and (ii) generalize to complex (low-homophily) datasets. Moreover, we show that our framework also translates well to inductive graph-level tasks.

The remainder of this paper is organized as follows. We review related work on GNN models and geometric scattering in Sec. 2 and introduce important concepts that will be used throughout this work in Sec. 3. We then formulate the hybrid scattering network in Sec. 4, followed by a theoretical study of the benefits of such models in Sec. 5. In Sec. 6, we present empirical results before concluding in Sec. 7.

2 Related Work

Theoretical analyses [9, 10] of GCN and related models show that they may be viewed as Laplacian smoothing operations and, from a signal processing perspective, essentially perform low-pass filtering of the graph features. One approach towards addressing this problem is the graph attention network proposed in [17], which uses self-attention mechanisms to adaptively reweight local neighborhoods. In [18], the authors construct a low-rank approximation of the graph Laplacian that efficiently gathers multi-scale information and demonstrate the effectiveness of their method on citation networks and the QM8 quantum chemistry dataset. In [19], the authors take an approach similar to GCN, but use multiple powers of the adjacency matrix to learn higher-order neighborhood information. Finally, in [20] the authors use graph wavelets to extract higher-order neighborhood information.

In addition to the learned networks discussed above, several works [11, 12, 13] have introduced different variations of the graph scattering transform. These papers aim to extend the Euclidean scattering transform of Mallat [14] to graph-structured data and propose predesigned, wavelet-based networks. Extensive theoretical studies [13, 11, 21, 22] show that these networks have desirable stability, invariance, and conservation-of-energy properties. Their practical utility has been established in [12], which primarily focuses on graph classification, and in [23, 24, 25], which use the graph scattering transform to generate molecules. Building on these results, which use handcrafted formulations of the scattering transform, recent work [26] has proposed a framework for data-driven tuning of the traditionally handcrafted geometric scattering design that maintains the theoretical properties of traditional designs, while also showing strong empirical results in whole-graph settings.

3 Geometric Deep Learning Background

3.1 Graph Signal Processing

Let $G=(V,E,w)$ be a weighted graph, characterized by a set of nodes (also called vertices) $V\coloneqq\{v_1,\dots,v_n\}$, a set of undirected edges $E\subseteq V\times V$, and a function $w:E\to(0,\infty)$ assigning positive edge weights to the edges. Let $\mathbf{X}\in\mathbb{R}^{n\times d_X}$ be a node feature matrix. We shall interpret the $i$-th row of $\mathbf{X}$ as representing the features of the node $v_i$, and therefore, we shall denote these rows by either $\mathbf{X}[v_i]\in\mathbb{R}^{1\times d_X}$ or $\mathbf{X}_{v_i}\in\mathbb{R}^{1\times d_X}$. The columns of $\mathbf{X}$, on the other hand, will be denoted by $\mathbf{x}_i$. Each of these columns may be naturally identified with a graph signal, i.e., a function $x_i:V\rightarrow\mathbb{R}$ with $x_i(v_j)\coloneqq\mathbf{x}_i[j]$. In what follows, for simplicity, we will not distinguish between the vectors $\mathbf{x}_i$ and the functions $x_i$ and will refer to both as graph signals.

We define the weighted adjacency matrix $\mathbf{W}\in\mathbb{R}^{n\times n}$ of $G$ by $\mathbf{W}[v_i,v_j]\coloneqq w(v_i,v_j)$ if $\{v_i,v_j\}\in E$, and set all other entries to zero. We further define the degree matrix $\mathbf{D}\in\mathbb{R}^{n\times n}$ as the diagonal matrix $\mathbf{D}\coloneqq\operatorname{diag}(\deg(v_1),\dots,\deg(v_n))$, where $\deg(v_i)\coloneqq\sum_{j=1}^n\mathbf{W}[v_i,v_j]$ is the degree of the node $v_i$. In the following, we will also use the shorthand $d_{v_i}$ to denote the degree of $v_i$. We consider the combinatorial graph Laplacian matrix $\mathbf{L}\coloneqq\mathbf{D}-\mathbf{W}$ and the symmetric normalized Laplacian given by

\[
\boldsymbol{\mathcal{L}} \coloneqq \mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2} = \mathbf{I}_n - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}.
\]

It is well known that this matrix is symmetric, positive semi-definite, and admits an orthonormal basis of eigenvectors $\mathbf{q}_1,\dots,\mathbf{q}_n$ such that $\boldsymbol{\mathcal{L}}\mathbf{q}_i=\lambda_i\mathbf{q}_i$. Therefore, we may write

\[
\boldsymbol{\mathcal{L}} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T = \sum_{i=1}^{n}\lambda_i\,\mathbf{q}_i\mathbf{q}_i^T,
\]

where $\boldsymbol{\Lambda}\coloneqq\operatorname{diag}(\lambda_1,\dots,\lambda_n)$ and $\mathbf{Q}$ is the orthogonal matrix whose $i$-th column is $\mathbf{q}_i$.

We will use the eigendecomposition of $\boldsymbol{\mathcal{L}}$ to define the graph Fourier transform, with the eigenvectors $\mathbf{q}_1,\dots,\mathbf{q}_n$ interpreted as Fourier modes. The notion of oscillation on irregular domains such as graphs is delicate, but it can be reframed in terms of the increasing variation of the modes, with the eigenvalues $0=\lambda_1\leq\lambda_2\leq\dots\leq\lambda_n\leq 2$ interpreted as (squared) frequencies. (This interpretation originates from motivating the graph Fourier transform via the combinatorial graph Laplacian $\mathbf{L}=\sum_{i=1}^n\tilde\lambda_i\tilde{\mathbf{q}}_i\tilde{\mathbf{q}}_i^T$, where $\tilde\lambda_i=\tilde{\mathbf{q}}_i^T\mathbf{L}\tilde{\mathbf{q}}_i=\tfrac{1}{2}\sum_{j,k=1}^n\mathbf{W}[v_j,v_k]\big(\tilde{\mathbf{q}}_i[j]-\tilde{\mathbf{q}}_i[k]\big)^2$ measures the variation of $\tilde{\mathbf{q}}_i$.) The Fourier transform of a graph signal $\mathbf{x}\in\mathbb{R}^n$ is defined by $\hat{\mathbf{x}}[i]\coloneqq\langle\mathbf{x},\mathbf{q}_i\rangle$ for $1\leq i\leq n$, and the inverse Fourier transform may be computed as $\mathbf{x}=\sum_{i=1}^n\hat{\mathbf{x}}[i]\mathbf{q}_i$. It will frequently be convenient to write these equations in matrix form as $\hat{\mathbf{x}}=\mathbf{Q}^T\mathbf{x}$ and $\mathbf{x}=\mathbf{Q}\hat{\mathbf{x}}$.

We recall that in the Euclidean setting, the convolution of two signals in the spatial domain corresponds to the pointwise multiplication of their Fourier transforms. Therefore, we may define the convolution of a signal $\mathbf{x}$ with a filter $\mathbf{g}$ by the rule that $\mathbf{g}\star\mathbf{x}$ is the unique vector such that $(\widehat{\mathbf{g}\star\mathbf{x}})[i]=\hat{\mathbf{g}}[i]\hat{\mathbf{x}}[i]$ for $1\leq i\leq n$. Applying the inverse Fourier transform, one may verify that

\[
\mathbf{g}\star\mathbf{x} = \sum_{i=1}^{n}\hat{\mathbf{g}}[i]\hat{\mathbf{x}}[i]\,\mathbf{q}_i = \sum_{i=1}^{n}\hat{\mathbf{g}}[i]\,\mathbf{q}_i\mathbf{q}_i^T\mathbf{x} = \mathbf{Q}\widehat{\mathbf{G}}\mathbf{Q}^T\mathbf{x}, \tag{1}
\]

where $\widehat{\mathbf{G}}\coloneqq\operatorname{diag}(\hat{\mathbf{g}})=\operatorname{diag}(\hat{\mathbf{g}}[1],\dots,\hat{\mathbf{g}}[n])$. Hence, convolutional graph filters can be parameterized by the Fourier coefficients in $\widehat{\mathbf{G}}$.
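To make the spectral filtering recipe of Eq. 1 concrete, the following minimal NumPy sketch (not the authors' implementation; the small ring graph and the choice $\hat{\mathbf{g}}[i]=2-\lambda_i$ are our own illustrative assumptions) builds the normalized Laplacian of a toy graph, computes the graph Fourier transform of a signal, and applies a filter as $\mathbf{Q}\widehat{\mathbf{G}}\mathbf{Q}^T\mathbf{x}$.

```python
import numpy as np

# Toy example: an unweighted ring graph on six nodes (our own assumption).
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt   # symmetric normalized Laplacian

# Eigendecomposition: columns of Q are the Fourier modes, lam the (squared) frequencies.
lam, Q = np.linalg.eigh(L_sym)

x = np.random.randn(n)              # a graph signal
x_hat = Q.T @ x                     # graph Fourier transform
assert np.allclose(Q @ x_hat, x)    # inverse transform recovers x

# Spectral filtering (Eq. 1): pointwise multiplication in the Fourier domain.
g_hat = 2.0 - lam                   # e.g., a GCN-style low-pass response 2 - lambda_i
y = Q @ np.diag(g_hat) @ Q.T @ x
```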

3.2 Spectral Graph Neural Network Constructions

A graph filter is a function $F:\mathbb{R}^{n\times d_X}\rightarrow\mathbb{R}^{n\times d_Y}$ that transforms a node feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d_X}$ into a new feature matrix $\mathbf{Y}\in\mathbb{R}^{n\times d_Y}$. GNNs typically feature several layers, each of which produces a new set of features by filtering the output of the previous layer. We will usually let $\mathbf{X}^0$ denote the initial node feature matrix, which is the input to the network, and let $\mathbf{X}^{\ell}$ denote the node feature matrix after the $\ell$-th layer.

In light of Eq. 1, a natural way to construct learned graph filters would be to directly learn the Fourier coefficients in $\hat{\mathbf{g}}$. Indeed, this was the approach used in the pioneering work of Bruna et al. [27]. However, this approach has several drawbacks. First, it results in $n$ learnable parameters in each convolutional filter. Therefore, a network using such filters would not scale well to large data sets due to the computational cost. At the same time, such filters are not necessarily well-localized in space and are prone to overfitting [28]. Moreover, networks of the form introduced in [27] typically cannot be generalized to different graphs [1]. However, recent work [29] has shown that this latter issue can be overcome by formulating the Fourier coefficients $\hat{\mathbf{g}}[i]$ as smooth functions of the Laplacian eigenvalues $\lambda_i$, $1\leq i\leq n$. In particular, this will be the case for the filters used in the networks considered in this work.

A common approach (used, e.g., in [28, 5, 30, 31, 18]) to formulate such filters is to use polynomials of the Laplacian eigenvalues, setting $\hat{\mathbf{g}}[i]\coloneqq\sum_{j=0}^{k}\theta_j\lambda_i^j$ (or equivalently $\widehat{\mathbf{G}}=\sum_{j=0}^{k}\theta_j\boldsymbol{\Lambda}^j$) for some $k\ll n$. It can be verified [28] that this approach yields convolutional filters that are $k$-localized in space and that can be written as $\mathbf{g}_{\boldsymbol{\theta}}\star\mathbf{x}=\sum_{j=0}^{k}\theta_j\boldsymbol{\mathcal{L}}^j\mathbf{x}$. This reduces the number of trainable parameters in each filter from $n$ to $k+1$ and allows one to perform convolution without explicitly computing the spectral decomposition of the Laplacian, which is expensive for large graphs.

One particularly noteworthy network that uses this method is [28], which writes the filters in terms of the Chebyshev polynomials defined by $T_0(x)=1$, $T_1(x)=x$, and $T_j(x)\coloneqq 2xT_{j-1}(x)-T_{j-2}(x)$. The authors first renormalize the eigenvalue matrix, $\tilde{\boldsymbol{\Lambda}}\coloneqq 2\boldsymbol{\Lambda}/\lambda_n-\mathbf{I}_n$, and then define $\widehat{\mathbf{G}}\coloneqq\sum_{j=0}^{k}\theta_j T_j(\tilde{\boldsymbol{\Lambda}})$. Letting $\tilde{\boldsymbol{\mathcal{L}}}\coloneqq 2\boldsymbol{\mathcal{L}}/\lambda_n-\mathbf{I}_n$ yields

\[
\mathbf{g}_{\boldsymbol{\theta}}\star\mathbf{x} = \sum_{j=0}^{k}\theta_j\,T_j(\tilde{\boldsymbol{\mathcal{L}}})\,\mathbf{x}. \tag{2}
\]
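As an illustration of Eq. 2, here is a minimal NumPy sketch of Chebyshev filtering (the function name `chebyshev_filter` and the coefficient vector `theta` are our own; this is not the implementation of [28]). It applies the recursion $T_j(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}=2\tilde{\boldsymbol{\mathcal{L}}}\,T_{j-1}(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}-T_{j-2}(\tilde{\boldsymbol{\mathcal{L}}})\mathbf{x}$, so no eigendecomposition is needed.

```python
import numpy as np

def chebyshev_filter(L_sym, x, theta, lam_max=2.0):
    """Apply g_theta * x = sum_j theta_j T_j(L_tilde) x (Eq. 2) via the
    Chebyshev recursion, without an explicit eigendecomposition."""
    n = L_sym.shape[0]
    L_tilde = 2.0 * L_sym / lam_max - np.eye(n)   # rescale the spectrum to [-1, 1]
    t_prev, t_curr = x, L_tilde @ x               # T_0(L)x = x, T_1(L)x = L_tilde x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for j in range(2, len(theta)):
        t_next = 2.0 * L_tilde @ t_curr - t_prev  # T_j = 2x T_{j-1} - T_{j-2}
        out = out + theta[j] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```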

3.3 Graph Convolutional Networks

One of the most widely used GNNs is the Graph Convolutional Network (GCN) [5]. This network is derived from the Chebyshev construction [28] mentioned above by setting $k=1$ in Eq. 2 and approximating $\lambda_n\approx 2$, which yields

\[
\mathbf{g}_{\theta_0,\theta_1}\star\mathbf{x} \approx \theta_0\mathbf{x} + \theta_1(\boldsymbol{\mathcal{L}}-\mathbf{I}_n)\mathbf{x} = \theta_0\mathbf{x} - \theta_1\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\mathbf{x}.
\]

To further reduce the number of trainable parameters, the authors then set $\theta\coloneqq\theta_0=-\theta_1$. The resulting convolutional filter has the form

\[
\mathbf{g}_{\theta}\star\mathbf{x} = \theta\left(\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)\mathbf{x}. \tag{3}
\]

One may verify that $\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}=2\mathbf{I}_n-\boldsymbol{\mathcal{L}}$, and therefore Eq. 3 essentially corresponds to setting $\hat{\mathbf{g}}[i]\coloneqq\theta(2-\lambda_i)$ in Eq. 1. The eigenvalues of $\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ take values in $[0,2]$. Thus, to avoid vanishing or exploding gradients, the authors use a renormalization trick

\[
\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} \;\rightarrow\; \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}, \tag{4}
\]

where $\tilde{\mathbf{W}}\coloneqq\mathbf{I}_n+\mathbf{W}$ and $\tilde{\mathbf{D}}$ is the diagonal matrix with $\tilde{\mathbf{D}}[v_i,v_i]\coloneqq\sum_{j=1}^{n}\tilde{\mathbf{W}}[v_i,v_j]$ for $i\in[n]$. Setting $\mathbf{A}\coloneqq\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}$ and using multiple channels, we obtain a layer-wise propagation rule of the form $\mathbf{x}_j^{\ell}=\sigma\big(\sum_{i=1}^{N_{\ell-1}}\theta_{ij}^{\ell}\,\mathbf{A}\mathbf{x}_i^{\ell-1}\big)$, where $N_{\ell}$ is the number of channels used in the $\ell$-th layer and $\sigma(\cdot)$ is an elementwise nonlinearity. In matrix form, we write

\[
\mathbf{X}^{\ell} = F_{\text{gcn}}\left(\mathbf{X}^{\ell-1}\right) = \sigma\left(\mathbf{A}\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}\right). \tag{5}
\]

We interpret the matrix $\mathbf{A}$ as computing a localized average of each channel $\mathbf{x}_i^{\ell-1}$ around each node, and the matrix $\boldsymbol{\Theta}^{\ell}$ as sharing information across channels. This filter can also be written at the node level as

\[
\mathbf{X}^{\ell}[v] = \sigma\left(\sum_{w\in\mathcal{N}_{\underline{v}}}\frac{1}{\sqrt{(d_v+1)(d_w+1)}}\,\mathbf{X}^{\ell-1}[w]\,\boldsymbol{\Theta}^{\ell}\right),
\]

where $\mathcal{N}_v$ denotes the one-step neighborhood of node $v$ and $\mathcal{N}_{\underline{v}}\coloneqq\mathcal{N}_v\cup\{v\}$. This process can be split into three steps:

\[
\mathbf{X}_a^{\ell}[w] = \mathbf{X}^{\ell-1}[w]\,\boldsymbol{\Theta}^{\ell} \quad\text{for all } w\in\mathcal{N}_{\underline{v}}, \tag{6a}
\]
\[
\mathbf{X}_b^{\ell}[v] = \sum_{w\in\mathcal{N}_{\underline{v}}}\frac{1}{\sqrt{(d_v+1)(d_w+1)}}\,\mathbf{X}_a^{\ell}[w], \tag{6b}
\]
\[
\mathbf{X}_c^{\ell}[v] = \sigma\left(\mathbf{X}_b^{\ell}[v]\right), \tag{6c}
\]

which we refer to as the transformation step (Eq. 6a), the aggregation step (Eq. 6b) and the activation step (Eq. 6c).
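The three steps above can be condensed into a few lines. The following sketch (a simplified, dense NumPy illustration rather than the reference GCN implementation; the choice of ReLU as $\sigma$ is our own default) applies the renormalization trick of Eq. 4 and the matrix-form update of Eq. 5.

```python
import numpy as np

def gcn_layer(W, X, Theta, sigma=lambda z: np.maximum(z, 0.0)):
    """One GCN layer X^l = sigma(A X^{l-1} Theta) with the renormalization
    trick A = D_tilde^{-1/2} (I + W) D_tilde^{-1/2} (Eqs. 4-5)."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)                   # add self-loops
    d_tilde = W_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A = D_inv_sqrt @ W_tilde @ D_inv_sqrt     # renormalized propagation matrix
    return sigma(A @ X @ Theta)               # aggregate, transform, activate
```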

As discussed earlier, the GCN filter described above may be viewed as a low-pass filter that suppresses high-frequency information. For simplicity, we focus on the convolution in Eq. 3 before the renormalization. This convolution essentially corresponds to pointwise Fourier multiplication by $\hat{\mathbf{g}}[i]=\theta(2-\lambda_i)$, which is strictly decreasing in $\lambda_i$. Therefore, repeated applications of this filter effectively zero out the higher frequencies. This is consistent with the oversmoothing problem discussed in [9].

3.4 Graph Attention Networks

Another popular network that is widely used for node classification tasks is the graph attention network (GAT) [17], which uses an attention mechanism to guide and adjust the aggregation of features from adjacent nodes. First, the node features $\mathbf{X}^{\ell-1}$ are linearly transformed to $\bar{\mathbf{X}}^{\ell}\coloneqq\mathbf{X}^{\ell-1}\boldsymbol{\Theta}$ using a learned weight matrix $\boldsymbol{\Theta}\in\mathbb{R}^{d\times d'}$. Then, the aggregation coefficients are learned via

\[
\alpha_{v\leftarrow u} = \frac{\exp\left(\operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}[v]\,\|\,\bar{\mathbf{X}}^{\ell}[u]\right]\mathbf{a}\right)\right)}{\sum_{w\in\mathcal{N}_{\underline{v}}}\exp\left(\operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}[v]\,\|\,\bar{\mathbf{X}}^{\ell}[w]\right]\mathbf{a}\right)\right)},
\]

where $\mathbf{a}\in\mathbb{R}^{2d'}$ is a shared attention vector and $\|$ denotes horizontal concatenation. The output feature corresponding to a single attention head is given by $\mathbf{X}^{\ell}[v]=\sigma\big(\sum_{u\in\mathcal{N}_v}\alpha_{v\leftarrow u}\bar{\mathbf{X}}^{\ell}[u]\big)$. To increase the expressivity of this network, the authors then use a multi-headed attention mechanism, with $\Gamma$ heads, to generate the concatenated features

\[
\mathbf{X}^{\ell}[v] = \big\|_{\gamma=1}^{\Gamma}\,\sigma\left(\sum_{u\in\mathcal{N}_{\underline{v}}}\alpha_{v\leftarrow u}^{\gamma}\,\bar{\mathbf{X}}_{\gamma}^{\ell}[u]\right). \tag{7}
\]
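For concreteness, the following sketch implements a single GAT-style attention head as described above, using dense NumPy operations (a toy illustration rather than the reference implementation; the ReLU output nonlinearity and the LeakyReLU slope of 0.2 are our own assumptions).

```python
import numpy as np

def gat_head(W, X, Theta, a, leaky_slope=0.2):
    """One attention head: coefficients alpha_{v<-u} over N_v plus a self-loop,
    followed by the attention-weighted aggregation of transformed features."""
    n = W.shape[0]
    X_bar = X @ Theta                              # shared linear transform
    adj = (W > 0) | np.eye(n, dtype=bool)          # neighbors plus self-loop
    H = np.full((n, n), -np.inf)                   # unnormalized attention logits
    for v in range(n):
        for u in range(n):
            if adj[v, u]:
                z = np.concatenate([X_bar[v], X_bar[u]]) @ a
                H[v, u] = z if z > 0 else leaky_slope * z   # LeakyReLU
    alpha = np.exp(H - H.max(axis=1, keepdims=True))        # softmax over neighbors
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.maximum(alpha @ X_bar, 0.0)                    # sigma = ReLU (assumed)
```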

3.5 Challenges in Geometric Deep Learning

Many GNN models, including GCN [5] and GAT [17], are subject to the so-called oversmoothing problem [9], caused by aggregation steps (such as Eq. 6b) that essentially consist of localized averaging operations. As discussed in Sec. 3.3 and also [10], from a signal processing point of view, this corresponds to a low-pass filtering of graph signals. Moreover, as discussed in [32], these networks are also subject to underreaching. Most GNNs (including GCN and GAT) can only relate information from nodes within a distance equal to the number of GNN layers, and because of the oversmoothing problem, they typically use a small number of layers in practice. Therefore, the oversmoothing and underreaching problems combine to significantly limit the ability of GNNs to capture long-range dependencies. In Sec. 4, we will introduce a hybrid network, which aims to address these challenges by using both GCN-style channels and channels based on the geometric scattering transform discussed below.

3.6 Geometric Scattering

In this section, we review the geometric scattering transform constructed in [12] for graph classification and show how it may be adapted for node-level tasks. As we shall see, this node-level geometric scattering will address the challenges discussed above in Sec. 3.5, by using band-pass filters that capture high-frequency information and have wider receptive fields.

The geometric scattering transform uses wavelets based upon raising the lazy random walk matrix

\[
\mathbf{P} \coloneqq \frac{1}{2}\big(\mathbf{I}_n + \mathbf{W}\mathbf{D}^{-1}\big),
\]

to dyadic powers $2^j$, which can be interpreted as differing degrees of resolution. The right eigenvectors and eigenvalues of $\mathbf{P}$ are $\mathbf{u}_i\coloneqq\mathbf{D}^{1/2}\mathbf{q}_i$ and $\omega_i\coloneqq 1-\lambda_i/2$, $1\leq i\leq n$, respectively. Entrywise, we note that

\[
(\mathbf{P}\mathbf{X})[v] = \frac{1}{2}\mathbf{X}[v] + \frac{1}{2}\sum_{w\in\mathcal{N}_v} d_w^{-1}\,\mathbf{X}[w]. \tag{8}
\]

Thus, $\mathbf{P}$ may be viewed as a localized averaging operator analogous to those used in, e.g., GCN, and the powers $\mathbf{P}^{2^j}$ may be viewed as low-pass filters that suppress high frequencies. In order to better retain this high-frequency information, we define multiscale graph diffusion wavelets by subtracting these low-pass filters at different scales [16]. Specifically, for $k\in\mathbb{N}_0$, we define a wavelet $\boldsymbol{\Psi}_k\in\mathbb{R}^{n\times n}$ at scale $2^k$ by

\[
\boldsymbol{\Psi}_0 \coloneqq \mathbf{I}_n - \mathbf{P}, \qquad \boldsymbol{\Psi}_k \coloneqq \mathbf{P}^{2^{k-1}} - \mathbf{P}^{2^{k}}, \quad k\geq 1. \tag{9}
\]

We may interpret each $\boldsymbol{\Psi}_k$ as capturing information at a different frequency band. From a spatial perspective, we may view $\boldsymbol{\Psi}_k$ as encoding how a $2^k$-step neighborhood differs from a smaller $2^{k-1}$-step one. Such wavelets are usually organized in a filter bank

\[
\{\boldsymbol{\Psi}_k, \boldsymbol{\Phi}_K\}_{0\leq k\leq K}, \tag{10}
\]

along with a low-pass filter $\boldsymbol{\Phi}_K\coloneqq\mathbf{P}^{2^{K}}$. Proposition 2.2 of [22] (restated here as Proposition 1) shows that this filter bank is a self-adjoint, non-expansive frame on a suitably weighted inner-product space.
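A minimal sketch of the filter bank in Eqs. 9-10 is given below (dense NumPy, suitable only for small graphs; the function name `wavelet_filter_bank` is our own). It computes the powers $\mathbf{P}^{2^k}$ by repeated squaring and forms the wavelets by subtraction.

```python
import numpy as np

def wavelet_filter_bank(W, K):
    """Build the lazy-random-walk diffusion wavelets Psi_0, ..., Psi_K and the
    low-pass filter Phi_K of Eqs. 9-10 as dense matrices."""
    n = W.shape[0]
    P = 0.5 * (np.eye(n) + W @ np.diag(1.0 / W.sum(axis=0)))   # P = (I + W D^{-1}) / 2
    psis = [np.eye(n) - P]                                     # Psi_0 = I - P
    P_pow = P                                                  # holds P^{2^{k-1}}
    for k in range(1, K + 1):
        P_next = P_pow @ P_pow                                 # P^{2^k} by squaring
        psis.append(P_pow - P_next)                            # Psi_k = P^{2^{k-1}} - P^{2^k}
        P_pow = P_next
    return psis, P_pow                                         # P_pow is now Phi_K = P^{2^K}
```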

Figure 1: Illustration of two-layer geometric scattering at the node level ($\mathbf{U}(\mathbf{x})=\{\mathbf{U}_p\mathbf{x} : p\in\mathbb{N}_0^m,\, m=0,1,2\}$) and at the graph level ($\mathbf{S}(\mathbf{x})=\{\mathbf{S}_{p,q}\mathbf{x} : q\in\mathbb{N},\, p\in\mathbb{N}_0^m,\, m=0,1,2\}$), extracted according to the wavelet cascade in Eqs. 9-12.

The geometric scattering transform is a multi-layered architecture that iteratively applies wavelet convolutions and nonlinear activation functions in an alternating cascade, as illustrated in Fig. 1. It is parameterized by paths $p\coloneqq(k_1,\dots,k_m)$. Formally, we define

\[
\mathbf{U}_p\mathbf{x} \coloneqq \boldsymbol{\Psi}_{k_m}\circ\sigma\circ\boldsymbol{\Psi}_{k_{m-1}}\circ\cdots\circ\sigma\circ\boldsymbol{\Psi}_{k_2}\circ\sigma\circ\boldsymbol{\Psi}_{k_1}\mathbf{x}, \tag{11}
\]

where $\sigma$ is a nonlinear activation function. (In a slight deviation from previous work, here $\mathbf{U}_p$ does not include the outermost nonlinearity in the cascade.) We note that in other work focusing on graph classification, the scattering features are frequently aggregated into $q$-th order moments,

\[
\mathbf{S}_{p,q}\mathbf{x} \coloneqq \sum_{i=1}^{n}\left|(\mathbf{U}_p\mathbf{x})[v_i]\right|^q. \tag{12}
\]

We also note that the nonlinearity $\sigma$ might vary in each step of the cascade. However, we will suppress this possible dependence to avoid cumbersome notation. We also note that in our theoretical results, if we assume, for example, that $\sigma$ is strictly increasing, this assumption is intended to apply to all nonlinearities used in the cascade. In our experiments, we use the absolute value, i.e., $\sigma(\mathbf{x}[v_i])=|\mathbf{x}[v_i]|$.

The original formulations of geometric scattering were fully designed networks without any learned convolutions between channels. Here, we will incorporate learning by defining the following scattering propagation rule similar to the one used by GCN:

\[
\mathbf{X}^{\ell} = F_{p\text{-scat}}\left(\mathbf{X}^{\ell-1}\right) = \sigma\left(\mathbf{U}_p\big(\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}\big)\right). \tag{13}
\]

Analogously to GCN, we note that for $v\in V$, the scattering propagation rule can be decomposed into three steps:

\[
\mathbf{X}_{a'}^{\ell}[v] = \mathbf{X}^{\ell-1}[v]\,\boldsymbol{\Theta}^{\ell}, \tag{14a}
\]
\[
\mathbf{X}_{b'}^{\ell}[v] = \left(\mathbf{U}_p(\mathbf{X}_{a'}^{\ell})\right)[v], \tag{14b}
\]
\[
\mathbf{X}_{c'}^{\ell}[v] = \sigma\left(\mathbf{X}_{b'}^{\ell}[v]\right)^q. \tag{14c}
\]

Importantly, we note that the scattering transform addresses the underreaching problem, as the wavelets $\boldsymbol{\Psi}_k=\mathbf{P}^{2^{k-1}}-\mathbf{P}^{2^{k}}$ that are leveraged in the aggregation step (Eq. 14b) have larger receptive fields than most traditional GNNs. However, scattering does not result in oversmoothing because the subtraction results in band-pass filters rather than low-pass filters. In this manner, the scattering transform addresses the challenges discussed in the previous subsection.
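The scattering cascade of Eq. 11 and the propagation rule of Eq. 13 can then be sketched as follows, reusing the hypothetical `wavelet_filter_bank` helper from the sketch above; as in our experiments, the absolute value is used as the nonlinearity.

```python
import numpy as np

def scattering_features(psis, X, path, sigma=np.abs):
    """U_p X for a path p = (k_1, ..., k_m) (Eq. 11): alternate wavelet
    filtering and the pointwise nonlinearity, omitting the outermost sigma."""
    out = psis[path[0]] @ X
    for k in path[1:]:
        out = psis[k] @ sigma(out)
    return out

def scattering_layer(psis, X, Theta, path, sigma=np.abs):
    """Band-pass propagation rule of Eq. 13: X^l = sigma(U_p(X^{l-1} Theta^l))."""
    return sigma(scattering_features(psis, X @ Theta, path))
```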

4 Hybrid Scattering Networks

Here, we introduce two hybrid networks that combine aspects of GCN and the geometric scattering transform discussed in the previous section. Our networks will use both low-pass and band-pass filters in different channels to capture different types of information. As a result, our hybrid networks will have greater expressive power than either traditional GCNs, which only use low-pass filters, or a pure scattering network, which only uses band-pass filters.

In our low-pass channels, we use modified GCN filters, which are similar to those used in Eq. 5 but use higher powers of $\mathbf{A}$ and include bias terms. Specifically, we use a channel update rule of the form

\[
\mathbf{X}_{\text{low},i}^{\ell} \coloneqq \sigma\left(\mathbf{A}^{r_i}\mathbf{X}^{\ell-1}\boldsymbol{\Theta}_{\text{low},i}^{\ell} + \mathbf{B}_{\text{low},i}^{\ell}\right) \tag{15}
\]

for $1\leq i\leq C_{\text{low}}$. We note, in particular, that the use of higher powers of $\mathbf{A}$ enables a wider receptive field (of radius $r_i$) without increasing the number of trainable parameters (unlike in GCN).

Similarly, in our band-pass channels, we use a version of Eq. 13, with an added bias term, and our update rule is

\[
\mathbf{X}_{\text{band},i}^{\ell} \coloneqq \sigma\left(\mathbf{U}_{p_i}\big(\mathbf{X}^{\ell-1}\boldsymbol{\Theta}_{\text{band},i}^{\ell}\big) + \mathbf{B}_{\text{band},i}^{\ell}\right) \tag{16}
\]

for $1\leq i\leq C_{\text{band}}$. Here, similarly to Eq. 11, $p_i$ is a path that determines the cascade of wavelets used in the $i$-th channel.

Aggregation module. Each hybrid layer uses $C_{\text{low}}$ low-pass and $C_{\text{band}}$ band-pass channels, described in Eq. 15 and Eq. 16, to transform the node features $\mathbf{X}^{\ell-1}$. The resulting representations $\{\mathbf{X}_{\text{low},i}^{\ell}\}_{i=1}^{C_{\text{low}}}$ and $\{\mathbf{X}_{\text{band},i}^{\ell}\}_{i=1}^{C_{\text{band}}}$ are aggregated into a new $d_{\ell}$-dimensional representation $\mathbf{X}^{\ell}$ via an aggregation module, such as those discussed in Sections 4.1 and 4.2:

\[
\mathbf{X}^{\ell} \coloneqq \operatorname{AGG}\left(\{\mathbf{X}_{\text{low},i}^{\ell}\}_{i=1}^{C_{\text{low}}},\, \{\mathbf{X}_{\text{band},i}^{\ell}\}_{i=1}^{C_{\text{band}}}\right). \tag{17}
\]

Graph Residual Convolution. After aggregating the outputs of the low-pass and band-pass channels in Eq. 17, we apply the graph residual convolution, which acts as a low-pass filter and aims to eliminate any high-frequency noise introduced by the scattering channels. This noise can arise, e.g., if there are different label rates in different graph substructures. In this case, the scattering features may learn the difference between labeled and unlabeled nodes and thereby produce high-frequency noise.

This filter uses a modified diffusion matrix given by

\[
\mathbf{A}_{\text{res}}(\alpha) = \frac{1}{\alpha+1}\left(\mathbf{I}_n + \alpha\,\mathbf{W}\mathbf{D}^{-1}\right),
\]

where the hyperparameter $\alpha$ determines the magnitude of the low-pass filtering. Choosing $\alpha=0$ yields the identity (no filtering), while $\alpha\rightarrow\infty$ results in the random walk matrix $\mathbf{R}\coloneqq\mathbf{W}\mathbf{D}^{-1}$. Thus, $\mathbf{A}_{\text{res}}$ can be interpreted as lying between a completely lazy random walk that never moves and a non-resting walk $\mathbf{R}$ that moves at every time step.

The full residual convolution update rule is given by

\[
\mathbf{X}^{\ell+1} \coloneqq \mathbf{A}_{\text{res}}(\alpha)\,\mathbf{X}^{\ell}\,\boldsymbol{\Theta}_{\text{res}} + \mathbf{B}_{\text{res}}.
\]

The multiplication with $\boldsymbol{\Theta}_{\text{res}}$ corresponds to a fully connected layer applied to the output of Eq. 17 (after filtering by $\mathbf{A}_{\text{res}}$), with each neuron learning a linear combination of the signals output by the aggregation module.
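A minimal sketch of the graph residual convolution (dense NumPy; the function name and argument layout are our own) is given below.

```python
import numpy as np

def graph_residual_conv(W, X, Theta_res, B_res, alpha):
    """Graph residual convolution: X^{l+1} = A_res(alpha) X^l Theta_res + B_res,
    with A_res(alpha) = (I + alpha W D^{-1}) / (alpha + 1)."""
    n = W.shape[0]
    R = W @ np.diag(1.0 / W.sum(axis=0))            # random walk matrix W D^{-1}
    A_res = (np.eye(n) + alpha * R) / (alpha + 1.0)
    return A_res @ X @ Theta_res + B_res
```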

4.1 Scattering GCN

The Scattering GCN (Sc-GCN) network was first introduced in [33]. Here, the aggregation module concatenates the filter responses horizontally, yielding wide node representations $\mathbf{X}^{\ell}$ of the form

\[
\left[\mathbf{X}_{\text{low},1}^{\ell} \,\|\, \cdots \,\|\, \mathbf{X}_{\text{low},C_{\text{low}}}^{\ell} \,\|\, \mathbf{X}_{\text{band},1}^{\ell} \,\|\, \cdots \,\|\, \mathbf{X}_{\text{band},C_{\text{band}}}^{\ell}\right]. \tag{18}
\]

Sc-GCN then learns relationships between the channels via the graph residual convolution.

Figure 2: (a) & (b) comparison between GCN and our Sc-GCN: we add band-pass channels to collect different frequency components; (c) graph residual convolution layer; (d) Sc-GCN combines five network channels, followed by a graph residual convolution.

Configuration. The primary goal of Sc-GCN is to alleviate oversmoothing in popular semi-supervised node-level tasks on, e.g., citation networks. As regularity patterns in such datasets are often dominated by low-frequency information, such as inter-cluster node similarities, we choose our parameters to focus on low-frequency information. We use three low-pass filters, with receptive fields of size (or radius) $r_k=1,2,3$, and two band-pass filters. We use $\sigma(\cdot)\coloneqq|\cdot|$ as our nonlinearity in all steps except the outermost one. Inspired by the aggregation step in classical geometric scattering [12], for the outermost nonlinearity we additionally apply the $q$-th power at the node level, i.e., $\sigma(\cdot)\coloneqq|\cdot|^q$.

The paths $p$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.

4.2 Geometric Scattering Attention Network

An important observation about Sc-GCN is that the model decides globally how to combine information from the different channels. The network first concatenates all of the features from the low-pass and band-pass channels in Eq. 18 and then combines these features via multiplication with the weight matrix $\boldsymbol{\Theta}_{\text{res}}$. However, for complex tasks or datasets, important regularity patterns may vary significantly across different graph regions. In such settings, a model should ideally attend locally over the aggregation and adaptively combine filter responses at different nodes.

This observation inspired the design of the Geometric Scattering Attention Network (GSAN) [34]. Drawing inspiration from  [35], GSAN uses an aggregation module based on a multi-head node-level attention framework. However, the attention mechanism used in GSAN differs from [35] by attending over the combination of the different filter responses rather than over the combination of node features from neighboring nodes. We will first focus on the processing performed independently by each attention head and then discuss a multi-head configuration.

Figure 3: Illustration of the proposed scattering attention layer.

In a slight deviation from Sc-GCN, the weight matrices in Eq. 15 and Eq. 16 are shared across the filters of each attention head. Therefore, both aggregation steps (Eq. 6b and Eq. 14b) take the same input, denoted by $\bar{\mathbf{X}}^{\ell}\coloneqq\mathbf{X}^{\ell-1}\boldsymbol{\Theta}^{\ell}$. Next, we define $\bar{\mathbf{X}}_{\text{low},i}^{\ell}\coloneqq\mathbf{A}^{r_i}\bar{\mathbf{X}}^{\ell}$ and $\bar{\mathbf{X}}_{\text{band},i}^{\ell}\coloneqq|\mathbf{U}_{p_i}(\bar{\mathbf{X}}^{\ell})|$ to be the outputs of the aggregation steps Eq. 6b and Eq. 14b, with the bias terms set to zero. We then compute attention score vectors $\mathbf{e}_{\text{low},i}^{\ell},\mathbf{e}_{\text{band},i}^{\ell}\in\mathbb{R}^n$ that will be used to determine the importance of each filter response at every single node. We calculate

\[
\mathbf{e}_{\text{low},i}^{\ell} = \operatorname{LeakyReLU}\left(\left[\bar{\mathbf{X}}^{\ell}\,\|\,\bar{\mathbf{X}}_{\text{low},i}^{\ell}\right]\mathbf{a}\right),
\]

with $\mathbf{e}_{\text{band},i}^{\ell}$ defined analogously, and with $\mathbf{a}\in\mathbb{R}^{2d_{\ell}}$ being a learned attention vector that is shared across all filters of the attention head. These attention scores are then normalized across all filters using the softmax function. Specifically, we define

\[
\boldsymbol{\alpha}_{\text{low},j}^{\ell} \coloneqq \frac{\exp(\mathbf{e}_{\text{low},j}^{\ell})}{\sum_{i=1}^{C_{\text{low}}}\exp(\mathbf{e}_{\text{low},i}^{\ell}) + \sum_{i=1}^{C_{\text{band}}}\exp(\mathbf{e}_{\text{band},i}^{\ell})},
\]

where the exponential function is applied elementwise, and define $\boldsymbol{\alpha}_{\text{band},j}^{\ell}$ analogously. Finally, for every node, the filter responses are summed together, weighted by the corresponding (node-to-filter) attention scores. We also normalize by the number of filters $C\coloneqq C_{\text{low}}+C_{\text{band}}$, which gives

\[
\mathbf{X}^{\ell} = C^{-1}\,\tilde{\sigma}\left(\sum_{j=1}^{C_{\text{low}}}\boldsymbol{\alpha}_{\text{low},j}^{\ell}\odot\bar{\mathbf{X}}_{\text{low},j}^{\ell} + \sum_{j=1}^{C_{\text{band}}}\boldsymbol{\alpha}_{\text{band},j}^{\ell}\odot\bar{\mathbf{X}}_{\text{band},j}^{\ell}\right),
\]

where $\boldsymbol{\alpha}\odot\mathbf{X}\coloneqq\operatorname{diag}(\boldsymbol{\alpha})\mathbf{X}$ denotes the Hadamard (elementwise) product of $\boldsymbol{\alpha}$ with each column of $\mathbf{X}$. We further use $\tilde{\sigma}(\cdot)=\operatorname{ReLU}(\cdot)$ in the equation above.
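The following sketch summarizes one GSAN attention head (dense NumPy; it reuses the hypothetical `scattering_features` helper from Sec. 3.6, and the function name and argument layout are our own), computing the shared transform, the low-pass and band-pass filter responses, and the node-wise softmax over filters.

```python
import numpy as np

def gsan_attention_head(A, psis, X, Theta, a, radii, paths):
    """One attention head: shared transform X_bar = X Theta, low-pass responses
    A^{r_i} X_bar, band-pass responses |U_{p_i}(X_bar)|, then a per-node softmax
    over the C = C_low + C_band filter responses."""
    X_bar = X @ Theta
    responses = [np.linalg.matrix_power(A, r) @ X_bar for r in radii]
    responses += [np.abs(scattering_features(psis, X_bar, p)) for p in paths]
    # Node-wise attention scores e_i = LeakyReLU([X_bar || response_i] a).
    scores = []
    for resp in responses:
        z = np.concatenate([X_bar, resp], axis=1) @ a
        scores.append(np.where(z > 0, z, 0.2 * z))
    S = np.stack(scores, axis=1)                        # shape (n, C)
    att = np.exp(S - S.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)          # softmax across the C filters
    C = len(responses)
    out = sum(att[:, i:i + 1] * responses[i] for i in range(C))
    return np.maximum(out, 0.0) / C                     # ReLU, then normalize by C
```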

Multi-head Attention. Similar to other applications of attention mechanisms [35], we use multi-head attention to stabilize the training, thus yielding as output of the aggregation module (by a slight abuse of notation)

\[
\mathbf{X}^{\ell} = \big\|_{\gamma=1}^{\Gamma}\,\mathbf{X}_{\gamma}^{\ell}\left(\boldsymbol{\Theta}_{\gamma}^{\ell},\,\boldsymbol{\alpha}_{\gamma}^{\ell}\right), \tag{19}
\]

concatenating $\Gamma$ attention heads. As explained above, each attention head has individually trained parameters $\boldsymbol{\Theta}_{\gamma}^{\ell}$ and $\boldsymbol{\alpha}_{\gamma}^{\ell}$ and outputs a filter response $\mathbf{X}_{\gamma}^{\ell}$. Similar to Sc-GCN, this concatenation results in wide node representations that are further refined by the graph residual convolution.

Configuration. For GSAN, we set $C_{\text{low}}=C_{\text{band}}=3$, giving the model balanced access to both low-pass and band-pass information. The aggregation process of the attention layer is shown in Fig. 3, where $\mathbf{U}_1,\mathbf{U}_2,\mathbf{U}_3$ represent three first-order scattering transformations with $\mathbf{U}_1\mathbf{x}\coloneqq\boldsymbol{\Psi}_1\mathbf{x}$, $\mathbf{U}_2\mathbf{x}\coloneqq\boldsymbol{\Psi}_2\mathbf{x}$ and $\mathbf{U}_3\mathbf{x}\coloneqq\boldsymbol{\Psi}_3\mathbf{x}$. The number of attention heads $\Gamma$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.

5 Theory of Hybrid Scattering Networks

In this section, we compare the descriptive power of the scattering transform to a class of GNNs defined below, which we refer to as aggregate-combine GNNs (AC-GNNs). We note that in practice, AC-GNNs are typically implemented with relatively narrow receptive fields in order to avoid oversmoothing. However, this leads to the underreaching problem, as discussed in Sections 1 and 3.5. The scattering transform, on the other hand, is able to utilize a large receptive field while avoiding oversmoothing. Therefore, it is able to simultaneously overcome both the oversmoothing and underreaching problems.

The family of AC-GNNs includes many popular GNN architectures such as GCN [5], GIN [36] and GraphSAGE [4]. However, importantly, we note that it does not include the scattering transform (because AC-GNNs only use local – one-step neighborhood – aggregations). We also note that this definition of an AC-GNN is quite similar to analogous notions used in, e.g., [32, 36].

Definition 1 (Aggregate-Combine GNN).

An aggregate-combine GNN (AC-GNN) is a network in which the update rule for each layer is defined at the node level according to

\[
F_{\text{ac}}(\mathbf{X})_v \coloneqq \sigma\left(\operatorname{com}\left(f(\mathbf{X}_v, d_v),\ \operatorname{agg}_{w\in\mathcal{N}_v} g(\mathbf{X}_w, d_v, d_w)\right)\right), \tag{20}
\]

for arbitrary functions $f:\mathbb{R}^{1\times d}\times\mathbb{N}\rightarrow\mathbb{R}^{1\times d'}$, $g:\mathbb{R}^{1\times d}\times\mathbb{N}^2\rightarrow\mathbb{R}^{1\times d'}$, and $\operatorname{com}:\mathbb{R}^{1\times d'}\times\mathbb{R}^{1\times d'}\rightarrow\mathbb{R}^{1\times d'}$, with $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ an (optional) activation function applied elementwise, and $\operatorname{agg}$ any function mapping a multi-set of vectors in $\mathbb{R}^{1\times d'}$ (i.e., $\{\!\!\{\mathbf{a}_i\in\mathbb{R}^{1\times d'}\}\!\!\}_{i\in\mathcal{I}}$) to a vector in $\mathbb{R}^{1\times d'}$.

As discussed in Section 3.3, the aggregations used in many popular AC-GNNs effectively act as localized averaging operators and focus primarily on low-frequency information. Therefore, deep AC-GNNs have the undesirable property of oversmoothing the input features if too many layers are used. As a result, most AC-GNNs typically only use two or three layers in order to avoid the oversmoothing problem. To understand this at an intuitive level, we follow the lead of [37] and consider a simplified version of GCN in which $\sigma$ is the identity, the weight matrix is given by $\boldsymbol{\Theta}=\frac{1}{2}\mathbf{I}_n$, and the renormalization trick is not applied. In this case, the filter used in GCN is given by $\frac{1}{2}\left(\mathbf{I}_n+\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)$, and so we have $\hat{\mathbf{g}}[i]=1-\lambda_i/2$ in Eq. 1. This implies that if $\mathbf{x}$ is any vector which is orthogonal to the lead eigenvector, the representation of $\mathbf{x}$ after $\ell$ layers satisfies

\[
\left\|\left(\frac{1}{2}\left(\mathbf{I}_n + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)\right)^{\ell}\mathbf{x}\right\|_2 \leq \left(1-\frac{\lambda_2}{2}\right)^{\ell}\|\mathbf{x}\|_2.
\]

Therefore, the output of a deep (simplified) GCN will converge exponentially fast to the projection of $\mathbf{x}$ onto the bottom eigenspace, which is why GCNs with many layers typically achieve poor performance.
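The decay bound above is easy to verify numerically. The short sketch below (a toy ring graph of our own choosing) applies the simplified GCN filter repeatedly to a signal orthogonal to the lead eigenvector and compares its norm to $(1-\lambda_2/2)^\ell$.

```python
import numpy as np

# Toy demonstration of the exponential decay bound: repeatedly applying the
# simplified GCN filter (I + D^{-1/2} W D^{-1/2}) / 2 to a unit-norm signal
# orthogonal to the lead eigenvector shrinks its norm like (1 - lambda_2/2)^l.
n = 8
W = np.zeros((n, n))
for i in range(n):                          # unweighted ring graph (toy assumption)
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
M = np.eye(n) - 0.5 * L_sym                 # simplified GCN filter, eigenvalues 1 - lambda_i/2

lam, Q = np.linalg.eigh(L_sym)
x = np.random.randn(n)
x -= (x @ Q[:, 0]) * Q[:, 0]                # remove the component along the lead eigenvector
x /= np.linalg.norm(x)

for ell in [1, 5, 10, 20]:
    norm = np.linalg.norm(np.linalg.matrix_power(M, ell) @ x)
    bound = (1.0 - lam[1] / 2.0) ** ell
    print(ell, norm, bound)                 # norm stays below the bound
```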

By contrast, the use of band-pass filters allows the network to incorporate a larger receptive field without essentially projecting the input features onto the bottom eigenspace. Consider, for example, the following result from [22].

Proposition 1 (Proposition 2.2 of [22]).

The wavelet filter bank $\mathcal{W}=\{\boldsymbol{\Psi}_k,\boldsymbol{\Phi}_K\}_{k=0}^{K}$ introduced in Eq. 10 is a non-expansive frame with respect to the weighted norm defined by $\|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}}\coloneqq\sum_{i=1}^{n}|\mathbf{x}[i]|^2 d_{v_i}^{-1}$, whose lower frame bound is a universal constant independent of $K$ and the graph topology. That is, there exists a universal constant $c_1>0$ such that

\[
c_1\|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} \;\leq\; \sum_{k=0}^{K}\|\boldsymbol{\Psi}_k\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} + \|\boldsymbol{\Phi}_K\mathbf{x}\|^2_{\mathbf{D}^{-1/2}} \;\leq\; \|\mathbf{x}\|^2_{\mathbf{D}^{-1/2}}.
\]

The fact that the lower bound $c_1$ does not depend on $K$ is important because the receptive field of an $L$-layer geometric scattering transform is $2^K L$. Therefore, if we choose $K$ to be sufficiently large, the scattering transform as formulated in [22] is able to incorporate long-range interactions in the network while still preserving a substantial portion of the input signal energy. Our construction differs slightly from [22] in that our wavelet filter bank includes only the $\boldsymbol{\Psi}_k$, but not $\boldsymbol{\Phi}_K$. However, one may derive a result similar to Proposition 1, where the lower bound $c_1$ depends on $\lambda_2$ (but still does not depend on $K$). We refer the reader to the proof of Proposition 4.1 of [38] for details. (While [38] does not consider weighted norms, this technicality can be readily handled by imitating the proof of Proposition 2.2 of [22].) Unlike the scattering transform, most AC-GNNs use networks with small receptive fields in order to avoid oversmoothing. While this does help avoid the problem of oversmoothing, it unfortunately creates the problem of underreaching. In the remainder of this section, we will focus on demonstrating that this underreaching diminishes the power of the network to discriminate between different nodes in certain situations.

We will now begin our analysis of node discriminability, i.e., the ability of a network to produce different representations of two nodes $v$ and $w$ in its hidden layers, in the context of the underreaching problem. (Formal definitions of the concepts used below are provided in Appendix A.1.) We will let $\mathcal{N}_{\underline{v}}^{K}$ denote the $K$-step node neighborhood of a node $v$ (including $v$ itself), and for $V_S\subset V$ we will let $G(V_S)$ denote the corresponding induced subgraph. We will say two induced subgraphs $G(V_S)$ and $G(V_{S'})$ are isomorphic if they have identical geometric structure and write $G(V_S)\cong^{\phi}G(V_{S'})$ to indicate that $\phi$ is an isometry (geometry-preserving map) between them. In the definition below, we introduce a class of graph-intrinsic node features that encode graph topology information. We will use these features as the input for GNN models in order to produce geometrically informed node representations.

Definition 2 (Intrinsic Node Features).

A nonzero node feature matrix $\mathbf{X}$ is $K$-intrinsic if for any $v,w\in V$ such that $G\big(\mathcal{N}_{\underline{v}}^{K}\big)$ is isomorphic to $G\big(\mathcal{N}_{\underline{w}}^{K}\big)$, we have $\mathbf{X}[v]=\mathbf{X}[w]$.

These intrinsic node features encode important $K$-local geometric information in the sense that if $\mathbf{X}[v]\neq\mathbf{X}[w]$, then the $K$-step neighborhoods of $v$ and $w$ must have different geometric structure. To further understand this definition, we observe that the degree vector, or any (elementwise) function of it, is a one-intrinsic node feature matrix. Setting $\mathbf{X}[v]$ to the average node degree of the nodes in $\mathcal{N}_{\underline{v}}^{K-1}$ yields $K$-intrinsic node features. As a slightly more complicated example, features with $\mathbf{X}[v]$ equal to the number of triangles contained in $\mathcal{N}_{\underline{v}}^{K}$ are also $K$-intrinsic. Fig. 4 illustrates how such node features can help distinguish nodes from different graph structures.

Figure 4 (three panels, (a)–(c)): Intrinsic node features for two graphs (top/bottom). We compare nodes from the isomorphic horizontal path that is part of both graphs. Color coding indicates different node feature values and differs in (c). Node features are one-intrinsic (degrees) in (a) and (b), and two-intrinsic (average degree in one-step neighborhoods) in (c). The gray area represents the subgraph used for the feature assignment at the indicated node.

As a concrete example, consider the task of predicting traffic on a network of Wikipedia articles, such as the Chameleon dataset [39], with nodes and edges corresponding to articles and hyperlinks. Here, intrinsic node features provide insights into the topological context of each node. A high degree suggests a widely cited article, which is likely to have a lot of traffic, while counting kk-cliques or considering average degrees from nearby nodes can shed light on the density of the local graph region.
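To make these constructions concrete, the following minimal sketch computes the intrinsic features discussed above with networkx; the graph, the choice of K, and the function names are illustrative and are not part of our implementation. It covers the degree (one-intrinsic), the average degree over the (K−1)-step neighborhood (K-intrinsic), and the number of triangles inside the induced K-step neighborhood (K-intrinsic).

```python
import networkx as nx
import numpy as np

def k_step_neighborhood(G, v, K):
    # Nodes within K steps of v, including v itself (the "underlined" neighborhood).
    return set(nx.single_source_shortest_path_length(G, v, cutoff=K))

def one_intrinsic_degree(G):
    # Degrees only depend on the one-step neighborhood, hence are one-intrinsic.
    return {v: G.degree(v) for v in G}

def k_intrinsic_avg_degree(G, K):
    # Average degree over the (K-1)-step neighborhood yields a K-intrinsic feature.
    return {v: np.mean([G.degree(u) for u in k_step_neighborhood(G, v, K - 1)]) for v in G}

def k_intrinsic_triangles(G, K):
    # Number of triangles in the induced K-step neighborhood is also K-intrinsic.
    return {v: sum(nx.triangles(G.subgraph(k_step_neighborhood(G, v, K))).values()) // 3
            for v in G}

# Toy usage on a small synthetic graph (illustrative only).
G = nx.barbell_graph(4, 3)
X1, X2, X3 = one_intrinsic_degree(G), k_intrinsic_avg_degree(G, 2), k_intrinsic_triangles(G, 2)
```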

The following theorem characterizes situations where AC-GNNs cannot discriminate nodes based on the information represented in intrinsic node features. It can be seen as a formal formulation of the underreaching problem [32]. We note that an analogous result can also be established for graph attention networks, e.g., GAT [35] and GATv2 [40].

Theorem 1.

Let ϕ:VV\phi:V\rightarrow V, vVv\in V, and K,LK,L\in\mathbb{N} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}. Let 𝐗0\operatorname{\boldsymbol{X}}^{0} be any KK-intrinsic node feature matrix and let 𝐗L\operatorname{\boldsymbol{X}}^{L} be the output of an LL-layer AC-GNN, i.e., FacLFac1F_{\text{ac}}^{L}\circ\dots\circ F_{\text{ac}}^{1}, with each layer as in Eq. 20. Then, 𝐗L[v]=𝐗L[ϕ(v)]\operatorname{\boldsymbol{X}}^{L}[v]=\operatorname{\boldsymbol{X}}^{L}[\phi(v)], and so vv and ϕ(v)\phi(v) cannot be discriminated based on the filtered node features 𝐗L\operatorname{\boldsymbol{X}}^{L} from an LL-layer AC-GNN.

Theorem 1 is proved by using an induction argument to show that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L-\ell} and all 0L0\leq\ell\leq L. Full details can be found in Appendix A.2.

Next, we introduce a notion of structural difference that allows us to analyze situations where networks with wide receptive fields such as scattering networks have strictly greater discriminatory power than networks with smaller receptive fields.

Definition 3 (Structural Difference).

For two ϕ\phi-isomorphic induced subgraphs G(VS)ϕG(VS)G(V_{S})\cong^{\phi}G(V_{S^{\prime}}), a structural difference relative to ϕ\phi and a node feature matrix 𝐗\operatorname{\boldsymbol{X}} is manifested at uVSu\in V_{S} if 𝐗[u]𝐗[ϕ(u)]\operatorname{\boldsymbol{X}}[u]\neq\operatorname{\boldsymbol{X}}[\phi(u)] or dudϕ(u)d_{u}\neq d_{\phi(u)}. Note that dudϕ(u)d_{u}\neq d_{\phi(u)} is only relevant for nodes in the boundary S\partial S, because dw=dϕ(w)d_{w}=d_{\phi(w)} for all interior nodes wint(S)w\in\operatorname{int}(S), as the degree is one-intrinsic. For uSu\in\partial S, we further assume dϕ(u)𝐗[u]du𝐗[ϕ(u)]d_{\phi(u)}\operatorname{\boldsymbol{X}}[u]\neq d_{u}\operatorname{\boldsymbol{X}}[\phi(u)].

Notably, if the KK-step neighborhoods of two nodes vv and ϕ(v)\phi(v) are ϕ\phi-isomorphic and 𝐗\operatorname{\boldsymbol{X}} is any KK-intrinsic node feature matrix, then no structural difference relative to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} can be manifested at vv. Theorem 2 stated below shows that a scattering network will be able to produce different representations of two nodes vv and ϕ(v)\phi(v) if there are structural differences in their surrounding neighborhoods, assuming (among a few other mild assumptions) that a certain pathological case is avoided. The following notation and definition characterize this pathological situation that arises, roughly speaking, when two sums iai\sum_{i}a_{i} and ibi\sum_{i}b_{i} coincide, even though aia_{i} and bib_{i} do not coincide. We note that although this condition seems complicated, it is highly unlikely to occur in practice.

Notation 1.

Let ϕ:VV\phi:V\rightarrow V, vVv\in V, DD\in\mathbb{N}, such that G(𝒩v¯D)ϕG(𝒩ϕ(v)¯D)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{D}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{D}\right). Assume that there exists at least one node in 𝒩v¯D\operatorname{\mathcal{N}}_{\underline{v}}^{D} where a structural difference is manifested rel. to ϕ\phi and node features 𝐗\operatorname{\boldsymbol{X}}. We define

Vdiff{u𝒩v¯D: struc. difference rel. to ϕ and 𝐗 at u}.V_{\text{diff}}\coloneqq\Big{\{}u\in\operatorname{\mathcal{N}}_{\underline{v}}^{D}:\text{ struc. difference rel. to $\phi$ and $\operatorname{\boldsymbol{X}}$ at }u\Big{\}}.

We also fix the following notation:

  1. (i)

    Let dmin{d(u,v):uVdiff}d\coloneqq\min\{d(u,v):u\in V_{\text{diff}}\} and define the node set Vdiffd{uVdiff:d(u,v)=d}V_{\text{diff}}^{d}\coloneqq\{u\in V_{\text{diff}}:d(u,v)=d\}. Note that these are exactly the nodes in 𝒩vd\operatorname{\mathcal{N}}_{v}^{d} where a structural difference is manifested relative to ϕ\phi and 𝐗0\operatorname{\boldsymbol{X}}^{0}.

  2. (ii)

    Let τ\tau\in\mathbb{N} be the number of paths of minimal length between vv and nodes in VdiffdV_{\text{diff}}^{d}, and denote these by p(i)(u0i,,udi=v)p^{(i)}\coloneqq\left(u_{0}^{i},\dots,u_{d}^{i}=v\right) with u0iVdiffdu_{0}^{i}\in V_{\text{diff}}^{d}, 1iτ1\leq i\leq\tau.

  3. (iii)

    Let Uji=1τ{uji}U_{j}\coloneqq\cup_{i=1}^{\tau}\{u_{j}^{i}\}, and define the generalized path from VdiffdV_{\text{diff}}^{d} to vv as P(U0,U1,,Ud)P\coloneqq(U_{0},U_{1},\dots,U_{d}).

  4. (iv)

    For uU0u\in U_{0}, define δu0𝐗0[u]𝐗0[ϕ(u)]\delta_{u}^{0}\coloneqq\operatorname{\boldsymbol{X}}^{0}[u]-\operatorname{\boldsymbol{X}}^{0}[\phi(u)].
    For uUju\in U_{j}, 1jd1\leq j\leq d and 𝐘j𝐏j𝐗0\operatorname{\boldsymbol{Y}}^{j}\coloneqq\operatorname{\boldsymbol{P}}^{j}\operatorname{\boldsymbol{X}}^{0}, define

    δuj12w𝒩uUj1dw1𝐘wj1dϕ(w)1𝐘ϕ(w)j1\textstyle\delta_{u}^{j}\coloneqq\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j-1}_{w}-d_{\phi(w)}^{-1}\operatorname{\boldsymbol{Y}}^{j-1}_{\phi(w)}

    and set

Δj{uUj:w𝒩uUj1such thatδwj10}.\Delta_{j}\coloneqq\{u\in U_{j}:\,\exists\,w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}\,\text{such that}\,\delta_{w}^{j-1}\neq 0\}.
Definition 4 (No Coincidental Correspondence).

Let 𝒩v¯d\operatorname{\mathcal{N}}_{\underline{v}}^{d} and Δj\Delta_{j} be as in Notation 1. We say that int(𝒩v¯d)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right) exhibits no coincidental correspondence (no-cc) if each non-empty Δj\Delta_{j}, 1jd1\leq j\leq d, contains at least one uu such that δuj0\delta_{u}^{j}\neq 0.

Theorem 2.

As in Theorem 1, let ϕ:VV\phi:V\rightarrow V, vVv\in V, and K,LK,L\in\operatorname{\mathbb{N}} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}, and consider any K-intrinsic node feature matrix 𝐗\operatorname{\boldsymbol{X}}. Further assume that there exists at least one structural difference rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} in 𝒩v¯K+L\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}, and let 1dK+L1\leq d\leq K+L be the smallest positive integer such that a structural difference is manifested in 𝒩v¯d\operatorname{\mathcal{N}}_{\underline{v}}^{d}. If the nonlinearity σ\sigma is strictly monotonic and int(𝒩v¯d)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right) exhibits no coincidental correspondence rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}}, then one can define a scattering configuration (p,𝚯)(p,\operatorname{\boldsymbol{\Theta}}) such that scattering features Fp-scat(𝐗)F_{p\text{-scat}}(\operatorname{\boldsymbol{X}}) defined as in Eq. 13 discriminate vv and ϕ(v)\phi(v).

Theorem 2 provides a large class of situations where scattering filters are able to discriminate between two nodes even if networks with smaller receptive fields are not. In particular, even if the K+LK+L-step neighborhoods of vv and ϕ(v)\phi(v) are isomorphic, it is possible for there to exist a node uu in this K+LK+L-step neighborhood such that a KK-intrinsic node feature takes different values at uu and ϕ(u)\phi(u). In this case, Theorem 2 shows that scattering will produce differing representations of vv and ϕ(v)\phi(v), while Theorem 1 shows that an LL-layer AC-GNN such as GCN will not. In the remarks below, we discuss minor variations of Theorem 2. In Appendix A.5, we also present a modified version of Theorem 2, where the assumption of no coincidental correspondence is replaced by an assumption on the (generalized) path between the structural differences and the investigated node.
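To illustrate this gap concretely, the sketch below contrasts a toy two-layer aggregate-combine update (mean over neighbors, identity combine, no learned weights) with a single scattering channel on the centers of two paths of different lengths. It assumes the lazy random walk matrix P = ½(I + AD⁻¹) (cf. Eq. 8) and the wavelet Ψ₂ = P² − P⁴, and uses squared degrees (an elementwise function of the degree, hence one-intrinsic) as input; the graphs and the toy layer are purely illustrative and are not the models used in our experiments.

```python
import numpy as np
import networkx as nx

def lazy_walk(G):
    # P = 1/2 (I + A D^{-1}), cf. Eq. 8.
    A = nx.to_numpy_array(G)
    return 0.5 * (np.eye(len(G)) + A @ np.diag(1.0 / A.sum(axis=0)))

def toy_ac_gnn(G, X, L=2):
    # Toy aggregate-combine layers: mean over neighbors, averaged with the node itself.
    A = nx.to_numpy_array(G)
    mean_agg = A / A.sum(axis=1, keepdims=True)
    for _ in range(L):
        X = 0.5 * (X + mean_agg @ X)
    return X

def wavelet(P, k):
    # Psi_k = P^{2^{k-1}} - P^{2^k}.
    return np.linalg.matrix_power(P, 2 ** (k - 1)) - np.linalg.matrix_power(P, 2 ** k)

# Two paths whose centers have isomorphic 3-step (K + L = 1 + 2) neighborhoods;
# only the shorter path has its degree-1 endpoints within 4 steps of the center.
G1, G2, v1, v2 = nx.path_graph(9), nx.path_graph(13), 4, 6
# Squared degrees: one-intrinsic input features. (Raw degrees satisfy P d = d,
# so the wavelet response to them would vanish identically; squaring avoids this.)
X1 = np.array([[G1.degree(v) ** 2] for v in G1], dtype=float)
X2 = np.array([[G2.degree(v) ** 2] for v in G2], dtype=float)

print(toy_ac_gnn(G1, X1)[v1], toy_ac_gnn(G2, X2)[v2])   # identical outputs
print((wavelet(lazy_walk(G1), 2) @ X1)[v1],
      (wavelet(lazy_walk(G2), 2) @ X2)[v2])              # differ (nonzero vs. zero)
```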

Remark 1.

The absolute value operation is not monotonic. Therefore, Theorem 2 cannot be directly applied in this case. However, the above result and all subsequent results can be extended to the case σ=||\sigma=|\cdot| as long as certain pathological cases are avoided. Namely, Theorem 2 will remain valid as long as the features at uu are assumed not to be the negatives of the features at ϕ(u)\phi(u), i.e., 𝐗[u]𝐗[ϕ(u)]\operatorname{\boldsymbol{X}}[u]\neq-\operatorname{\boldsymbol{X}}[\phi(u)] for nodes uint(𝒩v¯d)u\in\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{v}}^{d}\right), and similarly we must modify Definition 4 to avoid, for example, the case where w𝒩uUj1dw1𝐗j1[w]=w𝒩uUj1dϕ(w)1𝐗j1[ϕ(w)]\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{w}^{-1}\operatorname{\boldsymbol{X}}^{j-1}[w]=-\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap U_{j-1}}d_{\phi(w)}^{-1}\operatorname{\boldsymbol{X}}^{j-1}[\phi(w)]. We note that while this condition is complex, it is rarely violated in practice.

Remark 2.

A result analogous to Theorem 2 is also valid for any permutation of the three steps Eq. 14a-14c (even with added bias terms as in Eq. 16). In particular, it applies to the update rule 𝐗=σ(𝐔p(𝐗1)𝚯+𝐁).\operatorname{\boldsymbol{X}}^{\ell}=\sigma\left(\operatorname{\boldsymbol{U}}_{p}\big{(}\operatorname{\boldsymbol{X}}^{\ell-1}\big{)}\operatorname{\boldsymbol{\Theta}}^{\ell}+\operatorname{\boldsymbol{B}}^{\ell}\right).

We will sketch the proof of Theorem 2 here and provide full details in the appendix. The key to the proof will be applying Lemma 1 stated below.

Lemma 1.

Let ϕ:VV\phi:V\rightarrow V, v~V\tilde{v}\in V, and D~\tilde{D}\in\operatorname{\mathbb{N}} such that G(𝒩v¯~D~)ϕG(𝒩ϕ(v~)¯D~)G\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{D}}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(\tilde{v})}}^{\tilde{D}}\right). Assume there exists at least one node in 𝒩v¯~D~\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{D}} where a structural difference is manifested relative to ϕ\phi and to 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and let 1d~D~1\leq\tilde{d}\leq\tilde{D} be the smallest positive integer such that a structural difference is manifested in 𝒩v¯~d~\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}. Assume that int(𝒩v¯~d~)\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}\right) exhibits no coincidental correspondence relative to ϕ\phi and 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and let P~(U~0,U~1,,U~d~)\tilde{P}\coloneqq\left(\tilde{U}_{0},\tilde{U}_{1},\dots,\tilde{U}_{\tilde{d}}\right) be the generalized path as defined in Notation 1. Then, for every 0jd~0\leq j\leq\tilde{d}, no structural difference is manifested in 𝒩v¯~d~jU~j\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-j}\setminus\tilde{U}_{j}, relative to ϕ\phi and the filtered node features 𝐘j𝐏j𝐗~\operatorname{\boldsymbol{Y}}^{j}\coloneqq\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and there exists at least one node uU~ju\in\tilde{U}_{j} where a structural difference is manifested and δuj0\delta^{j}_{u}\neq 0.

Below, we provide a sketch of the main ideas of the proof of Lemma 1. A complete proof is provided in Appendix A.3.

Proof Sketch.

We use induction to show that at every step the matrix 𝐏\operatorname{\boldsymbol{P}} propagates the information about the structural difference one node layer closer towards v~\tilde{v}.

By Notation 1(i), we have U~0=Vdiffd~\widetilde{U}_{0}=V_{\text{diff}}^{\tilde{d}}. Therefore, the case j=0j=0 is immediate. In the inductive step, it suffices to show

𝐘j+1[u]𝐘j+1[ϕ(u)]\operatorname{\boldsymbol{Y}}^{j+1}[u]\neq\operatorname{\boldsymbol{Y}}^{j+1}[\phi(u)]

and δuj+10\delta_{u}^{j+1}\neq 0 for at least one uU~j+1u\in\tilde{U}_{j+1} and

𝐘j+1[w]=𝐘j+1[ϕ(w)]\operatorname{\boldsymbol{Y}}^{j+1}[w]=\operatorname{\boldsymbol{Y}}^{j+1}[\phi(w)]

for all w𝒩v¯~d~(j+1)U~j+1w\in\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1} under the assumption that these results are already valid for 𝐘j\operatorname{\boldsymbol{Y}}^{j}. This can be established by writing 𝐏j+1𝐗~=𝐏𝐏j𝐗~=𝐏𝐘j\operatorname{\boldsymbol{P}}^{j+1}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{Y}}^{j} and using the inductive hypothesis together with the definition of 𝐏\operatorname{\boldsymbol{P}}.∎

We now use Lemma 1 to sketch the proof of Theorem 2. The complete proof is provided in Appendix A.4.

Proof Sketch for Theorem 2.

We need to show that we can choose the parameters in a way that guarantees Fp-scat(𝐗)[v]Fp-scat(𝐗)[ϕ(v)]F_{p\text{-scat}}(\operatorname{\boldsymbol{X}})[v]\neq F_{p\text{-scat}}(\operatorname{\boldsymbol{X}})[\phi(v)]. For simplicity, we set 𝚯=𝐈n\operatorname{\boldsymbol{\Theta}}=\operatorname{\boldsymbol{I}}_{n}. In this case, since σ\sigma is strictly monotonic, and therefore injective, it suffices to show that we can construct pp with

𝐔p(𝐗)[v]𝐔p(𝐗)[ϕ(v)].\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[v]\neq\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[\phi(v)]. (21)

Using binary expansion, we may choose k1,,km0k_{1},\ldots,k_{m}\in\mathbb{N}_{0}, ki<ki+1k_{i}<k_{i+1}, such that d=2k1++2kmd=2^{k_{1}}+\ldots+2^{k_{m}} and set p(k1,,km)p\coloneqq(k_{1},\ldots,k_{m}). Given pp, we define truncated paths p:i(k1,,ki)p_{:i}\coloneqq(k_{1},\ldots,k_{i}) and let tij=1i2kjt_{i}\coloneqq\sum_{j=1}^{i}2^{k_{j}} for 1im.1\leq i\leq m. We also let p:0p_{:0} be the empty path of length 0 and set t0=0.t_{0}=0.

Recalling the generalized path P=(U0,,Ud)P=(U_{0},\ldots,U_{d}) defined in Notation 1, we will use induction to show that for at least one node in UtiU_{t_{i}} a structural difference is manifested, while no structural difference manifests in 𝒩v¯dtiUti\mathcal{N}_{\underline{v}}^{d-t_{i}}\setminus U_{t_{i}} rel. to ϕ\phi and 𝐙i𝐔p:i𝐗\operatorname{\boldsymbol{Z}}^{i}\coloneqq\operatorname{\boldsymbol{U}}_{p_{:i}}\operatorname{\boldsymbol{X}} for 0im0\leq i\leq m. Since tm=dt_{m}=d and p:m=pp_{:m}=p, this will imply Eq. 21 and thus prove the theorem. As with Lemma 1, the base case i=0i=0 follows from Notation 1(i). In the inductive step, it suffices to show that

𝐙i+1[u]𝐙i+1[ϕ(u)],\operatorname{\boldsymbol{Z}}^{i+1}[u]\neq\operatorname{\boldsymbol{Z}}^{i+1}[\phi(u)], (22)

for at least one uUti+1u\in U_{t_{i+1}} and

𝐙i+1[w]=𝐙i+1[ϕ(w)],\operatorname{\boldsymbol{Z}}^{i+1}[w]=\operatorname{\boldsymbol{Z}}^{i+1}[\phi(w)], (23)

for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}, under the assumption that these results are true for ii. Since 𝐙i+1=𝚿ki+1σ(𝐙i)\operatorname{\boldsymbol{Z}}^{i+1}=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\big{(}\operatorname{\boldsymbol{Z}}^{i}\big{)} and σ\sigma is increasing and therefore injective, it suffices to show that these results are preserved under multiplication by a wavelet matrix 𝚿ki+1=𝐏2ki+11𝐏2ki+1\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}-\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}. This can be verified using Lemma 1 (with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}) in place of 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}) and the inductive hypothesis. ∎
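As a small illustration of the path construction used in the proof sketch above (a hypothetical helper, not part of our released code), the dyadic scales can be read off from the binary expansion of d:

```python
def scattering_path(d):
    """Return p = (k_1, ..., k_m) with k_1 < ... < k_m and sum(2**k_i) == d."""
    return tuple(k for k in range(d.bit_length()) if (d >> k) & 1)

# Example: d = 11 = 2^0 + 2^1 + 2^3 gives p = (0, 1, 3), with truncated paths
# p_{:1} = (0,), p_{:2} = (0, 1), p_{:3} = (0, 1, 3) and partial sums t_i = 1, 3, 11,
# so that U_p (as in Eq. 13) applies Psi_0, then Psi_1, then Psi_3.
assert scattering_path(11) == (0, 1, 3)
```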

Remark 3.

In the proof of Theorem 2, for the sake of concreteness, we chose 𝚯=𝐈n\operatorname{\boldsymbol{\Theta}}=\operatorname{\boldsymbol{I}}_{n} and pp such that |p|=i=1m2ki=d|p|=\sum_{i=1}^{m}2^{k_{i}}=d. However, inspecting the proof, we see that our analysis may also be extended to paths with |p|d|p|\geq d and also that the choice of 𝚯\boldsymbol{\Theta} does not matter as long as 𝚯\operatorname{\boldsymbol{\Theta}} is invertible. In particular, if the entries of 𝚯\operatorname{\boldsymbol{\Theta}} are generated i.i.d. at random from a continuous probability distribution, the conclusions of the theorem will hold with probability one [41]. We also note that Theorems 1 and 2 may be extended to the case where the update rule features a bias term 𝐁\operatorname{\boldsymbol{B}} as in Eqs. 15 and 16, since 𝐗𝐗+𝐁\operatorname{\boldsymbol{X}}\rightarrow\operatorname{\boldsymbol{X}}+\operatorname{\boldsymbol{B}} is injective.

Results analogous to Theorem 2 can likely also be proven for deep AC-GNNs with dd subsequent layers. The proof would closely follow the idea of Lemma 1 and would likely be slightly simpler than the proof of Theorem 2. Indeed, inspecting the proof of Theorem 2, we see that part of the reason we need both Eq. 22 and Eq. 23 is that the definition of 𝚿k\operatorname{\boldsymbol{\Psi}}_{k} involves subtraction. Therefore, establishing Eq. 23 could be skipped for an AC-GNN. However, we would need considerably stronger assumptions when working with a dd-layer AC-GNN FacdFac1F_{\text{ac}}^{d}\circ\dots\circ F_{\text{ac}}^{1}. Apart from an assumption analogous to no coincidental correspondence, we would additionally need injectivity of all functions σi,comi,gi\sigma_{i},\operatorname{com}_{i},g_{i} that characterize Faci,1idF_{\text{ac}}^{i},1\leq i\leq d, following Definition 1. Moreover, scattering filters are a more practical choice for two major reasons. Firstly, scattering filters nicely balance the trade-off between oversmoothing and underreaching and are able to utilize a broad receptive field while still preserving information contained in higher frequencies, as explained in Proposition 1 and the subsequent discussion. Secondly, scattering filters exhibit wide receptive fields with significantly fewer trained parameters.

6 Empirical Results

We now present our empirical results, starting with semi-supervised node classification. We first show that our Sc-GCN model [33] is able to overcome the oversmoothing and underreaching problems and exhibits superior performance to both a pure GCN network and a pure scattering network. We then show that we can further improve performance, particularly on complex datasets, by using GSAN [34], which incorporates an attention mechanism. Finally, we also show that our models perform well on graph-level tasks.

6.1 Semi-Supervised Node Classification

Scattering GCN. We compare the performance of Sc-GCN to both the original GCN [5] as well as the closely related ChebNet [28] and a network which uses Gaussian random fields to propagate label information [42]. Additionally, we compare against [13], which uses a pure-scattering approach. Further, we also compare Sc-GCN to two methods designed, in part, to address the oversmoothing problem. In [9], this problem is directly addressed by using partially absorbing random walks [43] to slow the mixing of node features in regions of the graph with high connectivity. The GAT graph attention network from [17], on the other hand, addresses this problem indirectly via an attention mechanism which trains an adaptive node-wise weighting of the smoothing operation. Lastly, we compare to an SVM classifier which ignores the graph structure.

In our experiments, we used four popular datasets summarized in Tab. I. For full details on these datasets, please see, e.g., [44] for Citeseer, Cora, and Pubmed and [45] for DBLP. As discussed in [9], the oversmoothing problem is most acute in highly connected networks where information mixes rapidly and nodes quickly become indistinguishable. Therefore, we included networks with different sizes and connectivity structures and order the datasets by connectivity structure (with the most connected on the right). Consistent with our expectation, Tab. II and Fig. 5 show that the advantages of our network are most significant for the networks with high connectivity.

We implemented all of these methods based upon the code made available by the original authors, and, for fair comparison, we tuned and evaluated each of these methods using the standard splits provided for benchmark datasets. Whenever possible, we made sure that our results are consistent with previously reported accuracies. We tuned our method, both the hyperparameters and number of scattering and GCN channels, via a grid search (over the same set for all datasets), and we used the same cross-validation procedure for our method and the competing methods. In Appendix B, we provide further details and also perform an ablation study evaluating the importance of each component in our proposed architecture.

TABLE I: Dataset characteristics: number of nodes, edges, and features; mean ±\pm std. of node degrees; ratio of #edges to #nodes.
Dataset Nodes Edges Features Degrees Edges/Nodes
Citeseer 3,327 4,732 3,703 3.77 ± 3.38 1.42
Cora 2,708 5,429 1,433 4.90 ± 5.22 2.00
Pubmed 19,717 44,338 500 5.50 ± 7.43 2.25
DBLP 17,716 52,867 1,639 6.97 ± 9.35 2.98
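The degree statistics and edge-to-node ratios in Tab. I can be recomputed from the raw graphs along the following lines (a networkx sketch; the dataset loaders and any benchmark-specific preprocessing, e.g., symmetrization, are omitted and may affect the exact numbers):

```python
import numpy as np
import networkx as nx

def connectivity_stats(G):
    # Summary statistics as reported in Tab. I.
    degrees = np.array([d for _, d in G.degree()])
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "degree": f"{degrees.mean():.2f} ± {degrees.std():.2f}",
        "edges_per_node": round(G.number_of_edges() / G.number_of_nodes(), 2),
    }
```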
TABLE II: Classification accuracy (top two marked in bold; best one underlined) of Scattering GCN on four benchmark datasets (ordered by increasing connectivity) compared to four other GNNs [17, 9, 5, 28], a non-GNN approach [42] based on belief propagation, a pure graph scattering baseline [13], and a nongeometric baseline only using node features with linear SVM.
Model Citeseer Cora Pubmed DBLP
Sc-GCN (ours) 71.7 84.2 79.4 81.5
GAT [17] 72.5 83.0 79.0 66.1
Partially absorbing [9] 71.2 81.7 79.2 56.9
GCN [5] 70.3 81.5 79.0 72.0
Chebyshev [28] 69.8 78.1 74.4 57.3
Label Propagation [42] 58.2 77.3 71.0 53.0
Graph scattering [13] 67.5 81.9 69.8 69.4
Node features (SVM) 61.1 58.0 49.9 48.2

We report test classification accuracy in Tab. II, which shows that our approach outperforms other methods on three out of the four considered datasets. Only on Citeseer are we slightly outperformed by GAT. However, this is the dataset with the weakest connectivity structure (see Tab. I), while at the same time having the most informative node features (achieving 61.1% accuracy via linear SVM without considering graph information). In contrast, on DBLP, which exhibits the richest connectivity structure and least informative features (only 48.2% SVM accuracy), we significantly outperform GAT (by over 15%). Notably, GAT itself significantly outperforms all other methods (by 6.8% or more) apart from the graph scattering baseline [13].

Figure 5 (columns: (a) Citeseer, (b) Cora, (c) Pubmed, (d) DBLP): Impact of training set size (top) and training time (bottom) on classification accuracy and error (correspondingly); training size measured relative to the original training size of each dataset; training time and validation error plotted in logarithmic scale; runtime measured for all methods on the same hardware, using original implementations accompanying their publications.

The task of semi-supervised learning is motivated by the fact that in real-world applications one typically only has labels for a small fraction of the nodes. Therefore, we next analyze the performance of our network with limited amounts of training data. Fig. 5 (top) plots the classification accuracy of Sc-GCN, as well as that of the other networks (except the baselines), as the amount of training data varies from 20% to 100% of its original amount. Across all networks, we see that Sc-GCN exhibits the best performance when the training data is sufficiently limited. Moreover, we see that it outperforms GAT, which is the next best-performing network in terms of overall accuracy (see Tab. II), on all datasets whenever the training data is reduced to 60% of its original size or less.

In Fig. 5 (bottom), we plot the evolution of the validation error during the training process. We note that Sc-GCN is able to reach a low validation error significantly faster than GAT, which is the next most competitive method in terms of overall accuracy. Several other methods do achieve low validation errors faster on several datasets, but their final accuracy is much lower than that of Sc-GCN. Moreover, on Pubmed, which is the largest dataset, Sc-GCN achieves a low validation error at about the same rate as GCN and at a significantly faster rate than all other methods.

TABLE III: Dataset characteristics & comparison of node classification test accuracy. Datasets are ordered by increasing homophily.
Dataset Classes Nodes Edges Homophily GCN GAT Sc-GCN GSAN
Texas 5 183 295 0.11 59.5 58.4 60.3 60.5
Chameleon 5 2,277 31,421 0.23 28.2 42.9 51.2 61.2
CoraFull 70 19,793 63,421 0.57 62.2 51.9 62.5 64.5
Wiki-CS 10 11,701 216,123 0.65 77.2 77.7 78.1 78.6
Citeseer 6 3,327 4,676 0.74 70.3 72.5 71.7 71.3
Pubmed 3 19,717 44,327 0.80 79.0 79.0 79.4 79.8
Cora 7 2,708 5,276 0.81 81.5 83.0 84.2 84.0
DBLP 4 17,716 52,867 0.83 72.0 66.1 81.5 84.3

Geometric Scattering Attention Network. Having established the practical utility of the hybrid network Sc-GCN above, we next show that performance may be further improved by using GSAN, which builds upon Sc-GCN by incorporating an attention mechanism. We evaluate performance again for semi-supervised node classification and compare to two popular models (namely GCN [5] and GAT [17]), as well as the original Sc-GCN. We extend our experiments to eight benchmark datasets that we order according to increasing homophily, as shown in Tab. III. The homophily of a graph is quantified by the fraction of edges that connect nodes that share the same label. High homophily indicates a relatively smooth label distribution, while low homophily suggests a more complex labeling pattern. Texas and Chameleon are low-homophily datasets where nodes correspond to webpages and edges to links between them, with classes corresponding to webpage topic or monthly traffic (discretized into five levels), respectively [46]. Wiki-CS consists of nodes that represent computer science articles with the edges representing the hyperlinks [47]. CoraFull is the larger version of the Cora dataset [48]. The remaining four datasets (i.e., Cora, Citeseer, Pubmed, DBLP) are the citation networks that we already studied when evaluating Sc-GCN above. Notably, compared to the added datasets, the latter four exhibit relatively high homophily.

We split the datasets into training, validation, and test sets. The hyperparameters (number of attention heads Γ\Gamma, residual parameter α\alpha, and channel widths) are tuned via grid search using the validation set.

The results in Tab. III show that both Sc-GCN and GSAN outperform GCN and GAT on seven out of eight datasets. The advantages of GSAN over Sc-GCN are particularly notable on the medium-sized and large datasets Chameleon and CoraFull, which exhibit relatively low homophily. The most striking result is for Chameleon. Here, Sc-GCN clearly outperforms GCN and GAT (by at least 8.2%), and GSAN improves performance considerably further (by an additional 10%).

Analyzing the node-wise attention weights (see Sec. 4.2) allows us to understand the importance of different channels for different datasets. We consider here the ratio between attention assigned to band-pass and low-pass channels (over all attention heads). For that, we first sum up attention over low-pass and band-pass channels, respectively, i.e., 𝜶lowγi𝜶low,iγ\operatorname{\boldsymbol{\alpha}}_{\text{low}}\coloneqq\sum_{\gamma}\sum_{i}\boldsymbol{\alpha}_{\text{low},i}^{\gamma} and 𝜶bandγi𝜶band,iγ\operatorname{\boldsymbol{\alpha}}_{\text{band}}\coloneqq\sum_{\gamma}\sum_{i}\boldsymbol{\alpha}_{\text{band},i}^{\gamma}. Then, for each node vv, we calculate the ratio ζv𝜶band[v]/𝜶low[v]\zeta_{v}\coloneqq\operatorname{\boldsymbol{\alpha}}_{\text{band}}[v]/\operatorname{\boldsymbol{\alpha}}_{\text{low}}[v]. In Fig. 6, we present the distributions of {ζv:vV}\{\zeta_{v}:v\in V\} for four datasets.
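The ratios ζ_v are obtained directly from the learned attention weights; the sketch below assumes they are stored as arrays of shape (heads, channels, nodes), which is an illustrative layout rather than the exact one used in our implementation.

```python
import numpy as np

def attention_ratios(alpha_low, alpha_band):
    # alpha_low:  (heads, low-pass channels, n) node-wise attention weights
    # alpha_band: (heads, band-pass channels, n) node-wise attention weights
    a_low = alpha_low.sum(axis=(0, 1))    # total low-pass attention per node
    a_band = alpha_band.sum(axis=(0, 1))  # total band-pass attention per node
    return a_band / a_low                 # zeta_v for every node v

# Histogramming attention_ratios(...) over all nodes yields the curves shown in Fig. 6.
```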

Figure 6: Distribution of attention ratios per node between band-pass (scattering) and low-pass (GCN) channels across all heads for DBLP, Chameleon, Citeseer, and WikiCS.

Examining the distributions, we observe four different dataset profiles. Citeseer and WikiCS exhibit minimal spread of attention ratios. In Citeseer, most of the attention is given to the low-pass channels, while WikiCS shows very balanced channel usage. In contrast, DBLP and Chameleon exhibit large spreads. Although the majority of nodes in both datasets value low-pass filters more, some nodes pay up to five times as much attention to band-pass channels as to low-pass ones (and vice versa). Interestingly, the distribution for Chameleon has two peaks, suggesting two node populations that require different band information, which we interpret to be a driving factor for the improvement of GSAN on Chameleon (see Tab. III).

6.2 Graph Classification and Regression

Graph Classification. We use COLLAB and IMDB-BINARY as benchmark datasets. These are ego networks extracted from scientific collaborations and movie collaborations, respectively [49]. COLLAB contains 5,000 graphs with three classes and IMDB-BINARY contains 1,000 graphs with two classes. We use an 80-10-10 split between training, validation and test sets. The datasets contain graph structures but no associated graph signals. In our work, we compute the eccentricity (for connected graphs), degree, and clustering coefficient of each vertex, and use these as input signals to our network.

We use a one-layer GSAN with 8 heads and 16 hidden units for COLLAB and a one-layer GSAN with 4 heads and 16 hidden units for IMDB-BINARY, followed by one graph residual convolutional layer. We then apply a Set2Set module [50] followed by a multi-layer perceptron (MLP). Set2Set is a graph pooling architecture in which an LSTM-based [51] module is applied iteratively to the graph signal for kk processing steps. During each processing step, node-wise feature signals of variable graph size are integrated using an LSTM-based attention mechanism. In our model, the GSAN-based layer outputs a graph signal 𝐗n×d\operatorname{\boldsymbol{X}}\in\operatorname{\mathbb{R}}^{n\times d} and the Set2Set layer outputs a graph-level feature vector Set2Set(𝐗)2dSet2Set(\operatorname{\boldsymbol{X}})\in\operatorname{\mathbb{R}}^{2d} for each graph. After the Set2Set layer, we apply an MLP layer 2d\operatorname{\mathbb{R}}^{2d}\rightarrow\operatorname{\mathbb{R}} on the graph-level signal for the classification task. We use a Set2Set layer with 3 processing steps and a two-layer MLP for graph pooling. We also provide a pure-scattering baseline (Sc-only), based on the approach proposed in [13] together with an SVM classifier. Our results in Tab. IV show that both proposed hybrid scattering networks improve over the compared baselines.
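The following sketch illustrates the graph-level readout described above; the GSAN and residual-convolution layers are abstracted as any module that produces node features of width d, Set2Set is taken from PyTorch Geometric, and the widths and class counts are illustrative rather than the exact configuration used in our experiments.

```python
from torch import nn
from torch_geometric.nn import Set2Set

class GraphClassificationReadout(nn.Module):
    """Node features -> Set2Set pooling (3 processing steps) -> two-layer MLP."""
    def __init__(self, d, num_classes, processing_steps=3):
        super().__init__()
        self.pool = Set2Set(d, processing_steps=processing_steps)  # outputs 2*d per graph
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, num_classes))

    def forward(self, x, batch):
        # x: (n, d) node representations from the GSAN + residual convolution layers;
        # batch: (n,) graph membership of each node in the mini-batch.
        return self.mlp(self.pool(x, batch))
```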

TABLE IV: Comparison of graph classification test accuracy (higher is better).
Dataset GCN GAT Sc-only Sc-GCN GSAN
COLLAB 0.592 0.523 0.640 0.690 0.704
IMDB-BINARY 0.710 0.632 0.710 0.740 0.760
TABLE V: Comparison of graph regression test error (lower is better).
Dataset GCN GAT Sc-only Sc-GCN GSAN
ZINC 0.469 0.463 0.51 0.452 0.430
Lipophilicity 1.05 0.950 1.19 1.03 1.02

Graph Regression. We use ZINC and Lipophilicity as benchmark datasets [52, 53]. For the ZINC dataset, the target is the penalized water-octanol partition coefficient of molecules. For the Lipophilicity dataset, the task is predicting the experimental octanol/water distribution coefficient of different molecules. ZINC contains 1,000 graphs and Lipophilicity contains 4,200 graphs. Graphs in ZINC have 75 input features and graphs in Lipophilicity have 9. We use an 80-10-10 split between training, validation and test sets.

We again use a GSAN-based approach, this time consisting of two geometric scattering attention layers followed by one graph residual convolution layer. For ZINC, we use a 32-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 128 hidden units as the second layer. For Lipophilicity, we use a 4-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 96 hidden units as the second layer. The aggregation module applies an MLP to the output node representations, producing a scalar per node, followed by average pooling over each molecule. Again, we provide a scattering-only baseline (Sc-only), using a modified version of GSAN with all low-pass channels removed, so that the network utilizes only scattering channels. Our results are presented in Tab. V, demonstrating the utility of our GSAN approach for this task as well.
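For completeness, a sketch of this regression readout (again assuming node features of width d and using PyTorch Geometric's global_mean_pool; the exact widths are illustrative):

```python
from torch import nn
from torch_geometric.nn import global_mean_pool

class GraphRegressionReadout(nn.Module):
    """Node-wise MLP -> one scalar per node -> mean pooling over each molecule."""
    def __init__(self, d):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x, batch):
        # x: (n, d) node representations; batch: (n,) molecule index of each node.
        return global_mean_pool(self.node_mlp(x), batch).squeeze(-1)
```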

7 Conclusion

Here, we studied GNN approaches for semi-supervised node classification and investigated some of the major limitations of today’s architectures. Many popular models are known to essentially rely on the low-pass filtering of graph signals. Based on this observation, we proposed a novel hybrid GNN framework that can leverage higher-frequency information not captured by traditional GCN models. Our construction is based on geometric scattering, a concept that was previously used mainly for graph classification, which we adapted to the node level. Our theoretical study suggests that scattering filters nicely balance the trade-off between oversmoothing and underreaching. In particular, we theoretically established that the resulting scattering filters are more sensitive to the graph topology than a large class of GNN architectures, including GCN, when used in conjunction with node features that encode graph structure information. Empirically, we first evaluated our Sc-GCN model and demonstrated its efficacy in alleviating oversmoothing. We then evaluated our GSAN model, which further improves performance, particularly in more complex (low-homophily) settings, via an attention framework. We also provided evidence that the proposed hybrid scattering networks perform well on graph-level tasks, both in classification and regression. In future work, one might further explore the potential of features that carry graph structure information empirically, and analyze more datasets in the context of graph-level tasks.

Acknowledgments

The authors would like to thank Dongmian Zou for fruitful discussions. This work was partially funded by Fin-ML CREATE graduate studies scholarship for PhD [F.W.]; IVADO (Institut de valorisation des données) grant PRF-2019-3583139727, FRQNT (Fonds de recherche du Québec - Nature et technologies) grant 299376, Canada CIFAR AI Chair [G.W.]; NSF (National Science Foundation) grant DMS-1845856 [M.H.]; and NIH (National Institutes of Health) grant NIGMS-R01GM135929 [M.H.,G.W.]. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

References

  • [1] M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
  • [2] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv:2104.13478, 2021.
  • [3] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning.   PMLR, 2017, pp. 1263–1272.
  • [4] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
  • [5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2016.
  • [6] A. M. Fout, “Protein interface prediction using graph convolutional networks,” Ph.D. dissertation, Colorado St. Univ., 2017.
  • [7] N. De Cao and T. Kipf, “Molgan: An implicit generative model for small molecular graphs,” arXiv:1805.11973, 2018.
  • [8] B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor, “Spectral multigraph networks for discovering and fusing relationships in molecules,” arXiv:1811.09595, 2018.
  • [9] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).   Association for the Advancement of Artificial Intelligence, 2018, pp. 3538–3545.
  • [10] H. Nt and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv:1905.09550, 2019.
  • [11] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” in International Conference on Learning Representations, 2019.
  • [12] F. Gao, G. Wolf, and M. Hirn, “Geometric scattering for graph data analysis,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2122–2131.
  • [13] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Applied and Computational Harmonic Analysis, vol. 49, no. 3, pp. 1046–1074, 2020.
  • [14] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
  • [15] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
  • [16] R. R. Coifman and M. Maggioni, “Diffusion wavelets,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 53–94, 2006.
  • [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [18] R. Liao, Z. Zhao, R. Urtasun, and R. Zemel, “Lanczosnet: Multi-scale deep graph convolutional networks,” in the 7th International Conference on Learning Representations (ICLR), 2019.
  • [19] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan, “Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing,” in international conference on machine learning.   PMLR, 2019, pp. 21–29.
  • [20] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” arXiv:1904.07785, 2019.
  • [21] F. Gama, A. Ribeiro, and J. Bruna, “Stability of graph scattering transforms,” in NeurIPS, 2019, pp. 8038–8048.
  • [22] M. Perlmutter, F. Gao, G. Wolf, and M. Hirn, “Understanding graph neural networks with asymmetric geometric scattering transforms,” arXiv:1911.06253, 2019.
  • [23] D. Zou and G. Lerman, “Encoding robust representation for graph generation,” in 2019 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2019, pp. 1–9.
  • [24] E. Castro, A. Benz, A. Tong, G. Wolf, and S. Krishnaswamy, “Uncovering the folding landscape of rna secondary structure using deep graph embeddings,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 4519–4528.
  • [25] D. Bhaskar, J. D. Grady, M. A. Perlmutter, and S. Krishnaswamy, “Molecular graph generation via geometric scattering,” arXiv:2110.06241, 2021.
  • [26] A. Tong, F. Wenkel, K. MacDonald, S. Krishnaswamy, and G. Wolf, “Data-driven learning of geometric scattering networks,” arXiv:2010.02415, 2020.
  • [27] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • [28] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 3844–3852.
  • [29] R. Levie, W. Huang, L. Bucci, M. Bronstein, and G. Kutyniok, “Transferability of spectral graph convolutional neural networks,” Journal of Machine Learning Research, vol. 22, no. 272, pp. 1–59, 2021.
  • [30] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using Lanczos method,” 2015, arXiv:1509.04537.
  • [31] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.
  • [32] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva, “The logical expressiveness of graph neural networks,” in International Conference on Learning Representations, 2020.
  • [33] Y. Min, F. Wenkel, and G. Wolf, “Scattering gcn: Overcoming oversmoothness in graph convolutional networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [34] ——, “Geometric scattering attention networks,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 8518–8522.
  • [35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in the 6th International Conference on Learning Representations (ICLR), 2018.
  • [36] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2018.
  • [37] H. Wang and J. Leskovec, “Unifying graph convolutional neural networks and label propagation,” arXiv:2002.06755, 2020.
  • [38] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” arXiv:1806.08829, 2018.
  • [39] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” Journal of Complex Networks, vol. 9, no. 2, p. cnab014, 2021.
  • [40] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?” in International Conference on Learning Representations, 2021.
  • [41] M. Rudelson, “Invertibility of random matrices: Norm of the inverse,” Annals of Mathematics, 2008.
  • [42] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp. 912–919.
  • [43] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S. fu Chang, “Learning with partially absorbing random walks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 3077–3085.
  • [44] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in International conference on machine learning.   PMLR, 2016, pp. 40–48.
  • [45] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” Network, vol. 11, no. 9, p. 12, 2016.
  • [46] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” arXiv:2002.05287, 2020.
  • [47] P. Mernyei and C. Cangea, “Wiki-cs: A wikipedia-based benchmark for graph neural networks,” arXiv:2007.02901, 2020.
  • [48] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in International Conference on Learning Representations, 2018, pp. 1–13.
  • [49] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [50] O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” arXiv:1511.06391, 2015.
  • [51] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [52] J. J. Irwin and B. K. Shoichet, “Zinc- a free database of commercially available compounds for virtual screening,” Journal of chemical information and modeling, vol. 45, no. 1, pp. 177–182, 2005.
  • [53] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science, vol. 9, no. 2, pp. 513–530, 2018.
  • [54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
[Uncaptioned image] Frederik Wenkel received the B.Sc. degree in Mathematics and the M.Sc. degree in Mathematics at Technical University of Munich, in 2019. He is currently a Ph.D. candidate in Applied Mathematics at Université de Montréal (UdeM) and Mila (the Quebec AI institute), working on geometric deep learning. In particular, he is interested in graph neural networks and their applications in domains such as social networks, bio-chemistry and finance.
[Uncaptioned image] Yimeng Min is a Ph.D. student in Computer Science at Cornell University. He received his B.Sc. degree in Physics from Nanjing University and M.A.Sc. degree in Electrical and Computer Engineering from U. of Toronto. His research area is Artificial Intelligence with a focus on graph neural networks and constraint reasoning.
[Uncaptioned image] Matthew Hirn is an Associate Professor in the Dept. of CMSE and the Dept. of Mathematics at Michigan State University. He is the leader of the ComplEx Data Analysis Research (CEDAR) team, which develops new tools in computational harmonic analysis, machine learning, and data science for the analysis of complex, high dimensional data. Hirn received his B.A. in Mathematics from Cornell University and his Ph.D. in Mathematics from the University of Maryland, College Park. Before arriving at MSU, he held postdoctoral appointments in the Applied Math Program at Yale University and in the Department of Computer Science at École Normale Supérieure, Paris. He is the recipient of the Alfred P. Sloan Fellowship (2016), the DARPA Young Faculty Award (2016), the DARPA Director’s Fellowship (2018), and the NSF CAREER award (2019), and was designated a Kavli Fellow by the National Academy of Sciences (2017).
[Uncaptioned image] Michael Perlmutter is a Hedrick Assistant Adjunct Professor in the Dept. of Mathematics at the UCLA. He has also held postdoctoral positions in the Dept. of CMSE at Michigan State University and in the Dept. of Statistics and Operations Research at the University of North Carolina at Chapel Hill. He earned his Ph.D. in Mathematics from Purdue University in 2016. His research uses the methods of applied probability and harmonic analysis to develop and analyze methods for data sets with geometric structure.
[Uncaptioned image] Guy Wolf is an associate professor in the Dept. of Mathematics and Statistics (DMS) at the Université de Montréal (UdeM), a core academic member of Mila (the Quebec AI institute), and holds a Canada CIFAR AI Chair. He is also affiliated with the CRM center of mathematical sciences and the IVADO institute of data valorization. He holds an M.Sc. and a Ph.D. in computer science from Tel Aviv University. Previously, he was a postdoctoral researcher (2013-2015) in the Dept. of Computer Science at École Normale Supérieure in Paris (France), and a Gibbs Assistant Professor (2015-2018) in the Applied Math. Program at Yale University. His research focuses on manifold learning and geometric deep learning for exploratory data analysis, including methods for dimensionality reduction, visualization, denoising, data augmentation, and coarse graining. Also, he is particularly interested in biomedical data exploration applications of such methods, e.g., in single cell genomics/proteomics and neuroscience.

Appendix A

A.1 Supplemental Definitions for Sec. 5

Here, we provide additional definitions and notations.

Definition 5 (Node Discriminability).

Let 𝐗\operatorname{\boldsymbol{X}} be a node feature matrix, and let 𝐘=F(𝐗)\operatorname{\boldsymbol{Y}}=F(\operatorname{\boldsymbol{X}}) for some graph filter F:n×dXn×dYF:\mathbb{R}^{n\times d_{X}}\rightarrow\mathbb{R}^{n\times d_{Y}}. We say that FF discriminates two nodes v,wVv,w\in V based on 𝐗\operatorname{\boldsymbol{X}} if 𝐘[v]𝐘[w]\operatorname{\boldsymbol{Y}}[v]\neq\operatorname{\boldsymbol{Y}}[w].

Definition 6 (K-step Neighborhood).

For K0K\in\mathbb{N}_{0} (we use the conventions {1,2,3,}\operatorname{\mathbb{N}}\coloneqq\{1,2,3,\dots\}, 0{0}\operatorname{\mathbb{N}}_{0}\coloneqq\operatorname{\mathbb{N}}\cup\{0\} and k{n:nk}\operatorname{\mathbb{N}}_{\geq k}\coloneqq\{n\in\operatorname{\mathbb{N}}:n\geq k\} for kk\in\operatorname{\mathbb{N}}), we define the KK-step neighborhood of vv as 𝒩vK{uV:1d(u,v)K}\operatorname{\mathcal{N}}_{v}^{K}\coloneqq\{u\in V:1\leq d(u,v)\leq K\} (in particular, 𝒩v0=\operatorname{\mathcal{N}}_{v}^{0}=\emptyset and 𝒩v1=𝒩v\operatorname{\mathcal{N}}_{v}^{1}=\operatorname{\mathcal{N}}_{v}). We further write 𝒩v¯K𝒩vK{v}\operatorname{\mathcal{N}}_{\underline{v}}^{K}\coloneqq\operatorname{\mathcal{N}}_{v}^{K}\cup\{v\} (in particular, 𝒩v¯0={v}\operatorname{\mathcal{N}}_{\underline{v}}^{0}=\{v\}).

Definition 7 (Induced Subgraph).

Given a set VSVV_{S}\subseteq V, we refer to a subgraph S=(VS,ES)S=(V_{S},E_{S}) of GG as an induced subgraph, if ESEE_{S}\subset E is the set of all edges that connect nodes from VSV_{S}, i.e., ES=E(VS){eE:e={v,w}VS×VS}E_{S}=E(V_{S})\coloneqq\{e\in E:e=\{v,w\}\in V_{S}\times V_{S}\}. In this case, we will write SG(VS)S\coloneqq G(V_{S}).

Definition 8 (Graph Isomorphism).

A graph isomorphism between graphs G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}) is a bijection ϕ:VV\phi:V\rightarrow V^{\prime} such that {v,w}E\{v,w\}\in E if and only if {ϕ(v),ϕ(w)}E\{\phi(v),\phi(w)\}\in E^{\prime}. We say that GG and GG^{\prime} are ϕ\phi-isomorphic and write GϕGG\cong^{\phi}G^{\prime}. We also define ϕ(A){ϕ(w):wA}\phi(A)\coloneqq\{\phi(w):w\in A\} for AVA\subset V.

Definition 9 (Boundary and Interior of Induced Subgraph).

Let S=(VS,ES)S=(V_{S},E_{S}) be an induced subgraph of G=(V,E)G=(V,E). We define the boundary of SS as S{wVS:d(w,VVS)=1}.\partial S\coloneqq\{w\in V_{S}:d(w,V\setminus V_{S})=1\}. Further, we define the interior of SS as int(S)VSSVS\operatorname{int}(S)\coloneqq V_{S}\setminus\partial S\subset V_{S}. When convenient, we will also use the notation VSS\partial V_{S}\coloneqq\partial S and int(VS)int(S)\operatorname{int}(V_{S})\coloneqq\operatorname{int}(S).

Definition 10 (Spatial Support of Graph Filter).

We say that a graph filter FF has spatial support L0L\in\operatorname{\mathbb{N}}_{0} if the value of F(𝐗)[v]F(\operatorname{\boldsymbol{X}})[v] depends only on values of 𝐗\operatorname{\boldsymbol{X}} at nodes within LL steps of vv, i.e., if for any vVv\in V and any node feature matrices 𝐗1\operatorname{\boldsymbol{X}}_{1} and 𝐗2\operatorname{\boldsymbol{X}}_{2}, we have F(𝐗1)[v]=F(𝐗2)[v]F(\operatorname{\boldsymbol{X}}_{1})[v]=F(\operatorname{\boldsymbol{X}}_{2})[v] whenever 𝐗1[w]=𝐗2[w]\operatorname{\boldsymbol{X}}_{1}[w]=\operatorname{\boldsymbol{X}}_{2}[w] for all w𝒩v¯Lw\in\mathcal{N}_{\underline{v}}^{L}.
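The objects in Definitions 6, 7, and 9 can be computed directly; the sketch below uses networkx with illustrative function names.

```python
import networkx as nx

def k_step_neighborhood(G, v, K, include_v=True):
    # N_v^K, optionally together with v itself (the underlined variant).
    nodes = set(nx.single_source_shortest_path_length(G, v, cutoff=K))
    return nodes if include_v else nodes - {v}

def induced_subgraph(G, V_S):
    # G(V_S): all edges of G whose endpoints both lie in V_S.
    return G.subgraph(V_S)

def boundary(G, V_S):
    # Nodes of V_S with at least one neighbor outside V_S.
    return {w for w in V_S if any(u not in V_S for u in G.neighbors(w))}

def interior(G, V_S):
    return set(V_S) - boundary(G, V_S)
```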

A.2 Proof of Theorem 1

We use induction to show that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L-\ell} and all 0L0\leq\ell\leq L.

=0\ell=0:   Let u𝒩v¯Lu\in\operatorname{\mathcal{N}}_{\underline{v}}^{L}. By definition, we have d(u,v)L,d(u,v)\leq L, and so the triangle inequality implies that 𝒩u¯K𝒩v¯K+L.\operatorname{\mathcal{N}}_{\underline{u}}^{K}\subseteq\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}. This implies that G(𝒩u¯K)ϕG(𝒩ϕ(u)¯K)G\left(\operatorname{\mathcal{N}}_{\underline{u}}^{K}\right)\cong^{\phi}G\left(\operatorname{\mathcal{N}}_{\underline{\phi(u)}}^{K}\right) and therefore the assumption that 𝐗0\operatorname{\boldsymbol{X}}^{0} is KK-intrinsic implies that 𝐗u0=𝐗ϕ(u)0\operatorname{\boldsymbol{X}}^{0}_{u}=\operatorname{\boldsymbol{X}}^{0}_{\phi(u)}.

+1\ell\rightarrow\ell+1:   Now assume that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}^{\ell}_{u}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(u)} for all u𝒩v¯Lu\in\mathcal{N}_{\underline{v}}^{L-\ell}, and let w𝒩v¯L(+1)w\in\mathcal{N}_{\underline{v}}^{L-(\ell+1)}. Since 𝒩v¯L(+1)𝒩v¯L\mathcal{N}_{\underline{v}}^{L-(\ell+1)}\subseteq\mathcal{N}_{\underline{v}}^{L-\ell}, we have 𝐗w=𝐗ϕ(w)\operatorname{\boldsymbol{X}}^{\ell}_{w}=\operatorname{\boldsymbol{X}}^{\ell}_{\phi(w)}. Moreover, since the degree is 1-intrinsic, we have dw=dϕ(w),d_{w}=d_{\phi(w)}, and therefore f(𝐗w,dw)=f(𝐗ϕ(w),dϕ(w))f(\operatorname{\boldsymbol{X}}^{\ell}_{w},d_{w})=f(\operatorname{\boldsymbol{X}}^{\ell}_{\phi(w)},d_{\phi(w)}). Since 𝒩w¯𝒩v¯L,\mathcal{N}_{\underline{w}}\subseteq\mathcal{N}_{\underline{v}}^{L}, we have G(𝒩w¯)ϕG(𝒩ϕ(w)¯)G(\mathcal{N}_{\underline{w}})\cong^{\phi}G(\mathcal{N}_{\underline{\phi(w)}}). Therefore, since agg is a function defined on a multi-set, the result will follow from Eq. 20 once we show that g(𝐗u,dw,du)=g(𝐗ϕ(u),dϕ(w),dϕ(u))g(\operatorname{\boldsymbol{X}}_{u}^{\ell},d_{w},d_{u})=g(\operatorname{\boldsymbol{X}}_{\phi(u)}^{\ell},d_{\phi(w)},d_{\phi(u)}) for all u𝒩wu\in\mathcal{N}_{w}. Towards this end, we note that 𝐗u=𝐗ϕ(u)\operatorname{\boldsymbol{X}}_{u}^{\ell}=\operatorname{\boldsymbol{X}}_{\phi(u)}^{\ell} for all u𝒩wu\in\mathcal{N}_{w} by the inductive assumption and the fact that 𝒩w¯𝒩v¯L\mathcal{N}_{\underline{w}}\subseteq\mathcal{N}_{\underline{v}}^{L-\ell}. Similarly, since the degree is 1-intrinsic, we have dw=dϕ(w)d_{w}=d_{\phi(w)} and du=dϕ(u)d_{u}=d_{\phi(u)}, and thus the proof is complete. ∎

A.3 Proof of Lemma 1

We will proceed by induction over jj.

j=0j=0:   Using the notation established in Notation 1, we have that U~0=Vdiffd~\widetilde{U}_{0}=V_{\text{diff}}^{\tilde{d}}, and as noted in (i), this is exactly the set of nodes in 𝒩v~d~\mathcal{N}_{\tilde{v}}^{\tilde{d}} where a structural difference manifests rel. to ϕ\phi and 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}. Also note that δu00\delta_{u}^{0}\neq 0 for uU~0u\in\tilde{U}_{0} according to (iv) of the notation. Thus, the lemma holds for j=0j=0.

jj+1j\rightarrow j+1:   Assume Lemma 1 to hold for some 0j<d~0\leq j<\tilde{d}. Thus, by the definition of structural difference, and as dw=dϕ(w)d_{w}=d_{\phi(w)} for wint(𝒩v¯~d~)w\in\operatorname{int}\left(\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}}\right), our claim is equivalent to showing

𝐘j+1[u]𝐘j+1[ϕ(u)],\operatorname{\boldsymbol{Y}}^{j+1}[u]\neq\operatorname{\boldsymbol{Y}}^{j+1}[\phi(u)], (24)

and δuj+10\delta_{u}^{j+1}\neq 0 for at least one uU~j+1u\in\tilde{U}_{j+1} and

𝐘j+1[w]=𝐘j+1[ϕ(w)],\operatorname{\boldsymbol{Y}}^{j+1}[w]=\operatorname{\boldsymbol{Y}}^{j+1}[\phi(w)], (25)

for all w𝒩v¯~d~(j+1)U~j+1w\in\operatorname{\mathcal{N}}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1}.

Since 𝐏j+1𝐗~=𝐏𝐏j𝐗~=𝐏𝐘j,\operatorname{\boldsymbol{P}}^{j+1}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{P}}^{j}\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}=\operatorname{\boldsymbol{P}}\operatorname{\boldsymbol{Y}}^{j}, we may use Eq. 8 to see that Eq. 24 is equivalent to

12𝐘j[u]+12w𝒩udw1𝐘j[w]\displaystyle\frac{1}{2}\operatorname{\boldsymbol{Y}}^{j}[u]+\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{u}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w]
\displaystyle\neq 12𝐘j[ϕ(u)]+12w𝒩ϕ(u)dw1𝐘j[w]\displaystyle\frac{1}{2}\operatorname{\boldsymbol{Y}}^{j}[\phi(u)]+\frac{1}{2}\sum_{w\in\operatorname{\mathcal{N}}_{\phi(u)}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w] (26)

for at least one uU~j+1𝒩v¯~d~ju\in\widetilde{U}_{j+1}\subseteq\mathcal{N}_{\tilde{\underline{v}}}^{\tilde{d}-j}. One may check that U~j+1U~j=\widetilde{U}_{j+1}\cap\widetilde{U}_{j}=\emptyset. Therefore, inductively applying Eq. 25 (with jj in place of j+1j+1) yields that 𝐘j[u]=𝐘j[ϕ(u)]\operatorname{\boldsymbol{Y}}^{j}[u]=\operatorname{\boldsymbol{Y}}^{j}[\phi(u)]. Additionally, the inductive assumption implies that 𝐘j[w]=𝐘j[ϕ(w)]\operatorname{\boldsymbol{Y}}^{j}[w]=\operatorname{\boldsymbol{Y}}^{j}[\phi(w)] for all w𝒩uU~jw\in\operatorname{\mathcal{N}}_{u}\setminus\tilde{U}_{j}. Therefore, Eq. 26 is equivalent to

w𝒩uU~jdw1𝐘j[w]\displaystyle\sum_{w\in\operatorname{\mathcal{N}}_{u}\cap\tilde{U}_{j}}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w] w𝒩ϕ(u)ϕ(U~j)dw1𝐘j[w].\displaystyle\neq\sum_{w\in\operatorname{\mathcal{N}}_{\phi(u)}\cap\phi(\tilde{U}_{j})}d_{w}^{-1}\operatorname{\boldsymbol{Y}}^{j}[w]. (27)

This indeed holds for at least one uU~j+1u\in\tilde{U}_{j+1} as Eq. 27 is equivalent to

$0\neq\sum_{w\in\mathcal{N}_{u}\cap\tilde{U}_{j}}\left(d_{w}^{-1}\boldsymbol{Y}^{j}[w]-d_{\phi(w)}^{-1}\boldsymbol{Y}^{j}[\phi(w)]\right)=\delta_{u}^{j+1},$ (28)

and, because $\operatorname{int}\left(\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}}\right)$ exhibits no coincidental correspondence (no-cc) while $\Delta_{j+1}\neq\emptyset$ according to the inductive hypothesis, $\delta_{u}^{j+1}$ is indeed nonzero for at least one $u\in\tilde{U}_{j+1}$. This completes the proof of Eq. 24.

Similarly Eq. 25 is equivalent to showing

$\frac{1}{2}\boldsymbol{Y}^{j}[w]+\frac{1}{2}\sum_{\tilde{w}\in\mathcal{N}_{w}}d_{\tilde{w}}^{-1}\boldsymbol{Y}^{j}[\tilde{w}]=\frac{1}{2}\boldsymbol{Y}^{j}[\phi(w)]+\frac{1}{2}\sum_{\tilde{w}\in\mathcal{N}_{\phi(w)}}d_{\tilde{w}}^{-1}\boldsymbol{Y}^{j}[\tilde{w}],$ (29)

for all $w\in\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}-(j+1)}\setminus\tilde{U}_{j+1}$. We first note that $\mathcal{N}_{\underline{w}}\subset\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}-j}$, and that $w\not\in\tilde{U}_{j}$, since otherwise we would have $d(\tilde{v},w)=\tilde{d}-j>\tilde{d}-(j+1)$. Therefore, the inductive hypothesis implies $\boldsymbol{Y}^{j}[w]=\boldsymbol{Y}^{j}[\phi(w)]$. Moreover, no $\tilde{w}\in\mathcal{N}_{w}$ can be an element of $\tilde{U}_{j}$, since otherwise $w$ would be an element of $\tilde{U}_{j+1}$, which is a contradiction. Therefore, the inductive assumption implies $\boldsymbol{Y}^{j}[\tilde{w}]=\boldsymbol{Y}^{j}[\phi(\tilde{w})]$ for all $\tilde{w}\in\mathcal{N}_{w}$. We have already noted that $d_{w}=d_{\phi(w)}$ for all $w\in\operatorname{int}\left(\mathcal{N}_{\underline{\tilde{v}}}^{\tilde{d}}\right)$. Thus, Eq. 29 holds and the proof is complete. ∎

A.4 Proof of Theorem 2

We need to show that we can choose the parameters in a way that guarantees $F_{p\text{-sct}}(\boldsymbol{X})[v]\neq F_{p\text{-sct}}(\boldsymbol{X})[\phi(v)]$. For simplicity, we set $\boldsymbol{\Theta}=\boldsymbol{I}_{n}$. In this case, since $\sigma$ is strictly monotonic, and therefore injective, it suffices to show that we can construct $p$ such that

𝐔p(𝐗)[v]𝐔p(𝐗)[ϕ(v)].\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[v]\neq\operatorname{\boldsymbol{U}}_{p}(\operatorname{\boldsymbol{X}})[\phi(v)]. (30)

Using binary expansion, we may choose $k_{1},\ldots,k_{m}\in\mathbb{N}_{0}$, $k_{i}<k_{i+1}$, such that $d=2^{k_{1}}+\ldots+2^{k_{m}}$. We will show that Eq. 30 holds for $p\coloneqq(k_{1},\ldots,k_{m})$. For $1\leq i\leq m$, let $p_{:i}\coloneqq(k_{1},\ldots,k_{i})$ and let $t_{i}=\sum_{j=1}^{i}2^{k_{j}}$, where $p_{:0}$ denotes the empty path of length 0 and $t_{0}=0$.
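To make this choice of scales concrete, the following minimal Python sketch reads the scales $k_{1}<\ldots<k_{m}$ off the binary expansion of $d$; the helper name dyadic_path is ours and is introduced only for illustration.

def dyadic_path(d):
    # Scales k_1 < ... < k_m with d = 2^{k_1} + ... + 2^{k_m},
    # read off from the binary expansion of d.
    assert d >= 1
    return [k for k in range(d.bit_length()) if (d >> k) & 1]

# Example: d = 11 = 2^0 + 2^1 + 2^3, so p = (0, 1, 3).
print(dyadic_path(11))  # [0, 1, 3]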

Recall the generalized path $P=(U_{0},\ldots,U_{d})$ defined in Notation 1 (iii). We will use induction to show that, for $0\leq i\leq m$, there exists at least one node in $U_{t_{i}}$ where a structural difference is manifested, while no structural difference manifests in $\mathcal{N}_{\underline{v}}^{d-t_{i}}\setminus U_{t_{i}}$, rel. to $\phi$ and $\boldsymbol{Z}^{i}\coloneqq\boldsymbol{U}_{p_{:i}}\boldsymbol{X}$. Since $t_{m}=d$ and $p_{:m}=p$, this claim will imply Eq. 30 and thus prove the theorem.

i=0i=0:   Analogous to the proof of Lemma 1, using the notation established in Notation 1 (ii), we have that U0=VdiffdU_{0}=V_{\text{diff}}^{d}, which are exactly the nodes in 𝒩vd\mathcal{N}_{v}^{d} where a structural difference manifests. Thus the claim holds for i=0i=0.

ii+1i\rightarrow i+1:   We now assume the result holds for ii. Since the degree is one-intrinsic and ti+11t_{i+1}\geq 1, we have that dv~=dϕ(v~)d_{\tilde{v}}=d_{\phi(\tilde{v})} for all v~𝒩v¯dti+1\tilde{v}\in\mathcal{N}_{\underline{v}}^{d-t_{i+1}}. Therefore, the claim is equivalent to showing

𝐙i+1[u]𝐙i+1[ϕ(u)],\operatorname{\boldsymbol{Z}}^{i+1}[u]\neq\operatorname{\boldsymbol{Z}}^{i+1}[\phi(u)], (31)

for at least one uUti+1u\in U_{t_{i+1}} and

𝐙i+1[w]=𝐙i+1[ϕ(w)],\operatorname{\boldsymbol{Z}}^{i+1}[w]=\operatorname{\boldsymbol{Z}}^{i+1}[\phi(w)], (32)

for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}.

By the inductive assumption, Eq. 31 and Eq. 32 hold with ii in place of i+1i+1 and since σ\sigma is injective, they continue to hold when 𝐙i\operatorname{\boldsymbol{Z}}^{i} is replaced with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}). Therefore, we may apply Lemma 1 with σ(𝐙i)\sigma(\operatorname{\boldsymbol{Z}}^{i}) in place of 𝐗~\operatorname{\tilde{\operatorname{\boldsymbol{X}}}}, and j=2ki+1j=2^{k_{i+1}}. By the inductive assumption, we have that d~\tilde{d} as defined in Lemma 1 is given by d~=dti\tilde{d}=d-t_{i}. Therefore, d~j=dti2ki+1=dti+1\tilde{d}-j=d-t_{i}-2^{k_{i+1}}=d-t_{i+1} and U~j=Uti+1\tilde{U}_{j}=U_{t_{i+1}}. Therefore, there exists at least one uUti+1u\in U_{t_{i+1}} where a structural difference is manifested, while for all w𝒩v¯dti+1Uti+1w\in\operatorname{\mathcal{N}}_{\underline{v}}^{d-t_{i+1}}\setminus U_{t_{i+1}}, no structural difference is manifested rel. to ϕ\phi and 𝐏2ki+1σ(𝐙i)\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right), i.e.,

𝐏2ki+1σ(𝐙i)[u]𝐏2ki+1σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]\neq\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)] (33)

and

𝐏2ki+1σ(𝐙i)[w]=𝐏2ki+1σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)]. (34)

Next, we again apply Lemma 1 with $\sigma(\boldsymbol{Z}^{i})$ in place of $\tilde{\boldsymbol{X}}$, but this time with $j=2^{k_{i+1}-1}$. We now observe that $\tilde{d}-j=d-t_{i}-2^{k_{i+1}-1}>d-t_{i+1}$ and thus $u,w\in\mathcal{N}_{\underline{v}}^{d-t_{i+1}}\subset\mathcal{N}_{\underline{v}}^{\tilde{d}-j}\setminus\tilde{U}_{j}$. Therefore, no structural difference is manifested at either $u$ or $w$ rel. to $\phi$ and $\boldsymbol{P}^{2^{k_{i+1}-1}}\sigma\left(\boldsymbol{Z}^{i}\right)$, i.e.,

𝐏2ki+11σ(𝐙i)[u]=𝐏2ki+11σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)] (35)

and

𝐏2ki+11σ(𝐙i)[w]=𝐏2ki+11σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{P}}^{2^{k_{i+1}-1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)]. (36)

Together, Eqs. 33-34 and Eqs. 35-36 imply, recalling that $\boldsymbol{\Psi}_{k_{i+1}}=\boldsymbol{P}^{2^{k_{i+1}-1}}-\boldsymbol{P}^{2^{k_{i+1}}}$,

𝚿ki+1σ(𝐙i)[u]𝚿ki+1σ(𝐙i)[ϕ(u)]\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[u]\neq\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(u)]

and

𝚿ki+1σ(𝐙i)[w]=𝚿ki+1σ(𝐙i)[ϕ(w)].\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[w]=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right)[\phi(w)].

Eq. 31 and 32 now follow as 𝐙i+1=𝚿ki+1σ(𝐙i)\operatorname{\boldsymbol{Z}}^{i+1}=\operatorname{\boldsymbol{\Psi}}_{k_{i+1}}\sigma\left(\operatorname{\boldsymbol{Z}}^{i}\right). ∎
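To make the construction used in this proof concrete, the following Python sketch implements the lazy random walk of Eq. 8, diffusion wavelets consistent with the decomposition used in Eqs. 33-36, and the recursion $\boldsymbol{Z}^{i+1}=\boldsymbol{\Psi}_{k_{i+1}}\sigma(\boldsymbol{Z}^{i})$ from above. It is an illustrative sketch only: the function names are ours, $\boldsymbol{\Psi}_{0}=\boldsymbol{I}-\boldsymbol{P}$ is an assumed convention, np.tanh merely stands in for a strictly monotonic $\sigma$, and the exact definition of $\boldsymbol{U}_{p}$ is the one given in Eq. 13 of the main text.

import numpy as np

def lazy_walk(A):
    # A: dense symmetric adjacency matrix (NumPy array).
    # Lazy random-walk matrix P = (I + A D^{-1}) / 2, matching Eq. 8:
    # (P y)[u] = y[u]/2 + (1/2) sum_{w in N(u)} y[w] / d_w.
    deg = A.sum(axis=0)
    return 0.5 * (np.eye(A.shape[0]) + A / deg)

def wavelet(P, k):
    # Diffusion wavelets: Psi_0 = I - P (assumed convention) and
    # Psi_k = P^{2^{k-1}} - P^{2^k} for k >= 1, as used in Eqs. 33-36.
    if k == 0:
        return np.eye(P.shape[0]) - P
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scatter_path(A, X, p, sigma=np.tanh):
    # Recursion Z^{i+1} = Psi_{k_{i+1}} sigma(Z^i) with Z^0 = X, as in the proof;
    # np.tanh stands in for a strictly monotonic nonlinearity sigma.
    P = lazy_walk(A)
    Z = np.asarray(X, dtype=float)
    for k in p:
        Z = wavelet(P, k) @ sigma(Z)
    return Z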

A.5 Modified Variant of Theorem 2

Theorem 3.

Similar to Theorem 2, let ϕ:VV\phi:V\rightarrow V, vVv\in V and K,LK,L\in\operatorname{\mathbb{N}} such that G(𝒩v¯K+L)ϕG(𝒩ϕ(v)¯K+L)G\left(\operatorname{\mathcal{N}}_{\underline{v}}^{K+L}\right)\cong^{\phi}G\Big{(}\operatorname{\mathcal{N}}_{\underline{\phi(v)}}^{K+L}\Big{)}, and consider any K-intrinsic node feature matrix 𝐗\operatorname{\boldsymbol{X}}. Suppose there exist nodes U𝒩vK+LU\subset\operatorname{\mathcal{N}}_{v}^{K+L} where a structural difference rel. to ϕ\phi and 𝐗\operatorname{\boldsymbol{X}} is manifested. If the nonlinearity σ\sigma is strictly monotonic and there exists a unique uUu\in U with minimal distance from vv and a unique shortest path between uu and vv, we can define a scattering configuration (p,𝚯)(p,\operatorname{\boldsymbol{\Theta}}) such that scattering features Fp-scat(𝐗)F_{p\text{-scat}}(\operatorname{\boldsymbol{X}}) defined as in Eq. 13 discriminate vv and ϕ(v)\phi(v).

In Theorem 3, we replace the assumption of no coincidental correspondence (Definition 4) used in Theorem 2 by the assumption that there is a unique node $u$ at minimal distance from $v$ at which a structural difference is manifested, together with a unique shortest path between $v$ and $u$.

Proof of Theorem 3.

The only difference between Theorem 3 and Theorem 2 is that in Theorem 2 we assume that there is no coincidental correspondence, whereas here we assume that there is a unique $u\in U$ with minimal distance from $v$ and that there is a unique shortest path between $v$ and $u$. Inspecting the proof of Theorem 2, we see that the only place where we used the assumption of no coincidental correspondence was when we invoked Lemma 1. Therefore, it suffices to show that the conclusion of Lemma 1 holds under our revised assumptions. Moreover, inspecting the proof of Lemma 1, we see that the only place where we used the assumption of no coincidental correspondence was to establish Eq. 28. Thus, it suffices to show that Eq. 28 still holds.

Let P(U0,U1,,Ud)P\coloneqq(U_{0},U_{1},\dots,U_{d}) be the generalized path from uu to vv as defined in Notation 1, and note that our assumption of a unique shortest path implies that each UjU_{j} consists of a single node uju_{j}, i.e., Uj{uj}U_{j}\coloneqq\{u_{j}\}. Since the assumption of no coincidental correspondence is not used in the base case, we can use the inductive hypothesis that uju_{j} is the unique element of 𝒩v¯dj\mathcal{N}_{\underline{v}}^{d-j} where a structural difference manifests with respect to 𝐘j\operatorname{\boldsymbol{Y}}^{j}. We are then left to show that this implies that Eq. 28 holds. But Eq. 28 simplifies to 0duj1𝐘j[uj]dϕ(uj)1𝐘j[ϕ(uj)]0\neq d_{u_{j}}^{-1}\operatorname{\boldsymbol{Y}}^{j}[u_{j}]-d_{\phi(u_{j})}^{-1}\operatorname{\boldsymbol{Y}}^{j}[\phi(u_{j})], which is true according to the inductive hypothesis. ∎

Appendix B

B.1 Technical Details

TABLE VI: Chosen hyperparameters for our Sc-GCN model on the examined datasets.
                      Scattering config.                Channel widths
Dataset               α      q    U_{p_1}   U_{p_2}     A^1   A^2   A^3   U_{p_1}   U_{p_2}
Citeseer              0.50   4    Ψ_2       Ψ_2|Ψ_3|    10    10    10    9         30
Cora                  0.35   4    Ψ_1       Ψ_3         10    10    10    11        6
Pubmed                1.00   4    Ψ_1       Ψ_2         10    10    10    13        14
DBLP (1st layer)      1.00   4    Ψ_1       Ψ_2         10    10    10    30        30
DBLP (2nd layer)      0.10   1    Ψ_1       Ψ_2         40    20    20    20        20
TABLE VII: Chosen hyperparameters for our GSAN model on the examined datasets.
Dataset       α     Number of heads   Channel width
Texas 1.5 4 128
Chameleon 0.2 64 32
CoraFull 0.5 20 128
Wiki-CS 0.3 20 20
Citeseer 0.1 8 64
Pubmed 0.1 50 64
Cora 0.1 50 64
DBLP 0.2 16 128

As with most neural networks, when implementing Sc-GCN and GSAN, one must make several architectural choices and tune several hyperparameters. In our experiments with Sc-GCN, we used either one or two hybrid layers, each comprising three GCN channels and two scattering channels, followed by a single residual convolution layer. This fairly simple setup kept network tuning manageable and was sufficient to obtain strong results that outperformed numerous competing methods; it is likely that performance could be improved further with wider or deeper networks. For Cora, Citeseer and Pubmed we used only one hybrid layer, as preliminary experiments indicated that adding a second one was not cost-effective (in light of the additional complexity created by the need to tune more hyperparameters). For DBLP, two layers were used due to a significant increase in performance. We note that even with a single hybrid layer our model achieves 73.1% test accuracy (compared to 81.5% with two layers) and still significantly outperforms GAT (66.1%) and the other methods (below 60%).
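For concreteness, the following PyTorch sketch shows the overall structure of one such hybrid layer: three low-pass (GCN-style) channels and two band-pass (scattering) channels whose outputs are concatenated node-wise. It is a schematic sketch rather than the released implementation; the propagation matrices in ops (powers of the normalized adjacency and the scattering operators U_{p_1}, U_{p_2}) are assumed to be precomputed dense tensors, and the per-channel nonlinearities shown here (ReLU and |·|^q) are placeholders for the definitions given in the main text.

import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    # Schematic hybrid layer: ops = [A_1, A_2, A_3, U_p1, U_p2] are precomputed
    # n x n propagation matrices (torch tensors); channel widths follow Tab. VI.
    def __init__(self, ops, in_dim, widths, q=4):
        super().__init__()
        assert len(ops) == len(widths)
        self.ops = ops
        self.q = q  # exponent applied on the scattering (band-pass) channels
        self.lins = nn.ModuleList(nn.Linear(in_dim, w) for w in widths)

    def forward(self, X):
        outs = []
        for i, (op, lin) in enumerate(zip(self.ops, self.lins)):
            H = op @ lin(X)
            # the last two channels are the scattering channels
            H = torch.abs(H) ** self.q if i >= len(self.ops) - 2 else torch.relu(H)
            outs.append(H)
        return torch.cat(outs, dim=-1)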

To illustrate the utility of our model on large and challenging datasets, we now compare against GCN and GAT on two OGB node classification benchmarks. Our results are shown in Tab. VIII. GSAN achieves 73.9% on ogbn-proteins using 3,832 parameters and 72.3% on ogbn-arxiv using 70,760 parameters. In comparison, GCN achieves 72.5% on ogbn-proteins using 110,120 parameters and 71.7% on ogbn-arxiv using 96,880 parameters, while GAT achieves 73.6% on ogbn-proteins using 1,238,554 parameters and 71.2% on ogbn-arxiv using 950,620 parameters. This suggests that our model can be considerably more parameter-efficient on large datasets; on ogbn-proteins, in particular, we use only about 3.5% of GCN's parameter count. Furthermore, when we increase the width of GSAN, it achieves 76.2% on ogbn-proteins using 31,856 parameters. Thus, our network is able to achieve better performance with fewer learned parameters.

TABLE VIII: Comparison of node classification test accuracy on Open Graph Benchmark datasets proteins and arxiv (higher is better).
Dataset #nodes #edges GCN GAT Sc-GCN GSAN
proteins 1.3e5 4.0e7 0.725 0.736 0.742 0.739
arxiv 1.7e5 1.1e6 0.717 0.712 0.719 0.723

Validation & testing procedure: All tests were done using train-validation-test splits of the datasets. We used the validation accuracy to tune the hyperparameters and report the test accuracy in the comparison tables. To ensure fair comparison, we used the same splits for all methods. On Citeseer, Cora and Pubmed, we used the same settings as in [5] and followed the standard practices of other works that use these datasets. For DBLP, as far as we know, no common standard is established in the literature; here, and for the other node-classification datasets without established splits, we used a ratio of 5:1:1 between train, validation, and test sets. For the graph classification and regression datasets, we used a ratio of 8:1:1.
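For illustration, a split with these ratios can be generated as in the following sketch; the helper below is ours, and the standard Citeseer/Cora/Pubmed splits from [5] are used as provided rather than regenerated.

import numpy as np

def split_indices(n, ratio=(5, 1, 1), seed=0):
    # Random train/validation/test split; ratio=(5, 1, 1) for node classification,
    # (8, 1, 1) for graph classification and regression.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = n * ratio[0] // sum(ratio)
    n_val = n * ratio[1] // sum(ratio)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]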

Hyperparameter tuning: We tuned the hyperparameters for each dataset using a grid search and selected the setting that yielded the highest accuracy on the validation set.

In Sc-GCN, we used grid search to tune the parameter α used in the residual convolution, the exponent q used in the nonlinearity, the scattering channel configurations p_1, p_2 (i.e., the scales used in these two channels), and the widths of the channels in the network, i.e., the hidden dimensions. The results of this tuning process are presented in Tab. VI. For DBLP, the hybrid-layer parameters are shared between the two layers in order to simplify the tuning process; this tuning was generally less exhaustive than for the other three datasets, since our method significantly outperformed all other methods even with limited tuning.
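The tuning loop amounts to an exhaustive search of the following form; the grid values below are illustrative and train_and_evaluate is a hypothetical helper that trains Sc-GCN with the given configuration and returns its validation accuracy.

from itertools import product

grid = {
    "alpha": [0.1, 0.35, 0.5, 1.0],   # residual-convolution parameter
    "q": [1, 4],                      # exponent of the scattering nonlinearity
    "scat_width": [10, 20, 30],       # width of the scattering channels
}

best_acc, best_cfg = -1.0, None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    acc = train_and_evaluate(**cfg)   # hypothetical: train, return validation accuracy
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg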

TABLE IX: Classification accuracies on ogbn-proteins dataset with different α\alpha.
α             0.01          0.1           0.5           1.0           3.0           5.0
Accuracy (%)  72.32 ± 0.55  72.03 ± 0.81  72.45 ± 1.30  73.93 ± 0.19  72.23 ± 0.45  71.45 ± 1.41

For GSAN, the hyperparameter selection is less intricate, which, in addition to improved performance, is one of the main advantages over Sc-GCN. In particular, we use the same scattering configurations and channel widths across all channels in a given layer and do not tune the scattering configurations: we always set m=1 and use the wavelets Ψ_1, Ψ_2, and Ψ_3 in the three scattering channels. Therefore, the hyperparameters reduce to α, a (universal) channel width, and the number of attention heads. These parameters may differ across layers when the architecture uses more than one. The results of this tuning process are presented in Tab. VII.

Hardware & software environment: All experiments were run on the same HPC cluster with an Intel i7-6850K CPU and an NVIDIA TITAN X (Pascal) GPU. Both Sc-GCN and GSAN were implemented using PyTorch [54]. Implementations of all other methods were taken directly from the code accompanying their publications.

B.2 Ablation Study

The two primary novelties in Sc-GCN and GSAN are the scattering channels (i.e., Ψ_1, Ψ_2, and Ψ_3) and the residual convolution (controlled by the hyperparameter α). To explore their contributions and the dependence of the networks' performance on the hyperparameters, Tab. IX shows classification results on the ogbn-proteins dataset for α = 0.01, 0.1, 0.5, 1.0, 3.0, 5.0 over multiple scattering channel configurations. For simplicity, we focus on ogbn-proteins, but note that similar results are observed on the other datasets.

TABLE X: Impact of removing each individual channel from the optimal configuration on ogbn-proteins, while classifying using the remaining channels. Full GSAN accuracy is 73.93%.
Removed channel   A       A^2     A^3     Ψ_3     Ψ_2     Ψ_1
Accuracy (%)      71.22   71.23   72.16   71.47   72.98   70.75

To examine the importance of the parameter α used in the residual graph convolution layer, we first observe that setting α = 0 effectively eliminates this layer (since then A_res = I_n), while increasing α strengthens the filtering effect, approaching random-walk filtering as α → ∞. We evaluate the effect of this parameter on classification accuracy. Our results, displayed in Tab. IX, indicate that increasing α to non-negligible nonzero values improves classification performance, which we interpret as the benefit of removing high-frequency noise. However, when α increases further (in particular, at α = 5 here), the smoothing provided by this layer degrades performance to a level close to that of the traditional GCN [5].
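The limiting behavior described above is reproduced, for instance, by a residual propagation matrix of the form A_res = (I_n + α A D^{-1}) / (1 + α); this particular normalization is an assumption used here only for illustration (the exact definition of A_res is given in the main text), but any matrix with these limits interpolates between the identity at α = 0 and random-walk filtering as α → ∞.

import numpy as np

def residual_matrix(A, alpha):
    # Illustrative residual-convolution matrix (assumed form, see text above):
    # alpha = 0 gives the identity, alpha -> inf approaches the random walk A D^{-1}.
    deg = A.sum(axis=0)
    n = A.shape[0]
    return (np.eye(n) + alpha * (A / deg)) / (1.0 + alpha)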

Finally, to examine the relative importance of the low-pass and band-pass information, we provide an ablation study of the impact each channel. We focus on the ogbn-proteins dataset, set α=1.0\alpha=1.0 and use the best channel configuration, which achieved 73.93% accuracy when all channels were used. Then, we remove each of the band-pass or low-pass channels individually from the network, and reevaluate network performance with the remaining four channels. Our results are displayed in Tab. X. They indicate that the information captured by both the band-pass and the low-pass channels is important and that eliminating either decreases performance. In particular, we note that eliminating 𝚿1\boldsymbol{\Psi}_{1} has a major impact on the accuracy.
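The ablation itself is a simple leave-one-out loop of the following form; train_gsan is a hypothetical stand-in for training and evaluating GSAN on ogbn-proteins with a given list of channels.

channels = ["A", "A^2", "A^3", "Psi_1", "Psi_2", "Psi_3"]

full_acc = train_gsan(channels, alpha=1.0)      # all channels (73.93% reported above)
for removed in channels:
    remaining = [c for c in channels if c != removed]
    acc = train_gsan(remaining, alpha=1.0)      # retrain without one channel
    print(f"without {removed}: {acc:.2f} (full model: {full_acc:.2f})")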

B.3 Implementation

Python code accompanying this work is available on github.com/dms-net/scatteringGCN and github.com/dms-net/Attention-based-Scattering.