Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks
Abstract
Geometric deep learning has made great strides towards generalizing the design of structure-aware neural networks from traditional domains to non-Euclidean ones, giving rise to graph neural networks (GNN) that can be applied to graph-structured data arising in, e.g., social networks, biochemistry, and material science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models are often constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, leading to oversmoothing. Moreover, to avoid severe oversmoothing, most popular GCN-style networks tend to be shallow, with narrow receptive fields, leading to underreaching. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via geometric scattering. We further introduce an attention framework that allows the model to locally attend over combined information from different filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.
Index Terms:
Geometric Deep Learning, Graph Neural Networks, Geometric Scattering, Oversmoothing, Underreaching.

1 Introduction
Deep learning is typically most effective when the structure of the data can be used to design the architecture of the relevant network. For example, the design of recurrent neural networks is informed by the sequential nature of time-series data. Similarly, the design of convolutional neural networks is based in part on the fact that the pixels of an image are arranged in a rectangular grid. The success of neural networks in these, as well as many other applications, has inspired the rise of geometric deep learning[1, 2], which aims to extend the success of deep learning to other forms of structured data and to develop intelligent methods for data sets that have a non-Euclidean structure.
A common approach in geometric deep learning is to model the data by a graph. In many applications, this is done by defining edges between data points that interact in a specific way, e.g., “friends” on a social network. In many other applications, one may construct a graph from a high-dimensional data set, either by defining an edge between each point and its $k$-nearest neighbors or by defining weighted edges via a similarity kernel. Inspired by the increasing ubiquity of graph-structured data, numerous recent works have shown graph neural networks (GNNs) to perform well in a variety of fields including biology, chemistry and social networks [3, 4, 5]. In these methods, the graph is often considered in conjunction with a set of node features, which contain “local” information about, e.g., each user of a social network.
One common family of tasks are graph-level tasks, where one seeks to learn a whole-graph representation for the purposes of, e.g., predicting properties of proteins [6, 7, 8]. Another common family, which has been the primary focus of graph convolutional networks (GCNs) [5], are node-level tasks such as node classification. There, the entire data set is modeled as one large graph and the network aims to produce a useful representation of each node using both the node features and the graph structure. This work is typically conducted in the semi-supervised setting where one only knows the labels of a small fraction of the nodes.
Many popular state-of-the-art GNNs essentially aim to promote similarity between adjacent nodes, which may be interpreted as a smoothing operation. While this is effective in certain settings, it can also cause a decrease in performance because of the oversmoothing problem [9], where nodes become increasingly indistinguishable from one another after each subsequent layer. In order to partially mitigate the oversmoothing problem, many popular GNNs only use two or three layers. While this does help avoid oversmoothing, it creates a new problem, underreaching, where the network is unable to incorporate long-range dependencies or global geometry.
Here, we propose to augment traditional GNN architectures by also including novel band-pass filters, in addition to conventional GCN-style filters that essentially perform low-pass filtering [10] (throughout this text, we use the term GCN to refer to the network introduced by Kipf and Welling in [5], and the term GNN to refer to graph neural networks in general, whether spectral, convolutional, or otherwise), in order to extract richer representations for each node. This approach is based on the geometric scattering transform [11, 12, 13], whose construction is inspired by the Euclidean scattering transform introduced by Mallat in [14], and utilizes iterative cascades of graph wavelets [15, 16] and pointwise nonlinear activation functions to produce useful graph data representations. Notably, these representations are able to capture both high-frequency information and long-range dependencies.
The main contribution of this work is a pair of hybrid GNN frameworks that utilize both traditional GCN-style filters and novel filters based on the scattering transform. This approach is based on the following simple idea: GCN-based networks are very useful, but as they aim to enforce similarity among nodes, they essentially focus on low-frequency information. Wavelets, on the other hand, are naturally equipped to capture high-frequency information. In the spatial domain, the GCN-style filters can be thought of as focusing on localized information, whereas the wavelets are able to capture longer-range interactions. Therefore, in a hybrid network, the different channels capture different types of information, making such a network more powerful than one that only uses a single style of filter.
We also introduce complementary GNN modules that enhance the performance of such hybrid scattering models, including (i) the graph residual convolution, an adaptive low-pass filter that corrects high-frequency noise, and (ii) an attention framework that enables the aggregation of information from different filters individually at every node. We present theoretical results, based on a new notion of graph structural difference, that highlight the sensitivity of scattering filters to graph regularity patterns not captured by GCN filters due to underreaching. Extensive empirical experiments demonstrate the ability of hybrid scattering networks, in (transductive) semi-supervised node classification, to (i) alleviate oversmoothing and (ii) generalize to complex (low-homophily) datasets. Moreover, we also show that our framework translates well to inductive graph-level tasks.
The remainder of this paper is organized as follows. We review related work on GNN models and geometric scattering in Sec. 2 and introduce important concepts that will be used throughout this work in Sec. 3. We then formulate the hybrid scattering network in Sec. 4, followed by a theoretical study of the benefits of such models in Sec. 5. In Sec. 6, we present empirical results before concluding in Sec. 7.
2 Related Work
Theoretical analyses [9, 10] of GCN and related models show that they may be viewed as Laplacian smoothing operations and, from the signal processing perspective, essentially perform low-pass filtering of the graph features. One approach towards addressing this problem is the graph attention network proposed by [17], which uses self-attention mechanisms to address these shortcomings by adaptively reweighting local neighborhoods. In [18], the authors construct a low-rank approximation of the graph Laplacian that efficiently gathers multi-scale information and demonstrate the effectiveness of their method on citation networks and the QM8 quantum chemistry dataset. In [19], the authors take an approach similar to GCN, but use multiple powers of the adjacency matrix to learn higher-order neighborhood information. Finally, in [20], the authors use graph wavelets to extract higher-order neighborhood information.
In addition to the learned networks discussed above, several works[11, 12, 13] have introduced different variations of the graph scattering transform. These papers aim to extend the Euclidean scattering transform of Mallat [14] to graph-structured data and propose predesigned, wavelet-based networks. In [13, 11, 21, 22], extensive theoretical studies of these networks show that they have desirable stability, invariance, and conservation of energy properties. The practical utility of these networks has been established in [12], which primarily focuses on graph classification, and in [23, 24, 25], which used the graph scattering transform to generate molecules. Building off of these results, which use handcrafted formulations of the scattering transform, recent work [26] has proposed a framework for a data-driven tuning of the traditionally handcrafted geometric scattering design that maintains the theoretical properties from traditional designs, while also showing strong empirical results in whole-graph settings.
3 Geometric Deep Learning Background
3.1 Graph Signal Processing
Let $G = (V, E, w)$ be a weighted graph, characterized by a set of nodes (also called vertices) $V := \{v_1, \dots, v_n\}$, a set of undirected edges $E$, and a function $w : E \to (0, \infty)$ assigning positive edge weights to the edges. Let $X \in \mathbb{R}^{n \times d}$ be a node features matrix. We shall interpret the $i$-th row of $X$ as representing the features of the node $v_i$, and therefore, we shall denote these rows by either $X[v_i]$ or $x_{v_i}$. The columns of $X$, on the other hand, will be denoted by $x_1, \dots, x_d \in \mathbb{R}^n$. Each of these columns may be naturally identified with a graph signal, i.e., a function $x : V \to \mathbb{R}$, $x[v_i] := x(v_i)$. In what follows, for simplicity, we will not distinguish between the vectors $x \in \mathbb{R}^n$ and the functions $x : V \to \mathbb{R}$ and will refer to both as graph signals.
We define the weighted adjacency matrix $W \in \mathbb{R}^{n \times n}$ of $G$ by $W[v_i, v_j] := w(v_i, v_j)$ if $\{v_i, v_j\} \in E$, and set all other entries to zero. We further define the degree matrix $D$ as the diagonal matrix with each diagonal element $D[v_i, v_i] := \deg(v_i)$ being the degree of a node $v_i$. In the following, we will also use the shorthand $d_i$ to denote the degree of $v_i$. We consider the combinatorial graph Laplacian matrix $L := D - W$ and the symmetric normalized Laplacian given by
$$\mathcal{L} := D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}.$$
It is well known that this matrix is symmetric, positive semi-definite, and admits an orthonormal basis of eigenvectors $q_1, \dots, q_n$ such that $\mathcal{L} q_i = \lambda_i q_i$ with $0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n \leq 2$. Therefore, we may write
$$\mathcal{L} = Q \Lambda Q^\top,$$
where $\Lambda := \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ and $Q$ is the orthogonal matrix whose $i$-th column is $q_i$.
We will use the eigendecomposition of $\mathcal{L}$ to define the graph Fourier transform, with the eigenvectors $q_i$ being interpreted as Fourier modes. The notion of oscillation on irregular domains like graphs is delicate, but can be reframed in terms of increasing variation of the modes, with the eigenvalues $\lambda_i$ interpreted as (squared) frequencies. (This interpretation originates from motivating the graph Fourier transform via the combinatorial graph Laplacian $L$, with the quadratic form $x^\top L x$ measuring the variation of $x$.) The Fourier transform of a graph signal $x$ is defined by $\hat{x}[i] := \langle x, q_i \rangle$ for $1 \leq i \leq n$, and the inverse Fourier transform may be computed by $x = \sum_{i=1}^{n} \hat{x}[i]\, q_i$. It will frequently be convenient to write these equations in matrix form as $\hat{x} = Q^\top x$ and $x = Q \hat{x}$.
We recall that in the Euclidean setting, the convolution of two signals in the spatial domain corresponds to the pointwise multiplication of their Fourier transforms. Therefore, we may define the convolution of a signal $x$ with a filter $g$ by the rule that $g \star x$ is the unique vector such that $\widehat{g \star x}[i] = \hat{g}[i]\, \hat{x}[i]$ for $1 \leq i \leq n$. Applying the inverse Fourier transform, one may verify that
$$g \star x = Q \hat{G} Q^\top x, \tag{1}$$
where $\hat{G} := \operatorname{diag}(\hat{g}[1], \dots, \hat{g}[n])$. Hence, convolutional graph filters can be parameterized by considering the Fourier coefficients in $\hat{G}$.
3.2 Spectral Graph Neural Network Constructions
A graph filter is a function that transforms a node feature matrix $X \in \mathbb{R}^{n \times d}$ into a new feature matrix $X' \in \mathbb{R}^{n \times d'}$. GNNs typically feature several layers, each of which produces a new set of features by filtering the output of the previous layer. We will usually let $H^{(0)} := X$ denote the initial node feature matrix, which is the input to the network, and let $H^{(\ell)}$ denote the node feature matrix after the $\ell$-th layer.
In light of Eq. 1, a natural way to construct learned graph filters would be to directly learn the Fourier coefficients in $\hat{G}$. Indeed, this was the approach used in the pioneering work of Bruna et al. [27]. However, this approach has several drawbacks. First, it results in $n$ learnable parameters in each convolutional filter. Therefore, a network using such filters would not scale well to large data sets due to the computational cost. At the same time, such filters are not necessarily well-localized in space and are prone to overfitting [28]. Moreover, networks of the form introduced in [27] typically cannot be generalized to different graphs [1]. However, recent work [29] has shown that this latter issue can be overcome by formulating the Fourier coefficients as smooth functions of the Laplacian eigenvalues, i.e., $\hat{g}[i] = g(\lambda_i)$ for $1 \leq i \leq n$. In particular, this will be the case for the filters used in the networks considered in this work.
A common approach (e.g., used in [28, 5, 30, 31, 18]) to formulate such filters is by using polynomials of the Laplacian eigenvalues to set $g(\lambda_i) := \sum_{k=0}^{K} \theta_k \lambda_i^k$ (or equivalently $\hat{G} := \sum_{k=0}^{K} \theta_k \Lambda^k$) for some $K \in \mathbb{N}$. It can be verified [28] that this approach yields convolutional filters that are $K$-localized in space and that can be written as $g \star x = \sum_{k=0}^{K} \theta_k \mathcal{L}^k x$. This reduces the number of trainable parameters in each filter from $n$ to $K + 1$ and allows one to perform convolution without explicitly computing the spectral decomposition of the Laplacian, which is expensive for large graphs.
One particularly noteworthy network that uses this method is [28], which writes the filters in terms of the Chebyshev polynomials defined by $T_0(x) = 1$, $T_1(x) = x$, and $T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)$. They first renormalize the eigenvalue matrix as $\tilde{\Lambda} := \frac{2}{\lambda_n} \Lambda - I_n$ and then define $\hat{G} := \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})$. Letting $\tilde{\mathcal{L}} := \frac{2}{\lambda_n} \mathcal{L} - I_n$, this yields
$$g \star x = \sum_{k=0}^{K} \theta_k T_k(\tilde{\mathcal{L}})\, x. \tag{2}$$
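As a sketch of how Eq. 2 avoids the eigendecomposition, the following function applies a K-th order Chebyshev filter via the three-term recursion; the coefficient vector theta and the choice lam_max = 2 are illustrative placeholders, not the reference ChebNet implementation.

```python
import numpy as np

def chebyshev_filter(L_sym: np.ndarray, x: np.ndarray, theta: np.ndarray,
                     lam_max: float = 2.0) -> np.ndarray:
    """Apply the polynomial filter of Eq. 2 using the Chebyshev recursion."""
    n = L_sym.shape[0]
    L_tilde = (2.0 / lam_max) * L_sym - np.eye(n)   # rescaled Laplacian
    T_prev, T_curr = x, L_tilde @ x                 # T_0(L~) x and T_1(L~) x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev    # three-term recursion
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out
```

The recursion keeps the cost at $K$ matrix-vector products (sparse in practice), which is what makes the polynomial parameterization scale to large graphs.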
3.3 Graph Convolutional Networks
One of the most widely used GNNs is the Graph Convolutional Network (GCN) [5]. This network is derived from the Chebyshev construction [28] mentioned above by setting $K = 1$ in Eq. 2 and approximating $\lambda_n \approx 2$, which yields
$$g \star x \approx \theta_0 x - \theta_1 D^{-1/2} W D^{-1/2} x.$$
To further reduce the number of trainable parameters, the authors then set $\theta := \theta_0 = -\theta_1$. The resulting convolutional filter has the form
$$g_\theta \star x = \theta \left( I_n + D^{-1/2} W D^{-1/2} \right) x. \tag{3}$$
One may verify that $I_n + D^{-1/2} W D^{-1/2} = 2 I_n - \mathcal{L}$, and therefore, Eq. 3 essentially corresponds to setting $\hat{g}[i] := \theta (2 - \lambda_i)$ in Eq. 1. The eigenvalues of $I_n + D^{-1/2} W D^{-1/2}$ take values in $[0, 2]$. Thus, to avoid vanishing or exploding gradients, the authors use a renormalization trick
$$I_n + D^{-1/2} W D^{-1/2} \;\to\; \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}, \tag{4}$$
where $\tilde{W} := W + I_n$ and $\tilde{D}$ is a diagonal matrix with $\tilde{D}[v_i, v_i] := 1 + D[v_i, v_i]$ for $1 \leq i \leq n$. Setting $\tilde{P} := \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$ and using multiple channels, we obtain a layer-wise propagation rule $H^{(\ell-1)} \mapsto H^{(\ell)} \in \mathbb{R}^{n \times C_\ell}$, where $C_\ell$ is the number of channels used in the $\ell$-th layer and $\sigma$ is an elementwise nonlinearity. In matrix form we write
$$H^{(\ell)} = \sigma\!\left( \tilde{P} H^{(\ell-1)} \Theta^{(\ell)} \right), \qquad \Theta^{(\ell)} \in \mathbb{R}^{C_{\ell-1} \times C_\ell}. \tag{5}$$
We interpret the matrix $\tilde{P}$ as computing a localized average of each channel around each node, and the matrix $\Theta^{(\ell)}$ as sharing information across channels. This filter can also be observed at the node level as
$$H^{(\ell)}[v] = \sigma\!\left( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{\tilde{W}[v, u]}{\sqrt{\tilde{d}_v \tilde{d}_u}} \left( \Theta^{(\ell)} \right)^\top H^{(\ell-1)}[u] \right),$$
where $\mathcal{N}(v)$ denotes the one-step neighborhood of node $v$ and $\tilde{d}_v := \tilde{D}[v, v]$. This process can be split into three steps:
$$\bar{H}^{(\ell)}[u] := \left( \Theta^{(\ell)} \right)^\top H^{(\ell-1)}[u], \tag{6a}$$
$$\tilde{H}^{(\ell)}[v] := \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{\tilde{W}[v, u]}{\sqrt{\tilde{d}_v \tilde{d}_u}}\, \bar{H}^{(\ell)}[u], \tag{6b}$$
$$H^{(\ell)}[v] := \sigma\!\left( \tilde{H}^{(\ell)}[v] \right), \tag{6c}$$
which we refer to as the transformation step (Eq. 6a), the aggregation step (Eq. 6b) and the activation step (Eq. 6c).
As discussed earlier, the GCN filter described above may be viewed as a low-pass filter that suppresses high-frequency information. For simplicity, we focus on the convolution in Eq. 3 before the renormalization. This convolution essentially corresponds to pointwise Fourier multiplication by $\hat{g}[i] = \theta (2 - \lambda_i)$, which is strictly decreasing in $\lambda_i$. Therefore, repeated applications of this filter effectively zero out the higher frequencies. This is consistent with the oversmoothing problem discussed in [9].
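A minimal sketch of a single GCN layer (Eq. 5) with the renormalization trick of Eq. 4 is given below; the weight matrix and the ReLU nonlinearity are illustrative stand-ins for learned and tuned components.

```python
import numpy as np

def gcn_layer(W: np.ndarray, H: np.ndarray, Theta: np.ndarray) -> np.ndarray:
    """One GCN layer: renormalized propagation, channel mixing, ReLU."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)                        # add self-loops
    d_tilde = W_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    P_tilde = D_inv_sqrt @ W_tilde @ D_inv_sqrt    # renormalized propagation matrix
    return np.maximum(P_tilde @ H @ Theta, 0.0)    # aggregate, transform, activate

# toy usage with random features and weights
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = np.random.randn(3, 4)                         # input node features
H1 = gcn_layer(W, H0, np.random.randn(4, 8))
```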
3.4 Graph Attention Networks
Another popular network that is widely used for node classification tasks is the graph attention network (GAT) [17], which uses an attention mechanism to guide and adjust the aggregation of features from adjacent nodes. First, the node features are linearly transformed to $\bar{x}_v := \Theta x_v$ using a learned weight matrix $\Theta$. Then, the aggregation coefficients are learned via
$$\alpha_{vu} := \operatorname{softmax}_u\!\left( \operatorname{LeakyReLU}\!\left( a^\top \left[ \bar{x}_v \,\Vert\, \bar{x}_u \right] \right) \right), \qquad u \in \mathcal{N}(v) \cup \{v\},$$
where $a$ is a shared attention vector and $\Vert$ denotes horizontal concatenation. The output feature corresponding to a single attention head is given by $x'_v := \sigma\big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu} \bar{x}_u \big)$. To increase the expressivity of this network, the authors then use a multi-headed attention mechanism, with $\gamma$ heads, to generate concatenated features
$$x'_v := \Big\Vert_{k=1}^{\gamma} \sigma\!\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha^{k}_{vu}\, \Theta^{k} x_u \Big). \tag{7}$$
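The sketch below implements one GAT attention head in dense NumPy, under the standard decomposition of the attention vector into a "source" and "target" part; the masking with the adjacency pattern and the LeakyReLU slope of 0.2 are common defaults, not details taken from this paper.

```python
import numpy as np

def gat_head(W_adj: np.ndarray, X: np.ndarray, Theta: np.ndarray,
             a: np.ndarray, slope: float = 0.2) -> np.ndarray:
    """One GAT attention head; a has length 2 * (output width)."""
    n = W_adj.shape[0]
    Xb = X @ Theta                                           # linear transformation
    d_out = Xb.shape[1]
    # attention logits e[v, u] = LeakyReLU(a^T [Xb[v] || Xb[u]])
    e = (Xb @ a[:d_out])[:, None] + (Xb @ a[d_out:])[None, :]
    e = np.where(e > 0, e, slope * e)
    # restrict attention to one-step neighborhoods (with self-loops)
    mask = (W_adj + np.eye(n)) > 0
    e = np.where(mask, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)         # softmax over neighbors
    return np.maximum(alpha @ Xb, 0.0)                       # weighted aggregation + ReLU
```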
3.5 Challenges in Geometric Deep Learning
Many GNN models, including GCN [5] and GAT [17], are subject to the so-called oversmoothing problem [9], caused by aggregation steps (such as Eq. 6b) that essentially consist of localized averaging operations. As discussed in Sec. 3.3 and also [10], from a signal processing point of view, this corresponds to a low-pass filtering of graph signals. Moreover, as discussed in [32], these networks are also subject to underreaching. Most GNNs (including GCN and GAT) can only relate information from nodes within a distance equal to the number of GNN layers, and because of the oversmoothing problem, they typically use a small number of layers in practice. Therefore, the oversmoothing and underreaching problems combine to significantly limit the ability of GNNs to capture long-range dependencies. In Sec. 4, we will introduce a hybrid network, which aims to address these challenges by using both GCN-style channels and channels based on the geometric scattering transform discussed below.
3.6 Geometric Scattering
In this section, we review the geometric scattering transform constructed in [12] for graph classification and show how it may be adapted for node-level tasks. As we shall see, this node-level geometric scattering will address the challenges discussed above in Sec. 3.5, by using band-pass filters that capture high-frequency information and have wider receptive fields.
The geometric scattering transform uses wavelets based upon raising the lazy random walk matrix
$$P := \tfrac{1}{2} \left( I_n + W D^{-1} \right)$$
to dyadic powers $2^k$, which can be interpreted as differing degrees of resolution. The right eigenvectors and eigenvalues of $P$ are $D^{1/2} q_i$ and $1 - \lambda_i / 2$, respectively. Entrywise, we note that
$$P x[v_i] = \frac{1}{2}\, x[v_i] + \frac{1}{2} \sum_{v_j \in \mathcal{N}(v_i)} \frac{W[v_i, v_j]}{\deg(v_j)}\, x[v_j]. \tag{8}$$
Thus, $P$ may be viewed as a localized averaging operator analogous to those used in, e.g., GCN, and the powers $P^{2^k}$ may be viewed as low-pass filters which suppress high frequencies. In order to better retain this high-frequency information, we define multiscale graph diffusion wavelets by subtracting these low-pass filters at different scales [16]. Specifically, for $k \geq 1$, we define a wavelet at scale $2^k$ by
$$\Psi_k := P^{2^{k-1}} - P^{2^k}. \tag{9}$$
We may interpret each $\Psi_k$ as capturing information at a different frequency band. From a spatial perspective, we may view $\Psi_k x$ as encoding information on how a $2^k$-step neighborhood differs from a smaller $2^{k-1}$-step one. Such wavelets are usually organized in a filter bank
$$\mathcal{W}_J := \left\{ \Psi_k \right\}_{1 \leq k \leq J}, \tag{10}$$
along with a low-pass filter $\Phi_J := P^{2^J}$. Proposition 2.2 of [22] (restated here as Proposition 1) shows that this filter bank is a self-adjoint, non-expansive frame on a suitably weighted inner-product space.
[Fig. 1: Illustration of the geometric scattering cascade, which alternates wavelet filterings and pointwise nonlinear activations.]
The geometric scattering transform is a multi-layered architecture that iteratively applies wavelet convolutions and nonlinear activation functions in an alternating cascade, as illustrated in Fig. 1. It is parameterized by paths $p := (j_1, \dots, j_m)$. Formally, we define
$$U_p X := \Psi_{j_m}\, \sigma\!\left( \Psi_{j_{m-1}} \cdots \sigma\!\left( \Psi_{j_2}\, \sigma\!\left( \Psi_{j_1} X \right) \right) \cdots \right), \tag{11}$$
where $\sigma$ is a nonlinear activation function. (In a slight deviation from previous work, $U_p$ here does not include the outermost nonlinearity in the cascade.) We note that in other work focusing on graph classification, the scattering features are frequently aggregated into $q$-th-order moments,
$$S_{p,q} X := \sum_{i=1}^{n} \left| U_p X[v_i] \right|^q. \tag{12}$$
We also note that the nonlinearity might vary in each step of the cascade. However, we will suppress this possible dependence to avoid cumbersome notation. We also note that in our theoretical results, if we assume, for example, that $\sigma$ is strictly increasing, this assumption is intended to apply to all nonlinearities used in the cascade. In our experiments, we use the absolute value, i.e., $\sigma(\cdot) = |\cdot|$.
The original formulations of geometric scattering were fully designed networks without any learned convolutions between channels. Here, we will incorporate learning by defining the following scattering propagation rule, similar to the one used by GCN:
$$H^{(\ell)} := \sigma\!\left( U_p H^{(\ell-1)} \Theta^{(\ell)} \right). \tag{13}$$
Analogously to GCN, we note that for a path $p = (k)$ of length one, the scattering propagation rule can be decomposed into three steps:
$$\bar{H}^{(\ell)}[u] := \left( \Theta^{(\ell)} \right)^\top H^{(\ell-1)}[u], \tag{14a}$$
$$\tilde{H}^{(\ell)}[v] := \sum_{u \in V} \Psi_k[v, u]\, \bar{H}^{(\ell)}[u], \tag{14b}$$
$$H^{(\ell)}[v] := \sigma\!\left( \tilde{H}^{(\ell)}[v] \right). \tag{14c}$$
Importantly, we note that the scattering transform addresses the underreaching problem, as the wavelets leveraged in the aggregation step (Eq. 14b) have larger receptive fields than those of most traditional GNNs. However, scattering does not result in oversmoothing because the subtraction in Eq. 9 yields band-pass rather than low-pass filters. In this manner, the scattering transform addresses the challenges discussed in the previous subsection.
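The following sketch assembles the pieces of this subsection: the lazy random walk matrix P, the diffusion wavelets of Eq. 9, and a scattering cascade U_p for a path p (Eq. 11) with the absolute value as nonlinearity. It is a dense toy implementation for illustration, not an efficient or official one.

```python
import numpy as np

def lazy_random_walk(W: np.ndarray) -> np.ndarray:
    n = W.shape[0]
    D_inv = np.diag(1.0 / W.sum(axis=1))
    return 0.5 * (np.eye(n) + W @ D_inv)

def wavelet(P: np.ndarray, k: int) -> np.ndarray:
    # Psi_k = P^{2^(k-1)} - P^{2^k}: a band-pass filter at scale 2^k (Eq. 9)
    return (np.linalg.matrix_power(P, 2 ** (k - 1))
            - np.linalg.matrix_power(P, 2 ** k))

def scattering(P: np.ndarray, X: np.ndarray, path: tuple) -> np.ndarray:
    # U_p X = Psi_{j_m} |Psi_{j_{m-1}} ... |Psi_{j_1} X| ... |  (Eq. 11)
    H = wavelet(P, path[0]) @ X
    for j in path[1:]:
        H = wavelet(P, j) @ np.abs(H)
    return H

W = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.random.randn(3, 5)
U_p = scattering(lazy_random_walk(W), X, path=(1, 2))   # second-order scattering
```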
4 Hybrid Scattering Networks
Here, we introduce two hybrid networks that combine aspects of GCN and the geometric scattering transform discussed in the previous section. Our networks will use both low-pass and band-pass filters in different channels to capture different types of information. As a result, our hybrid networks will have greater expressive power than either traditional GCNs, which only use low-pass filters, or a pure scattering network, which only uses band-pass filters.
In our low-pass channels, we use modified GCN filters, which are similar to those used in Eq. 5 but use higher powers of $\tilde{P}$ and include bias terms. Specifically, we use a channel update rule of the form
$$H^{(\ell)}_{\mathrm{low},k} := \sigma\!\left( \tilde{P}^{\,r_k} H^{(\ell-1)} \Theta^{(\ell)}_{\mathrm{low},k} + B^{(\ell)}_{\mathrm{low},k} \right) \tag{15}$$
for $1 \leq k \leq C_{\mathrm{low}}$. We note, in particular, that the use of higher powers $r_k$ of $\tilde{P}$ enables a wider receptive field (of radius $r_k$), without increasing the number of trainable parameters (unlike the stacking of several layers in GCN).
Similarly, in our band-pass channels, we use a version of Eq. 13, with an added bias term, and our update rule is
$$H^{(\ell)}_{\mathrm{band},k} := \sigma\!\left( U_{p_k} H^{(\ell-1)} \Theta^{(\ell)}_{\mathrm{band},k} + B^{(\ell)}_{\mathrm{band},k} \right) \tag{16}$$
for $1 \leq k \leq C_{\mathrm{band}}$. Here, similarly to Eq. 11, $p_k$ is a path that determines the cascade of wavelets used in the $k$-th channel.
Aggregation module. Each hybrid layer uses $C_{\mathrm{low}}$ low-pass and $C_{\mathrm{band}}$ band-pass channels, described in Eq. 15 and Eq. 16, to transform the node features $H^{(\ell-1)}$. The resulting filter responses are aggregated to new $d_\ell$-dimensional node representations via an aggregation module such as those discussed in Sections 4.1 and 4.2,
$$H^{(\ell)} := \operatorname{AGG}\!\left( H^{(\ell)}_{\mathrm{low},1}, \dots, H^{(\ell)}_{\mathrm{low},C_{\mathrm{low}}},\, H^{(\ell)}_{\mathrm{band},1}, \dots, H^{(\ell)}_{\mathrm{band},C_{\mathrm{band}}} \right). \tag{17}$$
Graph Residual Convolution. After aggregating the outputs of the low-pass channels and band-pass channels in Eq. 17, we apply the graph residual convolution, which acts as a low-pass filter and aims to eliminate any high-frequency noise introduced by the scattering channels. This noise can arise, e.g., if label rates vary across different graph substructures. In this case, the scattering features may learn the difference between labeled and unlabeled nodes and thereby produce high-frequency noise.
This filter uses a modified diffusion matrix given by
$$A_{\mathrm{res}}(\alpha) := \frac{1}{\alpha + 1} \left( I_n + \alpha\, W D^{-1} \right),$$
where the hyperparameter $\alpha$ determines the magnitude of the low-pass filtering. Choosing $\alpha = 0$ yields the identity (no filtering), while $\alpha \to \infty$ results in the random walk matrix $W D^{-1}$. Thus, $A_{\mathrm{res}}(\alpha)$ can be interpreted as lying between a completely lazy random walk that never moves and a non-resting one that moves at every time step.
The full residual convolution update rule is given by
$$H^{(\mathrm{out})} := A_{\mathrm{res}}(\alpha)\, H^{(\ell)}\, \Theta_{\mathrm{res}}.$$
The multiplication with $\Theta_{\mathrm{res}}$ corresponds to a fully connected layer applied to the output from Eq. 17 (after filtering by $A_{\mathrm{res}}(\alpha)$), with each neuron learning a linear combination of the signals output by the aggregation module.
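A minimal sketch of the graph residual convolution is given below, assuming the modified diffusion matrix interpolates between the identity (alpha = 0) and the random walk matrix W D^{-1} (alpha -> infinity) as described above; Theta_res stands in for the learned weights.

```python
import numpy as np

def graph_residual_conv(W: np.ndarray, H: np.ndarray, Theta_res: np.ndarray,
                        alpha: float) -> np.ndarray:
    """Low-pass correction applied after the aggregation module."""
    n = W.shape[0]
    D_inv = np.diag(1.0 / W.sum(axis=1))
    A_res = (np.eye(n) + alpha * W @ D_inv) / (1.0 + alpha)   # modified diffusion matrix
    return A_res @ H @ Theta_res                              # filter, then mix channels
```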
4.1 Scattering GCN
The Scattering GCN (Sc-GCN) network was first introduced in [33]. Here, the aggregation module concatenates the filter responses horizontally, yielding wide node representations of the form
$$H^{(\ell)} := \left[ H^{(\ell)}_{\mathrm{low},1} \,\Vert\, \cdots \,\Vert\, H^{(\ell)}_{\mathrm{low},C_{\mathrm{low}}} \,\Vert\, H^{(\ell)}_{\mathrm{band},1} \,\Vert\, \cdots \,\Vert\, H^{(\ell)}_{\mathrm{band},C_{\mathrm{band}}} \right]. \tag{18}$$
Sc-GCN then learns relationships between the channels via the graph residual convolution.
Configuration. The primary goal of Sc-GCN is to alleviate oversmoothing in popular semi-supervised node-level tasks on, e.g., citation networks. As regularity patterns in such datasets are often dominated by low-frequency information such as inter-cluster node-similarities, we choose our parameters to focus on low-frequency information. We use three low-pass filters, with receptive fields of size (or radius) $r \in \{1, 2, 3\}$, and two band-pass filters. We use $\sigma(\cdot) = |\cdot|$ as our nonlinearity in all steps except the outermost nonlinearity. Inspired by the aggregation step in classical geometric scattering [12], for the outermost nonlinearity, we additionally apply the power $q$ at the node level, i.e., $\sigma(\cdot) = |\cdot|^q$.
The paths $p_k$ and the parameter $\alpha$ of the graph residual convolution are tuned as hyperparameters of the network.
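Putting the pieces together, the sketch below assembles one Sc-GCN hybrid layer under the configuration described above: three low-pass channels using powers of the renormalized GCN matrix, two band-pass (wavelet) channels, horizontal concatenation as in Eq. 18, and the graph residual convolution. The channel width, the first-order paths (1) and (2), and the random weight matrices are illustrative assumptions rather than the exact published configuration.

```python
import numpy as np

def sc_gcn_layer(W, H, q=4, alpha=0.5, width=16, rng=np.random.default_rng(0)):
    """One hybrid Sc-GCN layer on a dense adjacency matrix W with features H."""
    n, d = H.shape
    # low-pass part: renormalized GCN matrix P_tilde and its powers r = 1, 2, 3
    W_t = W + np.eye(n)
    D_t = np.diag(1.0 / np.sqrt(W_t.sum(axis=1)))
    P_t = D_t @ W_t @ D_t
    # band-pass part: lazy walk P and wavelets Psi_1, Psi_2
    P = 0.5 * (np.eye(n) + W @ np.diag(1.0 / W.sum(axis=1)))
    Psi = {k: np.linalg.matrix_power(P, 2 ** (k - 1)) - np.linalg.matrix_power(P, 2 ** k)
           for k in (1, 2)}
    channels = []
    for r in (1, 2, 3):   # Eq. 15 with sigma = |.|^q and random stand-in weights
        channels.append(np.abs(np.linalg.matrix_power(P_t, r) @ H
                               @ rng.standard_normal((d, width))) ** q)
    for k in (1, 2):      # Eq. 16 with length-one paths
        channels.append(np.abs(Psi[k] @ H @ rng.standard_normal((d, width))) ** q)
    H_wide = np.concatenate(channels, axis=1)          # Eq. 18: concatenation
    # graph residual convolution
    A_res = (np.eye(n) + alpha * W @ np.diag(1.0 / W.sum(axis=1))) / (1.0 + alpha)
    return A_res @ H_wide @ rng.standard_normal((5 * width, width))
```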
4.2 Geometric Scattering Attention Network
An important observation regarding Sc-GCN above is that the model decides globally how to combine different channel information. The network first concatenates all of the features from the low-pass and band-pass channels in Eq. 18 and then combines these features via multiplication with the weight matrix $\Theta_{\mathrm{res}}$. However, for complex tasks or datasets, important regularity patterns may vary significantly across different graph regions. In such settings, a model should ideally attend locally over the aggregation and adaptively combine filter responses at different nodes.
This observation inspired the design of the Geometric Scattering Attention Network (GSAN) [34]. Drawing inspiration from [35], GSAN uses an aggregation module based on a multi-head node-level attention framework. However, the attention mechanism used in GSAN differs from [35] by attending over the combination of the different filter responses rather than over the combination of node features from neighboring nodes. We will first focus on the processing performed independently by each attention head and then discuss a multi-head configuration.
[Fig. 3: Aggregation process of the GSAN attention layer, combining low-pass and band-pass filter responses via node-wise attention.]
In a slight deviation from Sc-GCN, the weight matrices in Eq. 15 and Eq. 16 are shared across the filters of each attention head. Therefore, both aggregation steps (Eq. 6b and Eq. 14b) take the same input, denoted by $\bar{H}^{(\ell)} := H^{(\ell-1)} \Theta^{(\ell)}$. Next, we define $\tilde{H}^{(\ell)}_{\mathrm{low},k}$ and $\tilde{H}^{(\ell)}_{\mathrm{band},k}$ to be the outputs of the aggregation steps, Eq. 6b and Eq. 14b, with the bias terms set to zero. We then compute attention score vectors that will be used to determine the importance of each of the filter responses for every single node. We calculate
$$e^{(\ell)}_{\mathrm{low},k} := \operatorname{LeakyReLU}\!\left( \left[ \bar{H}^{(\ell)} \,\Vert\, \tilde{H}^{(\ell)}_{\mathrm{low},k} \right] a \right),$$
with analogous $e^{(\ell)}_{\mathrm{band},k}$ and $a \in \mathbb{R}^{2 d_\ell}$ being a learned attention vector that is shared across all filters of the attention head. These attention scores are then normalized across all filters using the softmax function. Specifically, we define
$$\alpha^{(\ell)}_{\mathrm{low},k} := \frac{\exp\!\left( e^{(\ell)}_{\mathrm{low},k} \right)}{\sum_{k'=1}^{C_{\mathrm{low}}} \exp\!\left( e^{(\ell)}_{\mathrm{low},k'} \right) + \sum_{k'=1}^{C_{\mathrm{band}}} \exp\!\left( e^{(\ell)}_{\mathrm{band},k'} \right)},$$
where the exponential function is applied elementwise, and define $\alpha^{(\ell)}_{\mathrm{band},k}$ analogously. Finally, for every node, the filter responses are summed together, weighted by the corresponding (node-to-filter) attention score. We also normalize by the number of filters $C := C_{\mathrm{low}} + C_{\mathrm{band}}$, which gives
$$H^{(\ell)}_{\mathrm{head}} := \sigma\!\left( \frac{1}{C} \left( \sum_{k=1}^{C_{\mathrm{low}}} \alpha^{(\ell)}_{\mathrm{low},k} \odot \tilde{H}^{(\ell)}_{\mathrm{low},k} + \sum_{k=1}^{C_{\mathrm{band}}} \alpha^{(\ell)}_{\mathrm{band},k} \odot \tilde{H}^{(\ell)}_{\mathrm{band},k} \right) \right),$$
where $\odot$ denotes the Hadamard (elementwise) product of the attention score vector with each column of the corresponding filter response. We further use the elementwise nonlinearity $\sigma$ in the equation above.
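The sketch below implements a single GSAN attention head following the equations above: a shared transformation, per-filter aggregation, node-wise attention logits, softmax normalization across filters, and the normalized weighted combination. The list of filter operators (mixing powers of the GCN matrix and wavelets) and all array names are illustrative assumptions.

```python
import numpy as np

def gsan_head(filters, H, Theta, a, slope=0.2):
    """filters: list of n x n operators (low-pass and band-pass);
    H: n x d input features; Theta: d x d' shared weights; a: length 2 d'."""
    H_bar = H @ Theta                                   # shared transformation
    responses, scores = [], []
    for F in filters:
        H_f = F @ H_bar                                 # aggregation by this filter
        e = np.concatenate([H_bar, H_f], axis=1) @ a    # node-wise attention logits
        e = np.where(e > 0, e, slope * e)               # LeakyReLU
        responses.append(H_f)
        scores.append(e)
    E = np.stack(scores, axis=0)                        # (num_filters, n)
    A = np.exp(E - E.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)                # softmax across filters, per node
    out = sum(A[i][:, None] * responses[i] for i in range(len(filters)))
    return np.abs(out / len(filters))                   # normalize by filter count, apply |.|
```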
Multi-head Attention. Similar to other applications of attention mechanisms [35], we use multi-head attention to stabilize the training, thus yielding as output of the aggregation module (by a slight abuse of notation)
$$H^{(\ell)} := \Big\Vert_{h=1}^{\gamma} H^{(\ell)}_{\mathrm{head},h}, \tag{19}$$
concatenating $\gamma$ attention heads. As explained above, each attention head has individually trained parameters $\Theta^{(\ell)}$ and $a$, and outputs a filter response $H^{(\ell)}_{\mathrm{head},h}$. Similar to Sc-GCN, this concatenation results in wide node representations that are further refined by the graph residual convolution.
Configuration. For GSAN, we set $C_{\mathrm{low}} = C_{\mathrm{band}} = 3$, giving the model balanced access to both low-pass and band-pass information. The aggregation process of the attention layer is shown in Fig. 3, where the band-pass channels are three first-order scattering transformations $U_{p_1}, U_{p_2}, U_{p_3}$ with $p_1 = (1)$, $p_2 = (2)$ and $p_3 = (3)$. The number of attention heads $\gamma$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.
5 Theory of Hybrid Scattering Networks
In this section, we compare the descriptive power of the scattering transform to a class of GNNs defined below, which we refer to as aggregate-combine GNNs (AC-GNNs). We note that in practice, AC-GNNs are typically implemented with relatively narrow receptive fields in order to avoid oversmoothing. However, this leads to the underreaching problem, as discussed in Sections 1 and 3.5. The scattering transform, on the other hand, is able to utilize a large receptive field while avoiding oversmoothing. Therefore, it is able to simultaneously overcome both the oversmoothing and underreaching problems.
The family of AC-GNNs includes many popular GNN architectures such as GCN [5], GIN [36] and GraphSAGE [4]. However, importantly, we note that it does not include the scattering transform (because AC-GNNs only use local – one-step neighborhood – aggregations). We also note that this definition of an AC-GNN is quite similar to analogous notions used in, e.g., [32, 36].
Definition 1 (Aggregate-Combine GNN).
An aggregate-combine GNN (AC-GNN) is a network where the update rule for each layer is defined on the node level according to
$$H^{(\ell)}[v] := \sigma\!\left( \operatorname{COMBINE}^{(\ell)}\!\left( H^{(\ell-1)}[v],\; \operatorname{AGG}^{(\ell)}\!\left( \left\{\!\!\left\{ H^{(\ell-1)}[u] : u \in \mathcal{N}(v) \right\}\!\!\right\} \right) \right) \right) \tag{20}$$
for an arbitrary function $\operatorname{COMBINE}^{(\ell)} : \mathbb{R}^{d_{\ell-1}} \times \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$, an (optional) activation function $\sigma$ applied elementwise, and $\operatorname{AGG}^{(\ell)}$ any function mapping a multi-set of vectors in $\mathbb{R}^{d_{\ell-1}}$ (i.e., $\{\!\{ H^{(\ell-1)}[u] : u \in \mathcal{N}(v) \}\!\}$) to a vector in $\mathbb{R}^{d_{\ell-1}}$.
As discussed in Section 3.3, the aggregations used in many popular AC-GNNs effectively act as localized averaging operators and focus primarily on low-frequency information. Therefore, deep AC-GNNs have the undesirable property of oversmoothing the input features if too many layers are used. As a result, most AC-GNNs typically only use two or three layers in order to avoid the oversmoothing problem. To understand this at an intuitive level, we follow the lead of [37] and consider a simplified version of GCN in which $\sigma$ is the identity, the weight matrix is given by $\Theta^{(\ell)} = I$, and the renormalization trick is not applied. In this case, the filter used in GCN is given by $I_n + D^{-1/2} W D^{-1/2} = 2 I_n - \mathcal{L}$, and so we have $\hat{g}[i] = 2 - \lambda_i$ in Eq. 1. This implies that if $x$ is any vector which is orthogonal to the lead eigenvector $q_1$, we have that the representation of $x$ after $\ell$ layers satisfies
$$\left\| \left( 2 I_n - \mathcal{L} \right)^{\ell} x \right\|_2 \;\leq\; \left( 2 - \lambda_2 \right)^{\ell} \left\| x \right\|_2.$$
Since, for a connected graph, $2 - \lambda_2 < 2$ while the component along $q_1$ is amplified by a factor of $2^{\ell}$, the output of a deep (simplified) GCN will converge exponentially fast (after normalization) to the projection of the input features onto the bottom eigenspace, which is why GCNs with many layers typically achieve poor performance.
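A short numerical illustration of this argument, on an arbitrary toy graph: repeatedly applying the simplified filter drives any (normalized) signal toward the lead eigenvector.

```python
import numpy as np

# Oversmoothing demo: repeated application of the un-renormalized GCN filter
# 2 I - L_sym aligns a random signal with the lead eigenvector q_1.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_sym = np.eye(5) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
lam, Q = np.linalg.eigh(L_sym)
q1 = Q[:, 0]                                   # lead eigenvector (lambda_1 = 0)

x = np.random.randn(5)
for ell in [1, 5, 20]:
    y = np.linalg.matrix_power(2 * np.eye(5) - L_sym, ell) @ x
    y = y / np.linalg.norm(y)
    print(ell, abs(y @ q1))                    # alignment with q_1 approaches 1
```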
By contrast, the use of band-pass filters allows us to incorporate a larger receptive field without essentially projecting the input features onto the bottom eigenspace. Consider, for example, the following result from [22].
Proposition 1 (Proposition 2.2 of [22]).
The wavelet filter bank introduced in Eq. 10, together with the low-pass filter $\Phi_J$, is a non-expansive frame with respect to the degree-weighted norm defined by $\|x\|_{D^{-1}}^2 := \sum_{i=1}^{n} x[v_i]^2 / \deg(v_i)$, whose lower-frame bound is a universal constant independent of $J$ and the graph topology. That is, there exists a universal constant $c > 0$ such that
$$c\, \|x\|_{D^{-1}}^2 \;\leq\; \|\Phi_J x\|_{D^{-1}}^2 + \sum_{k=1}^{J} \|\Psi_k x\|_{D^{-1}}^2 \;\leq\; \|x\|_{D^{-1}}^2.$$
The fact that the lower bound does not depend on $J$ is important because the receptive field of the geometric scattering transform grows with $2^J$. Therefore, if we choose $J$ to be sufficiently large, the scattering transform as formulated in [22] is able to incorporate long-range interaction in the network while still preserving a substantial portion of the input signal energy. Our wavelet filter bank differs slightly from the one used in [22]. However, one may derive a result similar to Proposition 1, but where the lower bound depends on $J$ (but still does not depend on the graph). We refer the reader to the proof of Proposition 4.1 of [38] for details. (While [38] does not consider weighted norms, this technicality can be readily handled by imitating the proof of Proposition 2.2 of [22].) Unlike the scattering transform, in order to avoid oversmoothing, most AC-GNNs typically use networks with small receptive fields. While this does help avoid the problem of oversmoothing, it unfortunately creates the problem of underreaching. In the remainder of this section, we will focus on demonstrating that this underreaching diminishes the power of the network to discriminate different nodes in certain situations.
We will now begin our analysis of node discriminability (formal definitions are provided in Appendix A.1), i.e., the ability of a network to produce different representations of two nodes $v$ and $w$ in its hidden layers, in the context of the underreaching problem. We will let $N_k(v)$ denote the $k$-step node neighborhood of a node $v$ (including $v$ itself), and for $V' \subset V$ we will let $G[V']$ denote the corresponding induced subgraph. We will say two induced subgraphs $G[N_k(v)]$ and $G[N_k(w)]$ are isomorphic if they have identical geometric structure, and write $\pi : N_k(v) \to N_k(w)$ to indicate that $\pi$ is an isometry (geometry preserving map) between them. In the definition below, we introduce a class of graph-intrinsic node features that encode graph topology information. We will use these features as the input for GNN models in order to produce geometrically informed node representations.
Definition 2 (Intrinsic Node Features).
A nonzero node feature matrix $X \in \mathbb{R}^{n \times d}$ is $K$-intrinsic if for any $v, w \in V$ such that $G[N_K(v)]$ is isomorphic to $G[N_K(w)]$ via an isometry mapping $v$ to $w$, we have $X[v] = X[w]$.
These intrinsic node features encode important $K$-local geometric information in the sense that if $X[v] \neq X[w]$, then the $K$-step neighborhoods of $v$ and $w$ must have different geometric structure. To further understand this definition, we observe that the degree vector, or any (elementwise) function of it, is a one-intrinsic node feature matrix. Setting $X[v]$ to the average node degree of nodes in $N_{K-1}(v)$ yields $K$-intrinsic node features. As a slightly more complicated example, features with $X[v]$ the number of triangles contained in $G[N_K(v)]$ are also $K$-intrinsic. Fig. 4 illustrates how such node features can help distinguish nodes from different graph structures.
[Fig. 4: Examples of graph structures that can be distinguished by intrinsic node features such as node degrees, average neighborhood degrees, or triangle counts.]
As a concrete example, consider the task of predicting traffic on a network of Wikipedia articles, such as the Chameleon dataset [39], with nodes and edges corresponding to articles and hyperlinks. Here, intrinsic node features provide insights into the topological context of each node. A high degree suggests a widely cited article, which is likely to have a lot of traffic, while counting $k$-cliques or considering average degrees from nearby nodes can shed light on the density of the local graph region.
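The sketch below computes a few simple intrinsic node features from an adjacency matrix: node degrees (one-intrinsic), the average degree over the one-step neighborhood (two-intrinsic), and the number of triangles through each node (also one-intrinsic, since every such triangle lies in the induced one-step neighborhood). The particular feature choices are illustrative.

```python
import numpy as np

def intrinsic_features(W: np.ndarray) -> np.ndarray:
    """Return an n x 3 matrix of simple structure-derived node features."""
    A = (W > 0).astype(float)                     # unweighted adjacency pattern
    deg = A.sum(axis=1)
    # average degree over N_1(v), including v itself
    avg_nbr_deg = ((A + np.eye(len(deg))) @ deg) / (deg + 1.0)
    # triangles through each node: diag(A^3) / 2
    triangles = np.diagonal(np.linalg.matrix_power(A, 3)) / 2.0
    return np.stack([deg, avg_nbr_deg, triangles], axis=1)
```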
The following theorem characterizes situations where AC-GNNs cannot discriminate nodes based on the information represented in intrinsic node features. It can be seen as a formal formulation of the underreaching problem [32]. We note that an analogous result can also be established for graph attention networks, e.g., GAT [35] and GATv2 [40].
Theorem 1.
Let $L, K \in \mathbb{N}$ and $v, w \in V$ such that $G[N_{L+K}(v)]$ and $G[N_{L+K}(w)]$ are isomorphic. Let $X$ be any $K$-intrinsic node feature matrix and let $H^{(L)}$ be the output of an $L$-layer AC-GNN applied to $X$, with each layer as in Eq. 20. Then, $H^{(L)}[v] = H^{(L)}[w]$, and so $v$ and $w$ cannot be discriminated based on the filtered node features from an $L$-layer AC-GNN.
Theorem 1 is proved by using an induction argument to show that $H^{(\ell)}[u] = H^{(\ell)}[\pi(u)]$ for all $0 \leq \ell \leq L$ and all $u \in N_{L-\ell}(v)$. Full details can be found in Appendix A.2.
Next, we introduce a notion of structural difference that allows us to analyze situations where networks with wide receptive fields such as scattering networks have strictly greater discriminatory power than networks with smaller receptive fields.
Definition 3 (Structural Difference).
Notably, if the $K$-step neighborhoods of two nodes $v$ and $w$ are isomorphic and $X$ is any $K$-intrinsic node feature matrix, then no structural difference relative to $v$, $w$ and $X$ can be manifested at $v$. Theorem 2 stated below shows that a scattering network will be able to produce different representations of two nodes $v$ and $w$ if there are structural differences in their surrounding neighborhoods, assuming (among a few other mild assumptions) that a certain pathological case is avoided. The following notation and definition characterize this pathological situation, which arises, roughly speaking, when two sums coincide even though their individual terms do not. We note that although this condition seems complicated, it is highly unlikely to occur in practice.
Notation 1.
Let , , , such that . Assume that there exists at least one node in where a structural difference is manifested rel. to and node features . We define
and we fix the following notations:
- (i) Let and define the node set . Note that these are exactly the nodes in where a structural difference is manifested relative to and .
- (ii) Let be the number of paths of minimal length between and nodes in , and denote these by with , .
- (iii) Let , and define the generalized path from to as .
- (iv) For , define . For , and , define and set
Definition 4 (No Coincidental Correspondence).
Let and as in Notation 1. We say that exhibits no coincidental correspondence (no-cc) if in each non-empty , , there exists at least one such that .
Theorem 2.
As in Theorem 1, let $L, K \in \mathbb{N}$ and $v, w \in V$ such that $G[N_{L+K}(v)]$ and $G[N_{L+K}(w)]$ are isomorphic, and consider any $K$-intrinsic node feature matrix $X$. Further assume that there exists at least one structural difference rel. to $v$ and $w$ in $N_{L+K}(v)$, and let $R$ be the smallest positive integer such that a structural difference is manifested in $N_R(v)$. If the nonlinearity $\sigma$ is strictly monotonic and $X$ exhibits no coincidental correspondence rel. to $v$ and $w$, then one can define a scattering configuration such that scattering features defined as in Eq. 13 discriminate $v$ and $w$.
Theorem 2 provides a large class of situations where scattering filters are able to discriminate between two nodes even if networks with smaller receptive fields are not. In particular, even if the $(L+K)$-step neighborhoods of $v$ and $w$ are isomorphic, it is possible for there to exist a node $u$ in this neighborhood such that a $K$-intrinsic node feature takes different values at $u$ and $\pi(u)$. Theorem 2 shows that scattering will produce differing representations of $v$ and $w$, while Theorem 1 shows that an $L$-layer AC-GNN like GCN will not. In the remarks below, we will discuss minor variations of Theorem 2. In Appendix A.5, we will also present a modified version of Theorem 2, where the assumption of no coincidental correspondence is replaced by an assumption on the (generalized) path between the structural differences and the investigated node.
Remark 1.
The absolute value operation is not monotonic. Therefore, Theorem 2 cannot be directly applied in this case. However, the above result and all subsequent results can be extended to the case $\sigma(\cdot) = |\cdot|$ as long as certain pathological cases are avoided. Namely, Theorem 2 will remain valid as long as the features at the relevant nodes $u$ are assumed not to be the negatives of the features at $\pi(u)$, i.e., $X[u] \neq -X[\pi(u)]$, and similarly we must modify Definition 4 to exclude an analogous sign-flip case. We note that while this condition is complex, it is rarely violated in practice.
Remark 2.
We will provide a sketch of the proof of Theorem 2 here and provide details in the appendix. The key to the proof will be applying Lemma 1 stated below.
Lemma 1.
Let , , and such that . Assume there exists at least one node in where a structural difference is manifested relative to and to , and let be the smallest positive integer such that a structural difference is manifested in . Assume that exhibits no coincidental correspondence relative to and , and let be the generalized path as defined in Notation 1. Then, no structural difference is manifested in , relative to and filtered node features and there exists at least one node in and where a structural difference is manifested and .
Below, we provide a sketch of the main ideas of the proof. A complete proof is provided in Appendix A.3.
Proof Sketch.
We use induction to show that at every step the matrix propagates the information about the structural difference one node layer closer towards .
By Notation 1(i), we have . Therefore, the case is immediate. In the inductive step, it suffices to show
and for at least one and
for all under the assumption that these results are already valid for . This can be established by writing and using the inductive hypothesis together with the definition of .∎
We now use Lemma 1 to sketch the proof of Theorem 2. The complete proof is provided in the appendix A.4.
Proof Sketch for Theorem 2.
We need to show that we can choose the parameters in a way that guarantees . For simplicity, we set . In this case, since is strictly monotonic, and therefore injective, it suffices to show that we can construct with
(21) |
Using binary expansion, we may choose , , such that and set . Given , we define truncated paths and let for We also let be the empty path of length 0 and set
Recalling the generalized path defined in Notation 1, we will use induction to show that for at least one node in a structural difference is manifested, while no structural difference manifests in rel. to and for . Since and , this will imply Eq. 21 and thus prove the theorem. As with Lemma 1, the base case follows from Notation 1(i). In the inductive step, it suffices to show that
(22) |
for at least one and
(23) |
for all , under the assumption that these results are true for . Since and is increasing and therefore injective, it suffices to show that these results are preserved under multiplication by a wavelet matrix . This can be verified using Lemma 1 (with in place of ) and the inductive hypothesis. ∎
Remark 3.
In the proof of Theorem 2, for the sake of concreteness, we made specific choices of the path and the weight matrices. However, inspecting the proof, we see that our analysis may also be extended to other paths and that the choice of the weight matrix $\Theta$ does not matter as long as $\Theta$ is invertible. In particular, if the entries of $\Theta$ are generated i.i.d. at random from a continuous probability distribution, the conclusions of the theorem will hold with probability one [41]. We also note that Theorems 1 and 2 may be extended to the case where the update rule features a bias term as in Eq. 15 and 16, since adding a bias is an injective operation.
Results analogous to Theorem 2 can likely also be proven for deep AC-GNNs with subsequent layers. The proof would closely follow the idea of Lemma 1 and would likely be slightly simpler than the proof of Theorem 2. Indeed, inspecting the proof of Theorem 2, we see that part of the reason we need both Eq. 22 and Eq. 23 is that the definition of the wavelets involves subtraction. Therefore, establishing Eq. 23 could be skipped for an AC-GNN. However, we would need considerably stronger assumptions when working with a deep AC-GNN. Apart from an assumption analogous to no coincidental correspondence, we would additionally need injectivity of all functions that characterize its layers, following Definition 1. Moreover, scattering filters are a more practical choice for two major reasons. First, scattering filters nicely balance the trade-off between oversmoothing and underreaching and are able to utilize a broad receptive field while still preserving information contained in higher frequencies, as explained in Proposition 1 and the subsequent discussion. Second, scattering filters exhibit wide receptive fields with significantly fewer trained parameters.
6 Empirical Results
We now present our empirical results, starting with semi-supervised node classification. We first show that our Sc-GCN model [33] is able to overcome the oversmoothing and underreaching problems and exhibits superior performance to both a pure GCN and a pure scattering network. We then show that we can further improve performance, particularly on complex datasets, by using GSAN [34], which incorporates an attention mechanism. Finally, we also show that our models perform well on graph-level tasks.
6.1 Semi-Supervised Node Classification
Scattering GCN. We compare the performance of Sc-GCN to both the original GCN [5] as well as the closely related ChebNet [28] and a network which uses Gaussian random fields to propagate label information [42]. Additionally, we compare against [13], which uses a pure-scattering approach. Further, we also compare Sc-GCN to two methods designed, in part, to address the oversmoothing problem. In [9], this problem is directly addressed by using partially absorbing random walks [43] to slow the mixing of node features in regions of the graph with high connectivity. The GAT graph attention network from [17], on the other hand, addresses this problem indirectly via an attention mechanism which trains an adaptive node-wise weighting of the smoothing operation. Lastly, we compare to an SVM classifier which ignores the graph structure.
In our experiments, we used four popular datasets summarized in Tab. I. For full details on these datasets, please see, e.g., [44] for Citeseer, Cora, and Pubmed and [45] for DBLP. As discussed in [9], the oversmoothing problem is most acute in highly connected networks where information mixes rapidly and nodes quickly become indistinguishable. Therefore, we included networks with different sizes and connectivity structures and ordered the datasets by connectivity (with the most connected on the right). Consistent with our expectation, Tab. II and Fig. 5 show that the advantages of our network are most significant for the networks with high connectivity.
We implemented all of these methods based upon the code made available by the original authors, and, for fair comparison, we tuned and evaluated each of these methods using the standard splits provided for benchmark datasets. Whenever possible, we made sure that our results are consistent with previously reported accuracies. We tuned our method, both the hyperparameters and number of scattering and GCN channels, via a grid search (over the same set for all datasets), and we used the same cross-validation procedure for our method and the competing methods. In Appendix B, we provide further details and also perform an ablation study evaluating the importance of each component in our proposed architecture.
Dataset | Nodes | Edges | Features | Degrees (mean±std) | Edges/Nodes
---|---|---|---|---|---
Citeseer | 3,327 | 4,732 | 3,703 | 3.77±3.38 | 1.42
Cora | 2,708 | 5,429 | 1,433 | 4.90±5.22 | 2.00
Pubmed | 19,717 | 44,338 | 500 | 5.50±7.43 | 2.25
DBLP | 17,716 | 52,867 | 1,639 | 6.97±9.35 | 2.98
Model | Citeseer | Cora | Pubmed | DBLP |
---|---|---|---|---|
Sc-GCN (ours) | 71.7 | 84.2 | 79.4 | 81.5 |
GAT [17] | 72.5 | 83.0 | 79.0 | 66.1 |
Partially absorbing [9] | 71.2 | 81.7 | 79.2 | 56.9 |
GCN [5] | 70.3 | 81.5 | 79.0 | 72.0 |
Chebyshev [28] | 69.8 | 78.1 | 74.4 | 57.3 |
Label Propagation [42] | 58.2 | 77.3 | 71.0 | 53.0 |
Graph scattering [13] | 67.5 | 81.9 | 69.8 | 69.4 |
Node features (SVM) | 61.1 | 58.0 | 49.9 | 48.2 |
We report test classification accuracy in Tab. II, which shows that our approach outperforms other methods on three out of the four considered datasets. Only on Citeseer are we slightly outperformed by GAT. However, this is the dataset with the weakest connectivity structure (see Tab. I), while at the same time having the most informative node features (achieving 61.1% accuracy via linear SVM without considering graph information). In contrast, on DBLP, which exhibits the richest connectivity structure and least informative features (only 48.2% SVM accuracy), we significantly outperform GAT (an improvement of over 15%). Notably, GAT itself significantly outperforms all other methods (by 6.8% or more) apart from the graph scattering baseline [13].
[Fig. 5: Comparison on Citeseer, Cora, Pubmed and DBLP. Top: accuracy as the amount of training data is reduced from 100% to 20% of its original size. Bottom: evolution of the validation error over the training process.]
The task of semi-supervised learning is motivated by the fact that in real-world applications one typically only has labels for a small fraction of the nodes. Therefore, we next analyze the performance of our network with limited amounts of training data. Fig. 5 (top) plots the training accuracy of Sc-GCN, as well as the other networks (except the baselines), as the amount of training data varies from 20% to 100% of its original amount. Across all networks, we see that Sc-GCN exhibits the best performance when the training data is sufficiently limited. Moreover, we see that it outperforms GAT, which is the next best performing network in terms of overall accuracy (see Tab. II), on all datasets whenever the training data is reduced to 60% of its original size or less.
In Fig. 5 (bottom), we plot the evolution of the validation error during the training process. We note that Sc-GCN is able to reach a low validation error significantly faster than GAT, which is the next most competitive method in terms of overall accuracy. Several other methods do achieve low validation errors faster on several datasets, but their final accuracy is much lower than that of Sc-GCN. Moreover, on Pubmed, which is the largest dataset, Sc-GCN achieves a low validation error at about the same rate as GCN and at a significantly faster rate than all other methods.
Dataset | Classes | Nodes | Edges | Homophily | GCN | GAT | Sc-GCN | GSAN
---|---|---|---|---|---|---|---|---
Texas | 5 | 183 | 295 | 0.11 | 59.5 | 58.4 | 60.3 | 60.5 |
Chameleon | 5 | 2,277 | 31,421 | 0.23 | 28.2 | 42.9 | 51.2 | 61.2 |
CoraFull | 70 | 19,793 | 63,421 | 0.57 | 62.2 | 51.9 | 62.5 | 64.5 |
Wiki-CS | 10 | 11,701 | 216,123 | 0.65 | 77.2 | 77.7 | 78.1 | 78.6 |
Citeseer | 6 | 3,327 | 4,676 | 0.74 | 70.3 | 72.5 | 71.7 | 71.3 |
Pubmed | 3 | 19,717 | 44,327 | 0.80 | 79.0 | 79.0 | 79.4 | 79.8 |
Cora | 7 | 2,708 | 5,276 | 0.81 | 81.5 | 83.0 | 84.2 | 84.0 |
DBLP | 4 | 17,716 | 52,867 | 0.83 | 72.0 | 66.1 | 81.5 | 84.3 |
Geometric Scattering Attention Network. Having established the practical utility of the hybrid network Sc-GCN above, we next show that performance may be further improved by using GSAN, which builds upon Sc-GCN by incorporating an attention mechanism. We evaluate performance again for semi-supervised node classification and compare to two popular models (namely GCN [5] and GAT [17]), as well as the original Sc-GCN. We extend our experiments to eight benchmark datasets that we order according to increasing homophily, as shown in Tab. III. The homophily of a graph is quantified by the fraction of edges that connect nodes that share the same label. High homophily indicates a relatively smooth label distribution, while low homophily suggests a more complex labeling pattern. Texas and Chameleon are low-homophily datasets where nodes correspond to webpages and edges to links between them, with classes corresponding to webpage topic or monthly traffic (discretized into five levels), respectively [46]. Wiki-CS consists of nodes that represent computer science articles with the edges representing the hyperlinks [47]. CoraFull is the larger version of the Cora dataset [48]. The remaining four datasets (i.e., Cora, Citeseer, Pubmed, DBLP) are the citation networks that we already studied when evaluating Sc-GCN above. Notably, compared to the added datasets, the latter four exhibit relatively high homophily.
We split the datasets into train, validation and test sets. The hyperparameters (number of attention heads $\gamma$, residual parameter $\alpha$ and channel widths) are tuned via grid search using the validation set.
The results in Tab. III show that both Sc-GCN and GSAN outperform GCN and GAT on seven out of eight datasets. The advantages of GSAN over Sc-GCN are particularly notable on the medium-sized and large datasets Chameleon and CoraFull, which exhibit relatively low homophily. The most striking result is for Chameleon. Here, Sc-GCN clearly outperforms GCN and GAT (by at least 8.2%), but GSAN improves performance considerably further (by an additional 10%).
Analyzing the node-wise attention weights (see Sec. 4.2) allows us to understand the importance of different channels for different datasets. We consider here the ratio between the attention assigned to band-pass and to low-pass channels (over all attention heads). For that, we first sum up the attention over low-pass and band-pass channels, respectively, i.e., $\alpha_{\mathrm{low}} := \sum_{k} \alpha_{\mathrm{low},k}$ and $\alpha_{\mathrm{band}} := \sum_{k} \alpha_{\mathrm{band},k}$. Then, for each node $v$, we calculate the ratio $R[v] := \alpha_{\mathrm{band}}[v] / \alpha_{\mathrm{low}}[v]$. In Fig. 6, we present the distributions of $R$ for four datasets.
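For concreteness, the ratio described above can be computed from the per-node attention scores as follows; the array names and shapes are assumptions about how the scores are collected from a trained model.

```python
import numpy as np

def attention_ratio(alpha_low: np.ndarray, alpha_band: np.ndarray) -> np.ndarray:
    """alpha_low: (n, C_low), alpha_band: (n, C_band) node-to-filter attention scores."""
    return alpha_band.sum(axis=1) / alpha_low.sum(axis=1)   # R[v] per node
```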
[Fig. 6: Distributions of the node-wise attention ratio R between band-pass and low-pass channels for Citeseer, Wiki-CS, DBLP and Chameleon.]
Examining the distributions, we observe four different dataset profiles. Citeseer and Wiki-CS exhibit minimal spread of attention ratios. In Citeseer, most of the attention is given to the low-pass channels, while Wiki-CS shows very balanced channel usage. In contrast, DBLP and Chameleon exhibit large spreads. Although the majority of nodes in both datasets value low-pass filters more, some nodes pay up to five times the attention to band-pass channels compared to low-pass ones (and vice versa). Interestingly, the distribution for Chameleon has two peaks, suggesting two node populations that require different band information, which we interpret to be a driving factor for the improvement of GSAN on Chameleon (see Tab. III).
6.2 Graph Classification and Regression
Graph Classification. We use COLLAB and IMDB-BINARY as benchmark datasets. These are ego-networks extracted from scientific collaborations and movie collaborations, respectively [49]. COLLAB contains 5,000 graphs with three classes and IMDB-BINARY contains 1,000 graphs with two classes. We use an 80-10-10 split between training, validation and test sets. The datasets contain graph structures but no associated graph signals. In our work, we compute the eccentricity (for connected graphs), degree and clustering coefficient of each vertex, and use these as input signals to our network.
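A sketch of this preprocessing step using NetworkX is shown below; it assumes connected input graphs (as required by the eccentricity computation) and is not the exact pipeline used for the reported experiments.

```python
import networkx as nx
import numpy as np

def structural_node_features(G: nx.Graph) -> np.ndarray:
    """Eccentricity, degree and clustering coefficient per node, as an n x 3 matrix."""
    nodes = list(G.nodes())
    ecc = nx.eccentricity(G)       # assumes G is connected
    deg = dict(G.degree())
    clust = nx.clustering(G)
    return np.array([[ecc[v], deg[v], clust[v]] for v in nodes], dtype=float)
```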
We use a one-layer GSAN with 8 heads and 16 hidden units for COLLAB and a one-layer GSAN with 4 heads and 16 hidden units for IMDB-BINARY, followed by one graph residual convolutional layer. We then apply a Set2Set module [50] followed by a multi-layer perceptron (MLP). Set2Set is a graph pooling architecture [51] in which an LSTM-based module is applied iteratively to the graph signal for a number of processing steps. During each processing step, node-wise feature signals from graphs of variable size are integrated using an LSTM-based attention mechanism. In our model, the GSAN-based layer outputs a graph signal and the Set2Set layer outputs a graph-level feature vector for each graph. After the Set2Set layer, we apply an MLP on the graph-level signal for the classification task. We use a Set2Set layer with 3 processing steps and a two-layer MLP for graph pooling. We also provide a pure-scattering baseline (Sc-only) based on the approach proposed in [13] together with an SVM classifier. Our results in Tab. IV show that both proposed hybrid scattering networks improve over the compared baselines.
Dataset | GCN | GAT | Sc-only | Sc-GCN | GSAN |
---|---|---|---|---|---|
COLLAB | 0.592 | 0.523 | 0.640 | 0.690 | 0.704 |
IMDB-BINARY | 0.710 | 0.632 | 0.710 | 0.740 | 0.760
Dataset | GCN | GAT | Sc-only | Sc-GCN | GSAN |
---|---|---|---|---|---|
ZINC | 0.469 | 0.463 | 0.51 | 0.452 | 0.430 |
Lipophilicity | 1.05 | 0.950 | 1.19 | 1.03 | 1.02
Graph Regression. We use ZINC and Lipophilicity as benchmark datasets [52, 53]. For the ZINC dataset, the target is the penalized water-octanol partition coefficient of molecules. For the Lipophilicity dataset, the task is predicting the experimental octanol/water distribution coefficient of different molecules. ZINC contains 1,000 graphs and Lipophilicity contains 4,200 graphs. Graphs in ZINC have 75 input features and graphs in Lipophilicity have 9. We use an 80-10-10 split between training, validation and testing.
We again use a GSAN-based approach, this time consisting of two geometric scattering attention layers followed by one graph residual convolution layer. For ZINC, we use a 32-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 128 hidden units as the second layer. For Lipophilicity, we use a 4-head GSAN with 32 hidden units as the first scattering layer and a one-head GSAN with 96 hidden units as the second layer. The aggregation module uses an MLP, applied to the output node representations, producing a scalar output, followed by an average pooling over each molecule. Again, we provide a scattering-only baseline (Sc-only), where we use a modified version of GSAN with all low-pass channels removed, so that the network utilizes only scattering channels. Our results are presented in Tab. V, demonstrating the utility of our GSAN approach for this task as well.
7 Conclusion
Here, we studied GNN approaches for semi-supervised node classification and investigated some of the major limitations of today's architectures. Many popular models are known to essentially rely on low-pass filtering of graph signals. Based on this observation, we proposed a novel hybrid GNN framework that can leverage higher-frequency information not captured by traditional GCN models. Our construction is based on geometric scattering, a concept that was previously used mainly for graph classification, which we adapt to the node level. Our theoretical study suggests that scattering filters nicely balance the trade-off between oversmoothing and underreaching. In particular, we theoretically establish that the resulting scattering filters are more sensitive to the graph topology than a large class of GNN architectures, including GCN, when used in conjunction with node features that encode graph structure information. Empirically, we first evaluated our Sc-GCN model and demonstrated its efficacy in alleviating oversmoothing. We then evaluated our GSAN model, which further improves performance, particularly in more complex (low-homophily) settings, via an attention framework. We also provided evidence that the proposed hybrid scattering networks perform well in graph-level tasks, both in classification and regression. In future work, one might further explore the potential of features that carry graph structure information empirically, and analyze more datasets in the context of graph-level tasks.
Acknowledgments
The authors would like to thank Dongmian Zou for fruitful discussions. This work was partially funded by Fin-ML CREATE graduate studies scholarship for PhD [F.W.]; IVADO (Institut de valorisation des données) grant PRF-2019-3583139727, FRQNT (Fonds de recherche du Québec - Nature et technologies) grant 299376, Canada CIFAR AI Chair [G.W.]; NSF (National Science Foundation) grant DMS-1845856 [M.H.]; and NIH (National Institutes of Health) grant NIGMS-R01GM135929 [M.H.,G.W.]. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
References
- [1] M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
- [2] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv:2104.13478, 2021.
- [3] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning. PMLR, 2017, pp. 1263–1272.
- [4] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
- [5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2016.
- [6] A. M. Fout, “Protein interface prediction using graph convolutional networks,” Ph.D. dissertation, Colorado St. Univ., 2017.
- [7] N. De Cao and T. Kipf, “Molgan: An implicit generative model for small molecular graphs,” arXiv:1805.11973, 2018.
- [8] B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor, “Spectral multigraph networks for discovering and fusing relationships in molecules,” arXiv:1811.09595, 2018.
- [9] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). Association for the Advancement of Artificial Intelligence, 2018, pp. 3538–3545.
- [10] H. Nt and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv:1905.09550, 2019.
- [11] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” in International Conference on Learning Representations, 2019.
- [12] F. Gao, G. Wolf, and M. Hirn, “Geometric scattering for graph data analysis,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2122–2131.
- [13] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Applied and Computational Harmonic Analysis, vol. 49, no. 3, pp. 1046–1074, 2020.
- [14] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
- [15] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
- [16] R. R. Coifman and M. Maggioni, “Diffusion wavelets,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 53–94, 2006.
- [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
- [18] R. Liao, Z. Zhao, R. Urtasun, and R. Zemel, “Lanczosnet: Multi-scale deep graph convolutional networks,” in the 7th International Conference on Learning Representations (ICLR), 2019.
- [19] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan, “Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing,” in international conference on machine learning. PMLR, 2019, pp. 21–29.
- [20] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” arXiv:1904.07785, 2019.
- [21] F. Gama, A. Ribeiro, and J. Bruna, “Stability of graph scattering transforms,” in NeurIPS, 2019, pp. 8038–8048.
- [22] M. Perlmutter, F. Gao, G. Wolf, and M. Hirn, “Understanding graph neural networks with asymmetric geometric scattering transforms,” arXiv:1911.06253, 2019.
- [23] D. Zou and G. Lerman, “Encoding robust representation for graph generation,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–9.
- [24] E. Castro, A. Benz, A. Tong, G. Wolf, and S. Krishnaswamy, “Uncovering the folding landscape of rna secondary structure using deep graph embeddings,” in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 4519–4528.
- [25] D. Bhaskar, J. D. Grady, M. A. Perlmutter, and S. Krishnaswamy, “Molecular graph generation via geometric scattering,” arXiv:2110.06241, 2021.
- [26] A. Tong, F. Wenkel, K. MacDonald, S. Krishnaswamy, and G. Wolf, “Data-driven learning of geometric scattering networks,” arXiv:2010.02415, 2020.
- [27] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in 2nd International Conference on Learning Representations, ICLR 2014, 2014.
- [28] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 3844–3852.
- [29] R. Levie, W. Huang, L. Bucci, M. Bronstein, and G. Kutyniok, “Transferability of spectral graph convolutional neural networks,” Journal of Machine Learning Research, vol. 22, no. 272, pp. 1–59, 2021.
- [30] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using Lanczos method,” 2015, arXiv:1509.04537.
- [31] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.
- [32] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva, “The logical expressiveness of graph neural networks,” in International Conference on Learning Representations, 2020.
- [33] Y. Min, F. Wenkel, and G. Wolf, “Scattering gcn: Overcoming oversmoothness in graph convolutional networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- [34] ——, “Geometric scattering attention networks,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 8518–8522.
- [35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in the 6th International Conference on Learning Representations (ICLR), 2018.
- [36] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2018.
- [37] H. Wang and J. Leskovec, “Unifying graph convolutional neural networks and label propagation,” arXiv:2002.06755, 2020.
- [38] F. Gama, A. Ribeiro, and J. Bruna, “Diffusion scattering transforms on graphs,” arXiv:1806.08829, 2018.
- [39] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” Journal of Complex Networks, vol. 9, no. 2, p. cnab014, 2021.
- [40] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?” in International Conference on Learning Representations, 2021.
- [41] M. Rudelson, “Invertibility of random matrices: Norm of the inverse,” ANNALS OF MATHEMATICS, 2008.
- [42] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp. 912–919.
- [43] X.-M. Wu, Z. Li, A. M. So, J. Wright, and S. fu Chang, “Learning with partially absorbing random walks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 3077–3085.
- [44] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in International conference on machine learning. PMLR, 2016, pp. 40–48.
- [45] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” Network, vol. 11, no. 9, p. 12, 2016.
- [46] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” arXiv:2002.05287, 2020.
- [47] P. Mernyei and C. Cangea, “Wiki-cs: A wikipedia-based benchmark for graph neural networks,” arXiv:2007.02901, 2020.
- [48] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in International Conference on Learning Representations, 2018, pp. 1–13.
- [49] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
- [50] O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” arXiv:1511.06391, 2015.
- [51] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [52] J. J. Irwin and B. K. Shoichet, “ZINC – a free database of commercially available compounds for virtual screening,” Journal of chemical information and modeling, vol. 45, no. 1, pp. 177–182, 2005.
- [53] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science, vol. 9, no. 2, pp. 513–530, 2018.
- [54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
Frederik Wenkel received the B.Sc. degree in Mathematics and the M.Sc. degree in Mathematics at Technical University of Munich, in 2019. He is currently a Ph.D. candidate in Applied Mathematics at Université de Montréal (UdeM) and Mila (the Quebec AI institute), working on geometric deep learning. In particular, he is interested in graph neural networks and their applications in domains such as social networks, bio-chemistry and finance.
Yimeng Min is a Ph.D. student in Computer Science at Cornell University. He received his B.Sc. degree in Physics from Nanjing University and M.A.Sc. degree in Electrical and Computer Engineering from the University of Toronto. His research area is Artificial Intelligence with a focus on graph neural networks and constraint reasoning.
Matthew Hirn is an Associate Professor in the Dept. of CMSE and the Dept. of Mathematics at Michigan State University. He is the leader of the ComplEx Data Analysis Research (CEDAR) team, which develops new tools in computational harmonic analysis, machine learning, and data science for the analysis of complex, high dimensional data. Hirn received his B.A. in Mathematics from Cornell University and his Ph.D. in Mathematics from the University of Maryland, College Park. Before arriving at MSU, he held postdoctoral appointments in the Applied Math Program at Yale University and in the Department of Computer Science at École Normale Supérieure, Paris. He is the recipient of the Alfred P. Sloan Fellowship (2016), the DARPA Young Faculty Award (2016), the DARPA Director’s Fellowship (2018), and the NSF CAREER award (2019), and was designated a Kavli Fellow by the National Academy of Sciences (2017).
Michael Perlmutter is a Hedrick Assistant Adjunct Professor in the Dept. of Mathematics at UCLA. He has also held postdoctoral positions in the Dept. of CMSE at Michigan State University and in the Dept. of Statistics and Operations Research at the University of North Carolina at Chapel Hill. He earned his Ph.D. in Mathematics from Purdue University in 2016. His research uses the methods of applied probability and harmonic analysis to develop and analyze methods for data sets with geometric structure.
Guy Wolf is an associate professor in the Dept. of Mathematics and Statistics (DMS) at the Université de Montréal (UdeM), a core academic member of Mila (the Quebec AI institute), and holds a Canada CIFAR AI Chair. He is also affiliated with the CRM center of mathematical sciences and the IVADO institute of data valorization. He holds an M.Sc. and a Ph.D. in computer science from Tel Aviv University. Previously, he was a postdoctoral researcher (2013-2015) in the Dept. of Computer Science at École Normale Supérieure in Paris (France), and a Gibbs Assistant Professor (2015-2018) in the Applied Math. Program at Yale University. His research focuses on manifold learning and geometric deep learning for exploratory data analysis, including methods for dimensionality reduction, visualization, denoising, data augmentation, and coarse graining. Also, he is particularly interested in biomedical data exploration applications of such methods, e.g., in single cell genomics/proteomics and neuroscience.
Appendix A
A.1 Supplemental Definitions for Sec. 5
Here, we provide additional definitions and notations.
Definition 5 (Node Discriminability).
Let be a node feature matrix, and let for some graph filter . We say that discriminates two nodes based on if .
Definition 6 (K-step Neighborhood).
For (we use the conventions , and , for ), we define the K-step neighborhood of as (in particular and ). We further write (in particular ).
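For illustration, the following plain-Python sketch computes a K-step neighborhood under one standard reading of the definition, namely the set of nodes reachable from a given node within at most K edges; it is included for intuition only, and the precise conventions (e.g., for K = 0) are those stated in the definition above.

```python
from collections import deque

def k_step_neighborhood(adj, v, K):
    """Nodes within at most K edges of v, computed by breadth-first search.

    `adj` maps each node to an iterable of its neighbors. This illustrates one
    standard reading of the K-step neighborhood; the precise conventions follow
    the definition above.
    """
    visited = {v}
    frontier = deque([(v, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == K:
            continue
        for u in adj.get(node, ()):
            if u not in visited:
                visited.add(u)
                frontier.append((u, dist + 1))
    return visited

# Example on a path graph 0-1-2-3: nodes within 2 steps of node 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert k_step_neighborhood(adj, 0, 2) == {0, 1, 2}
```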
Definition 7 (Induced Subgraph).
Given a set , we refer to a subgraph of as an induced subgraph, if is the set of all edges that connect nodes from , i.e., . In this case, we will write .
Definition 8 (Graph Isomorphism).
A graph isomorphism between graphs and is a bijection such that if and only if . We say that and are -isomorphic and write . We also define for .
Definition 9 (Boundary and Interior of Induced Subgraph).
Let be an induced subgraph of . We define the boundary of as . Further, we define the interior of as . When convenient, we will also use the notation and .
Definition 10 (Spatial Support of Graph Filter).
We say that a graph filter has spatial support if the value of depends only on values of at nodes within steps of , i.e., if for any and any node feature matrices and , we have whenever for all .
A.2 Proof of Theorem 1
We use induction to show that for all and all .
: Let . By definition, we have and so the triangle inequality implies that . This implies that and therefore the assumption that is -intrinsic implies that .
: Now assume that for all , and let . Since , we have . Moreover, since the degree is 1-intrinsic, we have and therefore . Since , we have . Therefore, since agg is a function defined on a multi-set, the result will follow from Eq. 20 once we show that for all . Towards this end, we note that for all by the inductive assumption and the fact that . Similarly, since the degree is 1-intrinsic, we have and , and thus the proof is complete. ∎
A.3 Proof of Lemma 1
We will proceed by induction over .
: Using the notation established in Notation 1, we have that , and as noted in (i), this is exactly the set of nodes in where a structural difference manifests rel. to and . Also note that for according to (iv) of the notation. Thus the lemma holds for .
: Assume Lemma 1 to hold for some . Thus, by the definition of structural difference, and as for , our claim is equivalent to showing
(24) |
and for at least one and
(25) |
for all .
Since , we may use Eq. 8 to see that Eq. 24 is equivalent to
(26) |
for at least one . One may check that . Therefore, inductively applying Eq. 25 (with in place of ) yields that . Additionally, the inductive assumption implies that for all . Therefore, Eq. 26 is equivalent to
(27) |
This indeed holds for at least one as Eq. 27 is equivalent to
(28) |
and because exhibits no-cc with according to the inductive hypothesis. This completes the proof of Eq. 24.
Similarly Eq. 25 is equivalent to showing
(29) |
for all . We first note that , and that , since otherwise we would have . Therefore, the inductive hypothesis implies . Moreover, no can be an element of , since otherwise would be an element of , which is a contradiction. Therefore, the inductive assumption implies for all . We have already noted that for all . Thus, Eq. 29 holds and the proof is complete. ∎
A.4 Proof of Theorem 2
We need to show that we can choose the parameters in a way that guarantees . For simplicity, we set . In this case, since is strictly monotonic, and therefore injective, it suffices to show that we can construct such that
(30) |
Using binary expansion, we may choose , , such that . We will show that Eq. 30 holds for . For , let and let , where denotes the empty path of length 0 and
Recall the generalized path defined in Notation 1 (iii). We will use induction to show that, for , there exists at least one node in where a structural difference is manifested, while no structural difference manifests in rel. to and . Since and , this claim will imply Eq. 30 and thus prove the theorem.
: Analogous to the proof of Lemma 1, using the notation established in Notation 1 (ii), we have that , which are exactly the nodes in where a structural difference manifests. Thus the claim holds for .
: We now assume the result holds for . Since the degree is 1-intrinsic and , we have that for all . Therefore, the claim is equivalent to showing
(31) |
for at least one and
(32) |
for all .
By the inductive assumption, Eq. 31 and Eq. 32 hold with in place of , and since is injective, they continue to hold when is replaced with . Therefore, we may apply Lemma 1 with in place of , and . By the inductive assumption, as defined in Lemma 1 is given by . Therefore, and . Hence, there exists at least one where a structural difference is manifested, while for all , no structural difference is manifested rel. to and , i.e.,
(33) |
and
(34) |
Next, we again apply Lemma 1 with in place of , but this time with . We observe now that and thus . Therefore, no structural difference is manifested at either or rel. to and , i.e.,
(35) |
and
(36) |
Together, Eq. 33-34 and Eq. 35-36 imply and , which completes the proof. ∎
A.5 Modified Variant of Theorem 2
Theorem 3.
Similar to Theorem 2, let , and such that , and consider any K-intrinsic node feature matrix . Suppose there exist nodes where a structural difference rel. to and is manifested. If the nonlinearity is strictly monotonic and there exists a unique with minimal distance from and a unique shortest path between and , we can define a scattering configuration such that scattering features defined as in Eq. 13 discriminate and .
In Theorem 3, we replace the assumption of no coincidental correspondence (Definition 4) in Theorem 2 by the assumption of a unique shortest path between and some node , which is the unique nearest node from , where a structural difference is manifested.
Proof of Theorem 3.
The only difference between Theorem 3 and Theorem 2 is that in Theorem 2 we assume that there is no coincidental correspondence whereas here we assume that there is a unique with minimal distance from and that there is a unique shortest path between and . Inspecting the proof of Theorem 2, we see that the only spot where we used the assumption of no coincidental correspondence was when we invoked Lemma 1. Therefore, it suffices to show that the conclusion of Lemma 1 holds under our revised assumptions. Moreover, inspecting the proof of Lemma 1, we see that the only spot we use the assumption of no coincidental correspondence was to establish Eq. 28. Thus, it suffices to show that Eq. 28 still holds.
Let be the generalized path from to as defined in Notation 1, and note that our assumption of a unique shortest path implies that each consists of a single node , i.e., . Since the assumption of no coincidental correspondence is not used in the base case, we can use the inductive hypothesis that is the unique element of where a structural difference manifests with respect to . We are then left to show that this implies that Eq. 28 holds. But Eq. 28 simplifies to , which is true according to the inductive hypothesis. ∎
Appendix B
B.1 Technical Details
TABLE VI: Selected Sc-GCN hyperparameters for each dataset.

Citeseer | 0.50 | 4 | 10 | 10 | 10 | 9 | 30
Cora | 0.35 | 4 | 10 | 10 | 10 | 11 | 6
Pubmed | 1.00 | 4 | 10 | 10 | 10 | 13 | 14
DBLP (1st layer) | 1.00 | 4 | 10 | 10 | 10 | 30 | 30
DBLP (2nd layer) | 0.10 | 1 | 40 | 20 | 20 | 20 | 20
TABLE VII: Selected GSAN hyperparameters for each dataset.

Dataset | | Number of heads | Channel width
---|---|---|---
Texas | 1.5 | 4 | 128
Chameleon | 0.2 | 64 | 32
CoraFull | 0.5 | 20 | 128
Wiki-CS | 0.3 | 20 | 20
Citeseer | 0.1 | 8 | 64
Pubmed | 0.1 | 50 | 64
Cora | 0.1 | 50 | 64
DBLP | 0.2 | 16 | 128
As with most neural networks, when implementing Sc-GCN and GSAN, one must make several architecture choices and tune several hyperparameters. In our experiments with Sc-GCN, we used either one or two hybrid layers, each composed of three GCN channels and two scattering channels, followed by a single residual convolution layer. This fairly simple setup kept the tuning process manageable and was sufficient to obtain strong results, which outperformed numerous competing methods. However, it is likely that performance can be further improved by using wider or deeper networks. For Cora, Citeseer and Pubmed, we used only one hybrid layer, as preliminary experiments indicated that the addition of a second one was not cost-effective (in light of the additional complexity created by the need to tune more hyperparameters). For DBLP, two layers were used due to a significant increase in performance. We note that even with a single hybrid layer our model achieves test accuracy (compared to with two layers) and still significantly outperforms GAT () and the other methods (below ).
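To make the hybrid-layer structure concrete, the sketch below combines three low-pass channels (powers of a lazy random-walk matrix) with two band-pass channels built from dyadic diffusion wavelets of the form P^{2^{j-1}} − P^{2^j}, followed by an absolute value raised to a power q. This is a schematic illustration of the filter bank only, under the assumption of standard geometric scattering wavelets; learned weight matrices, biases, and the attention mechanism of GSAN are omitted, and the exact configurations we used are those reported in Tab. VI and Tab. VII.

```python
import numpy as np

def lazy_random_walk(adj):
    """Lazy random-walk matrix P = 1/2 (I + A D^{-1}); assumes all degrees are positive."""
    deg = adj.sum(axis=0)
    return 0.5 * (np.eye(len(adj)) + adj / deg)

def hybrid_layer(adj, x, gcn_orders=(1, 2, 3), wavelet_scales=(1, 2), q=1.0):
    """Schematic hybrid filter bank: low-pass channels P^k x and band-pass
    channels |Psi_j x|^q with Psi_j = P^{2^(j-1)} - P^{2^j}, concatenated
    feature-wise. Learned per-channel transformations are omitted."""
    P = lazy_random_walk(adj)
    channels = []
    for k in gcn_orders:                        # low-pass (GCN-like) channels
        channels.append(np.linalg.matrix_power(P, k) @ x)
    for j in wavelet_scales:                    # band-pass (scattering) channels
        Psi = np.linalg.matrix_power(P, 2 ** (j - 1)) - np.linalg.matrix_power(P, 2 ** j)
        channels.append(np.abs(Psi @ x) ** q)
    return np.concatenate(channels, axis=1)

# Example: cycle graph on 4 nodes with one feature per node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.arange(4, dtype=float).reshape(-1, 1)
print(hybrid_layer(A, X).shape)  # (4, 5): 3 low-pass + 2 band-pass channels
```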
To illustrate the utility of our model on large and challenging datasets, we now compare against GCN and GAT on two OGB benchmark node classification tasks. Our results are shown in Tab. VIII. GSAN achieves 73.9% on ogbn-proteins using 3,832 parameters and 72.3% on ogbn-arxiv using 70,760 parameters. GCN achieves 72.5% on ogbn-proteins using 110,120 parameters and 71.7% on ogbn-arxiv using 96,880 parameters, while GAT achieves 73.6% on ogbn-proteins using 1,238,554 parameters and 71.2% on ogbn-arxiv using 950,620 parameters. This suggests that our model can be more parameter-efficient on large datasets, especially on ogbn-proteins, where we use only 3.5% of GCN’s parameter count. Furthermore, GSAN achieves 76.2% on ogbn-proteins using 31,856 parameters when we increase its width. Thus, our network is able to obtain better performance with fewer learned parameters.
TABLE VIII: Results on the OGB node classification benchmarks.

Dataset | #nodes | #edges | GCN | GAT | Sc-GCN | GSAN
---|---|---|---|---|---|---
ogbn-proteins | 1.3e5 | 4.0e7 | 0.725 | 0.736 | 0.742 | 0.739
ogbn-arxiv | 1.7e5 | 1.1e6 | 0.717 | 0.712 | 0.719 | 0.723
Validation & testing procedure: All tests were done using train-validation-test splits of the datasets. We used the validation accuracy to tune the hyperparameters and reported the test accuracy in the comparison tables. To ensure fair comparison, we used the same splits for all methods. On Citeseer, Cora and Pubmed, we used the same settings as in [5], and followed the standard practices used in other works which have used these datasets. For DBLP, as far as we know, no common standard is established in the literature. Here, we used a ratio of between train, validation, and test for all node-classification datasets. For the datasets for graph classification and regression, we used a ratio of .
Hyperparameter tuning: We tuned our hyperparameters on each set using a grid search and selected the setting which yields the highest accuracy on the validation set.
In Sc-GCN, we used grid search to tune the parameter used in the residual convolution, the exponent used in the nonlinearity, the scattering channel configurations (i.e., the scales used in these two channels), and the widths of the channels in the network, i.e., the hidden dimensions. The results of this tuning process are presented in Tab. VI. For DBLP, the hybrid-layer parameters are shared between the two layers in order to simplify the tuning process. This tuning was generally less exhaustive than the process used for the other three datasets, since our method significantly outperformed all other methods even with limited tuning.
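A minimal sketch of this tuning loop is given below; the parameter names and value grids are illustrative placeholders rather than the exact grids we searched, and `train_and_validate` stands in for a full training run.

```python
from itertools import product

# Illustrative hyperparameter grid (placeholder names and values).
grid = {
    "residual_param": [0.1, 0.35, 0.5, 1.0],  # parameter of the residual convolution
    "exponent": [1, 4],                       # exponent of the nonlinearity
    "hidden_dim": [10, 20, 40],               # channel widths
}

def grid_search(train_and_validate, grid):
    """Return the configuration with the highest validation accuracy.

    `train_and_validate(config)` is assumed to train the model with the given
    hyperparameters and return its accuracy on the validation split.
    """
    best_config, best_acc = None, float("-inf")
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        acc = train_and_validate(config)
        if acc > best_acc:
            best_config, best_acc = config, acc
    return best_config, best_acc
```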
 | 0.01 | 0.1 | 0.5 | 1.0 | 3.0 | 5.0
---|---|---|---|---|---|---
Accuracy | | | | | |
For GSAN, the hyperparameter selection is less intricate, which, in addition to improved performance, is one of the main advantages compared to Sc-GCN. In particular, we use the same scattering configurations and channel widths across all channels in a given layer. We do not tune the scattering configurations: we always set and always use the wavelets and in our three scattering channels. Therefore, the hyperparameters simplify to , a (universal) channel width, and the number of attention heads. These parameters may differ across layers if the architecture uses several of them. The results of this tuning process are presented in Tab. VII.
Hardware & software environment: All experiments were run on the same HPC cluster with an Intel i7-6850K CPU and an NVIDIA TITAN X Pascal GPU. Both Sc-GCN and GSAN were implemented using PyTorch [54]. Implementations of all other methods were taken directly from the code accompanying their publications.
B.2 Ablation Study
The two primary novelties in Sc-GCN and GSAN are the scattering channels (i.e., and ) and the residual convolution (controlled by the hyperparameter ). To explore their contribution and the dependence of the networks’ performance on the hyperparameters, Tab. IX shows classification results on the ogbn-proteins dataset over multiple scattering channel configurations. For simplicity, we focus on ogbn-proteins, but note that similar results are observed on the other datasets.
Channel | ||||||
---|---|---|---|---|---|---|
Accuracy | 71.22 | 71.23 | 72.16 | 71.47 | 72.98 | 70.75 |
To examine the importance of the parameter used in the residual graph convolution layer, we observe that setting effectively eliminates this layer (since then ). On the other hand, increasing makes the filtering effect stronger, approaching random-walk filtering as . We evaluate the effect of this parameter on classification accuracy. Our results, displayed in Tab. IX, indicate that increasing to non-negligible nonzero values improves classification performance, which we interpret to be due to the removal of high-frequency noise. However, when is increased further (in particular when in this case), the smoothing provided by this layer degrades the performance to a level close to that of the traditional GCN [5].
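For intuition, one filter with exactly the limiting behavior described above (identity when the parameter is zero, random-walk smoothing as it grows) is the convex-type combination below, where P denotes a random-walk matrix and α the residual-convolution parameter; we include it only as an illustrative form consistent with the description, not as the precise filter definition.

```latex
A_{\mathrm{res}}(\alpha) \;=\; \frac{1}{1+\alpha}\,\bigl(I + \alpha P\bigr),
\qquad
A_{\mathrm{res}}(0) = I,
\qquad
\lim_{\alpha \to \infty} A_{\mathrm{res}}(\alpha) = P .
```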
Finally, to examine the relative importance of the low-pass and band-pass information, we provide an ablation study of the impact of each channel. We focus on the ogbn-proteins dataset, set and use the best channel configuration, which achieved 73.93% accuracy when all channels were used. We then remove each of the band-pass or low-pass channels individually from the network and reevaluate its performance with the remaining four channels. Our results are displayed in Tab. X. They indicate that the information captured by both the band-pass and the low-pass channels is important and that eliminating either decreases performance. In particular, we note that eliminating has a major impact on the accuracy.
B.3 Implementation
Python code accompanying this work is available on github.com/dms-net/scatteringGCN and github.com/dms-net/Attention-based-Scattering.