Dirichlet Energy Constrained Learning for Deep Graph Neural Networks

Kaixiong Zhou
Texas A&M University
zkxiong@tamu.edu
&Xiao Huang
The Hong Kong Polytechnic University
xiaohuang@comp.polyu.edu.hk
&Daochen Zha
Texas A&M University
daochen.zha@tamu.edu
&Rui Chen
Samsung Research America
rui.chen1@samsung.com
&Li Li
Samsung Research America
li.li1@samsung.com

&Soo-Hyun Choi
Samsung Electronics
soohyunc@gmail.com

&Xia Hu
Texas A&M University
xiahu@tamu.edu

Abstract

Graph neural networks (GNNs) integrate deep architectures and topological structure modeling in an effective way. However, the performance of existing GNNs would decrease significantly when they stack many layers, because of the over-smoothing issue. Node embeddings tend to converge to similar vectors when GNNs keep recursively aggregating the representations of neighbors. To enable deep GNNs, several methods have been explored recently. But they are developed from either techniques in convolutional neural networks or heuristic strategies. There is no generalizable and theoretical principle to guide the design of deep GNNs. To this end, we analyze the bottleneck of deep GNNs by leveraging the Dirichlet energy of node embeddings, and propose a generalizable principle to guide the training of deep GNNs. Based on it, a novel deep GNN framework – EGNN is designed. It could provide lower and upper constraints in terms of Dirichlet energy at each layer to avoid over-smoothing. Experimental results demonstrate that EGNN achieves state-of-the-art performance by using deep layers.

1 Introduction

Graph neural networks (GNNs) [1] are promising deep learning tools to analyze networked data, such as social networks [2, 3, 4], academic networks [5, 6, 7], and molecular graphs [8, 9, 10]. Based on spatial graph convolutions, GNNs apply a recursive aggregation mechanism to update the representation of each node by incorporating representations of itself and its neighbors [11]. A variety of GNN variations have been explored for different real-world networks and applications [12, 13].

A key limitation of GNNs is that when we stack many layers, the performance would decrease significantly. Experiments show that GNNs often achieve the best performance with less than $3$ layers [14, 12]. As the layer number increases, the node representations will converge to indistinguishable vectors due to the recursive neighborhood aggregation and non-linear activation [15, 16]. Such phenomenon is recognized as over-smoothing issue [17, 18, 19, 20, 21]. It prevents the stacking of many layers and modeling the dependencies to high-order neighbors.

A number of algorithms have been proposed to alleviate the over-smoothing issue and construct deep GNNs, including embedding normalization [22, 23, 24], residual connection [25, 26, 27] and random data augmentation [28, 29, 30]. However, some of them are motivated directly by techniques in convolutional neural networks (CNNs) [31], such as the embedding normalization and residual connection. Others are based on heuristic strategies, such as random embedding propagation [29] and dropping edge [28]. Most of them only achieve comparable or even worse performance compared to their shallow models. Recently, a metric of Dirichlet energy has been applied to quantify the over-smoothing [32], which is based on measuring node pair distances. With the increasing of layers, the Dirichlet energy converges to zero since node embeddings become close to each other. But there is a lack of methods to leverage this metric to overcome the over-smoothing issue.

Therefore, it remains a non-trivial task to train a deep GNN architecture due to three challenges. First, the existing efforts are developed from diverse perspectives, without a generalizable principle and analysis. The abundance of these components also makes the design of deep GNNs challenging, i.e., how should we choose a suitable one or combinations for real-world scenarios? Second, even if an effective indicator of over-smoothing is given, it is hard to theoretically analyze the bottleneck and propose a generalizable principle to guide the training of deep GNNs. Third, even if theoretical guidance is given, it may be difficult to be utilized and implemented to train GNNs empirically.

To this end, in this paper, we target to develop a generalizable framework with a theoretical basis, to handle the over-smoothing issue and enable effective deep GNN architectures. In particular, we will investigate two research questions. 1) Is there a theoretical and generalizable principle to guide the architecture design and training of deep GNNs? 2) How can we develop an effective architecture to achieve state-of-the-art performance by stacking a large number of layers? Following these questions, we make three major contributions as follows.

•

We propose a generalizable principle – Dirichlet energy constrained learning, to guide the training of deep GNNs by regularizing Dirichlet energy. Without proper training, the Dirichlet energy would be either too small due to the over-smoothing issue, or too large when the node embeddings are over-separating. Our principle carefully defines an appropriate range of Dirichlet energy at each layer. Being regularized within this range, a deep GNN model could be trained by jointly optimizing the task loss and energy value.
•

We design a novel deep architecture – Energetic Graph Neural Networks (EGNN). It follows the proposed principle and could efficiently learn an optimal Dirichlet energy. It consists of three components, i.e., orthogonal weight controlling, lower-bounded residual connection, and shifted ReLU (SReLU) activation. The trainable weights at graph convolutional layers are orthogonally initialized as diagonal matrices, whose diagonal values are regularized to meet the upper energy limit during training. The residual connection strength is determined by the lower energy limit to avoid the over-smoothing. While the widely-used ReLU activation causes the extra loss of Dirichlet energy, the linear mapping worsens the learning ability of GNN. We apply SReLU with a trainable shift to provide a trade-off between the non-linear and linear mappings.
•

We show that the proposed principle and EGNN can well explain most of the existing techniques for deep GNNs. Empirical results demonstrate that EGNN could be easily trained to reach $64$ layers and achieves surprisingly competitive performance on benchmarks.

2 Problem Statement

Notations.

Given an undirected graph consisting of $n$ nodes, it is represented as $G=(A,X)$ , where $A\in\mathbb{R}^{n\times n}$ denotes the adjacency matrix and $X\in\mathbb{R}^{n\times d}$ denotes the feature matrix. Let $\tilde{A}:=A+I_{n}$ and $\tilde{D}:=D+I_{n}$ be the adjacency and degree matrix of the graph augmented with self-loops. The augmented normalized Laplacian is then given by $\tilde{\Delta}:=I_{n}-\tilde{P}$ , where $\tilde{P}:=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ is an augmented normalized adjacency matrix used for the neighborhood aggregation in GNN models.

Node classification task.

GNNs have been adopted in many applications [6, 33, 34]. Without loss of generality, we take node classification as an example. Given a graph $G=(A,X)$ and a set of its nodes with labels for training, the goal is to predict the labels of nodes in a test set.

We now use the graph convolutional network (GCN) [14] as a typical example, to illustrate how traditional GNNs perform the network analysis task. Formally, the layer-wise forward-propagation operation in GCN at the $k$ -th layer is defined as:

X^{(k)}=\sigma(\tilde{P}X^{(k-1)}W^{(k)}).

(1)

$X^{(k)}$ and $X^{(k-1)}$ are node embedding matrices at layers $k$ and $k-1$ , respectively; $W^{(k)}\in\mathbb{R}^{d\times d}$ denotes trainable weights used for feature transformation; $\sigma$ denotes an activation function such as ReLU; $X^{(0)}=X$ at the initial layer of GCN. The embeddings at the final layer are optimized with a node classification loss function, e.g., cross-entropy loss. The recursive neighborhood aggregation in Eq. (1) will make node embeddings similar to each other as the number of layer $k$ increases. This property, i.e., over-smoothing, prevents traditional GNNs from exploring neighbors many hops away. In practice, dependencies to high-order neighbors are important to the node classification. The traditional shallow GNNs may have sub-optimal performances in the downstream tasks [15, 27].

3 Dirichlet Energy Constrained Learning

In this paper, we aim to develop an effective principle to alleviate the over-smoothing issue and enable deep GNNs to leverage the high-order neighbors. We first theoretically analyze the over-smoothing issue, and then provide a principle to explain the key constraint in training deep GNNs.

Node pair distance has been widely adopted to quantify the over-smoothing based on embedding similarities [18, 22]. Among the series of distance metrics, Dirichlet energy is simple and expressive for the over-smoothing analysis [32]. Thus, we adopt Dirichlet energy and formally define it as below.

Definition 1.

Given node embedding matrix $X^{(k)}=[x^{(k)}_{1},\cdots,x^{(k)}_{n}]^{\top}\in\mathbb{R}^{n\times d}$ learned from GCN at the $k$ -th layer, the Dirichlet energy $E(X^{(k)})$ is defined as follows:

E(X^{(k)})=\text{tr}({X^{(k)}}^{\top}\tilde{\Delta}X^{(k)})=\frac{1}{2}\sum a_{ij}||\frac{x^{(k)}_{i}}{\sqrt{1+d_{i}}}-\frac{x^{(k)}_{j}}{\sqrt{1+d_{j}}}||_{2}^{2},

(2)

where $\text{tr}(\cdot)$ denotes trace of a matrix; $a_{ij}$ is edge weight given by the $(i,j)$ -th element in matrix $A$ ; $d_{i}$ is node degree given by the $i$ -th diagonal element in matrix $D$ . Dirichlet energy reveals the embedding smoothness with the weighted node pair distance. While a smaller value of $E(X^{(k)})$ is highly related to the over-smoothing, a larger one indicates that the node embeddings are over-separating even for those nodes with the same label. Considering the node classification task, one would prefer to have an appropriate Dirichlet energy at each layer to separate the nodes of different classes while keeping those of the same class close. However, under some conditions, the upper bound of Dirichlet energy is theoretically proved to converge to $0$ in the limit of infinite layers [32]. In other words, all nodes converge to a trivial fixed point in the embedding space.

Based on the previous analysis, we derive the corresponding lower bound and revisit the over-smoothing/separating problem from the model design and training perspectives. To simplify the derivation process, we remove the non-linear activation $\sigma$ , and re-express GCN as: $X^{(k)}=P\cdots PXW^{(1)}\cdots W^{(k)}$ . The impact of non-linear function will be considered in the model design.

Lemma 1.

The Dirichlet energy at the $k$ -th layer is bounded as follows:

(1-\lambda_{1})^{2}s^{(k)}_{\min}E(X^{(k-1)})\leq E(X^{(k)})\leq(1-\lambda_{0})^{2}s^{(k)}_{\max}E(X^{(k-1)}).

(3)

The detailed proof is provided in the Appendix. $\lambda_{1}$ and $\lambda_{0}$ are the non-zero eigenvalues of matrix $\tilde{\Delta}$ that are most close to values $1$ and $0$ , respectively. $s^{(k)}_{\min}$ and $s^{(k)}_{\max}$ are the squares of minimum and maximum singular values of weight $W^{(k)}$ , respectively. Note that the eigenvalues of $\tilde{\Delta}$ vary with the real-world graphs, and locate within range $[0,2)$ . We relax the above bounds as blow.

Lemma 2.

The lower and upper bounds of Dirichlet energy at the $k$ -th layer could be relaxed as:

0\leq E(X^{(k)})\leq s^{(k)}_{\max}E(X^{(k-1)}).

(4)

Besides the uncontrollable eigenvalues determined by the underlying graph, it is shown that the Dirichlet energy can be either too small or too large without proper design and training on weight $W^{(k)}$ . On one hand, based on the common Glorot initialization [35] and L2 regularization, we empirically find that some of the weight matrices approximate to zero in a deep GCN. The corresponding square singular values are hence close to zero in these intermediate layers. That means the Dirichlet energy will become zero at the higher layers of GCN and causes the over-smoothing issue. On the other hand, without the proper weight initialization and regularization, a large $s^{(k)}_{\max}$ may lead to the energy explosion and the over-separating.

The Dirichlet energy plays a key role in training a deep GNN model. However, the optimal value of Dirichlet energy varies in the different layers and applications. It is hard to be specified ahead and then enforces the node representation learning. Therefore, we propose a principle – Dirichlet energy constrained learning, defined in Proposition 1. It provides appropriate lower and upper limits of Dirichlet energy. Regularized by such a given range, a deep GNN model could be trained by jointly optimizing the node classification loss and Dirichlet energy at each layer.

Proposition 1.

Dirichlet energy constrained learning defines the lower & upper limits at layer $k$ as:

c_{\min}E(X^{(k-1)})\leq E(X^{(k)})\leq c_{\max}E(X^{(0)}).

(5)

We apply the transformed initial feature through trainable function $f$ : $X^{(0)}=f(X)\in\mathbb{R}^{n\times d}$ . Both $c_{\min}$ and $c_{\max}$ are positive hyperparameters. From interval $(0,1)$ , hyperparameter $c_{\min}$ is selected by satisfying constraint of $E(X^{(k)})\geq c^{k}_{\min}E(X^{(0)})>0$ . In such a way, the over-smoothing is overcome since the Dirichlet energies of all the layers are larger than appropriate limits related to $c^{k}_{\min}$ . Compared with the initial transformed feature $X^{(0)}$ , the intermediate node embeddings of the same class are expected to be merged closely to have a smaller Dirichlet energy and facilitate the downstream applications. Therefore, we exploit the upper limit $c_{\max}E(X^{(0)})$ to avoid over-separating, where $c_{\max}$ is usually selected from $(0,1]$ . In the experiment part, we show that the optimal energy value accompanied with the minimized classification loss locates within the above range at each layer. Furthermore, hyperparameters $c^{k}_{\min}$ and $c_{\max}$ could be easily selected from large appropriate scopes, which do not affect the model performance.

Given both the low and upper limits, an intuitive solution to search the optimal energy is to train GNN by optimizing the following constrained problem:

\begin{array}[]{ll}\min&\mathcal{L}_{\mathrm{task}}+\gamma\sum_{k}||W^{(k)}||_{F},\\ \mathrm{s.t.}&c_{\min}E(X^{(k-1)})\leq E(X^{(k)})\leq c_{\max}E(X^{(0)}),\mathrm{for}\ k=1,\cdots,K.\end{array}

(6)

$\mathcal{L}_{\mathrm{task}}$ denotes the cross-entropy loss of node classification task; $K$ is layer number of GNN; $||\cdot||_{F}$ denotes Frobenius norm of a matrix; and $\gamma$ is loss hyperparameter.

4 Energetic Graph Neural Networks - EGNN

It is non-trivial to optimize Problem (6) due to the expensive computation of $E(X^{(k)})$ . Furthermore, the numerous constraints make the problem a very complex optimization hyper-planes, at which the raw task objective tends to fall into local optimums. Instead of directly optimizing Problem (6), we propose an efficient model EGNN to satisfy the constrained learning from three perspectives: weight controlling, residual connection and activation function. We introduce them one by one as follows.

4.1 Orthogonal Weight Controlling

According to Lemma $2$ , without regularizing the maximum square singular value $s^{(k)}_{\max}$ of matrix $W^{(k)}$ , the upper bound of Dirichlet energy can be larger than the upper limit, i.e., $s^{(k)}_{\max}E(X^{(k-1)})>c_{\max}E(X^{(0)})$ . That means the Dirichlet energy of a layer may break the upper limit of constrained learning, and makes Problem (6) infeasible. In this section, we show how to satisfy such limit by controlling the singular values during weight initialization and model regularization.

Orthogonal initialization.

Since the widely-used initialization methods (e.g., Glorot initialization) fail to restrict the scopes of singular values, we adopt the orthogonal approach that initializes trainable weight $W^{(k)}$ as a diagonal matrix with explicit singular values [36]. To restrict $s^{(k)}_{\max}$ and meet the constrained learning, we apply an equality constraint of $s^{(k)}_{\max}E(X^{(k-1)})=c_{\max}E(X^{(0)})$ at each layer. Based on this condition, we derive Proposition $2$ to initialize those weights $W^{(k)}$ and their square singular values for all the layers of EGNN, and give Lemma $3$ to show how we can satisfy the upper limit of constrained learning. The detailed derivation and proof are listed in Appendix.

Proposition 2.

At the first layer, weight $W^{(1)}$ is initialized as a diagonal matrix $\sqrt{c_{\max}}\cdot I_{d}$ , where $I_{d}$ is identity matrix with dimension $d$ and the square singular values are $c_{\max}$ . At the higher layer $k>1$ , weight $W^{(k)}$ is initialized with an identity matrix $I_{d}$ , where the square singular values are $1$ .

Lemma 3.

Based on the above orthogonal initialization, at the starting point of training, the Dirichlet energy of EGNN satisfies the upper limit at each layer $k$ : $E(X^{(k)})\leq c_{\max}E(X^{(0)})$ .

Orthogonal regularization.

However, without proper regularization, the initialized weights cannot guarantee they will still satisfy the constrained learning during model training. Therefore, we propose a training loss that penalizes the distances between the trainable weights and initialized weights $\sqrt{c_{\max}}I_{d}$ or $I_{d}$ . To be specific, we modify the optimization problem (6) as follows:

\min\mathcal{L}_{\mathrm{task}}+\gamma||W^{(1)}-\sqrt{c_{\max}}I_{d}||_{F}+\gamma\sum^{K}_{k=2}||W^{(k)}-I_{d}||_{F}.

(7)

Comparing with the original problem (6), we instead use the weight penalization to meet the upper limit of constrained learning, and make the model training efficient. While a larger $\gamma$ highly regularizes the trainable weights around the initialized ones to satisfy the constrained learning, a smaller $\gamma$ assigns the model more freedom to adapt to task data and optimize the node classification loss.

4.2 Lower-bounded Residual Connection

Although the square singular values are initialized and regularized properly, we may still fail to guarantee the lower limit of constrained learning in some specific graphs. According to Lemma $1$ , the lower bound of Dirichlet energy is $(1-\lambda_{1})^{2}s^{(k)}_{\min}E(X^{(k-1)})$ . In the real-world applications, eigenvalue $\lambda_{1}$ may exactly equal to $1$ and relaxes the lower bound as zero as shown in Lemma $2$ . For example, in Erdős–Rényi graph with dense connections [37], the eigenvalues of matrix $\tilde{\Delta}$ converge to $1$ with high probability [16]. Even though $s^{(k)}_{\min}>0$ , the Dirichlet energy can be smaller than the lower limit and leads to the over-smoothing. To tackle this problem, we adopt residual connections to the initial layer $X^{(0)}$ and the previous layer $X^{(k-1)}$ . To be specific, we define the residual graph convolutions as:

X^{(k)}=\sigma([(1-c_{\min})\tilde{P}X^{(k-1)}+\alpha X^{(k-1)}+\beta X^{(0)}]W^{(k)}).

(8)

$\alpha$ and $\beta$ are residual connection strengths determined by the lower limit of constrained learning, i.e., $\alpha+\beta=c_{\min}$ . We are aware that the residual technique has been used before to set up deep GNNs [25, 38, 27]. However, they either apply the whole residual components, or combine an arbitrary fraction without theoretical insight. Instead, we use an appropriate residual connection according to the lower limit of Dirichlet energy. In the experiment part, we show that while a strong residual connection overwhelms information in the higher layers and reduces the classification performance, a weak one will lead to the over-smoothing. In the following, we justify that both the lower and upper limits in the constrained learning can be satisfied with the proposed lower-bounded residual connection. The detailed proofs are provided in Appendix.

Lemma 4.

Suppose that $c_{\max}\geq c_{\min}/(2c_{\min}-1)^{2}$ . Based upon the orthogonal controlling and residual connection, the Dirichlet energy of initialized EGNN is larger than the lower limit at each layer $k$ , i.e., $E(X^{(k)})\geq c_{\min}E(X^{(k-1)})$ .

Lemma 5.

Suppose that $\sqrt{c_{\max}}\geq\frac{\beta}{(1-c_{\min})\lambda_{0}+\beta}$ . Being augmented with the orthogonal controlling and residual connection, the Dirichlet energy of initialized EGNN is smaller than the upper limit at each layer $k$ , i.e., $E(X^{(k)})\leq c_{\max}E(X^{(0)})$ .

4.3 SReLU Activation

Note that the previous theoretical analysis and model design are conducted by ignoring the activation function, which is usually given by ReLU in GNN. In this section, we first theoretically discuss the impact of ReLU on the Dirichlet energy, and then demonstrate the appropriate choice of activation.

Lemma 6.

We have $E(\sigma(X^{(k)}))\leq E(X^{(k)})$ if activation function $\sigma$ is ReLU or Leaky-ReLU [32].

It is shown that the application of ReLU further reduces the Dirichlet energy, since the negative embeddings are non-linearly mapped to zero. Although the trainable weights and residual connections are properly designed, the declining Dirichlet energy may violate the lower limit. On the other hand, a simplified GNN with linear identity activation will have limited model learning ability although it does not change the energy value. For example, simple graph convolution (SGC) model achieves comparable performance with the traditional GCN only with careful hyperparameter tuning [39]. We propose to apply SReLU to achieve a good trade-off between the non-linear and linear activations [40, 41]. SReLU is defined element-wisely as:

\sigma(X^{(k)})=\max(b,X^{(k)}),

(9)

where $b$ is a trainable shift shared for each feature dimension of $X^{(k)}$ . SReLU interpolates between the non-linearity and linearity depending on shift $b$ . While the linear identity activation is approximated if $b$ is close to $\infty$ , the non-linear mapping is activated if node embedding is smaller than the specific $b$ . In our experiments, we initialize $b$ with a negative value to provide an initial trade-off, and adapt it to the given task by back-propagating the training loss.

4.4 Connections to Previous Work

Recently, various techniques have been explored to enable deep GNNs [15, 23, 29]. Some of them are designed heuristically from diverse perspectives, and others are analogous to CNN components without theoretical insight tailored to graph analytics. In the following, we show how our principle and EGNN explain the existing algorithms, and expect to provide reliable theoretical guidance to the future design of deep GNNs.

Embedding normalization.

The general normalization layers, such as pair [22], batch [24] and group [23] normalizations, have been used to set up deep GNNs. The pair normalization (PairNorm) aims to keep the node pair distances as a constant in the different layers, and hence relieves the over-smoothing. Motivated from CNNs, the batch and group normalizations re-scale the node embeddings of a batch and a group, respectively. Similar to the operation in PairNorm, they learn to maintain the node pair distance in the node batch or group. The adopted Dirichlet energy is also a variant of the node pair distance. The existing normalization methods can be regarded as training GNN model with a constant energy constraint. However, this will prevent GNN from optimizing the energy as analyzed in Section 3. We instead regularize it within the lower and upper energy limits, and let model discover the optimum.

Dropping edge.

As a data augmentation method, dropping edge (DropEdge) randomly masks a fraction of edges at each epoch [28]. It makes graph connections sparse and relieves the over-smoothing by reducing information propagation. Specially, the contribution of DropEdge could be explained from the perspective of Dirichlet energy. In Erdős–Rényi graph, eigenvalue $\lambda_{0}$ converges to $1$ if the graph connections are more and more dense [16]. DropEdge reduces the value of $\lambda_{0}$ , and helps improve the upper bound of Dirichlet energy $(1-\lambda_{0})^{2}s_{\max}^{(k)}E(X^{(k-1)})$ to slow down the energy decreasing speed. In the extreme case where all the edges are dropped in any a graph, Laplacian $\tilde{\Delta}$ becomes a zero matrix. As a result, we have eigenvalue $\lambda_{0}$ of zero and maximize the upper bound. In practice, the dropping rate has to be determined carefully depending on various tasks. Instead, our principle assigns model freedom to optimize the Dirichlet energy within a large and appropriate range.

Residual connection.

Motivated from CNNs, residual connection has been applied to preserve the previous node embeddings and relieve the over-smoothing. Especially, the embedding from the last layer is reused and combined completely in related work [25, 42, 43]. A fraction of the initial embedding is preserved in model GCNII [27] and APPNP [44]. Networks JKNet [26] and DAGNN [45] aggregate all the previous embeddings at the final layers. The existing work uses the residua connection empirically. In this work, we derive and explain the residual connection to guarantee the lower limit of Dirichlet energy. By modifying hyperparameter $c_{\min}$ , our EGNN can easily evolve to the existing deep residual GNNs, such as GCNII and APPNP.

Model simplification.

Model SGC [39] removes all the activation and trainable weights to avoid over-fitting issue, and simplifies the training of deep GNNs. It is equivalent to EGNN with $c_{max}=1$ and $b=-\infty$ , where weights $W^{(k)}$ and shifts $b$ are remained as constants. Such simplification will reduce the model learning ability. As shown in Eq. (7), we adopt loss hyperparameter $\gamma$ to learn the trade-off between maintaining the orthogonal weights or updating them to model data characteristics.

5 Experiments

In this section, we empirically evaluate the effectiveness of EGNN on real-world datasets. We aim to answer the following questions. Q1: How does our EGNN compare with the state-of-the-art deep GNN models? Q2: Whether or not the Dirichlet energy at each layer of EGNN satisfies the constrained learning? Q3: How does each component of EGNN affect the model performance? Q4: How do the model hyperparameters impact the performance of EGNN?

5.1 Experiment Setup

Datasets.

Following the practice of previous work, we evaluate EGNN by performing node classification on four benchmark datasets: Cora, Pubmed [46], Coauthor-Physics [47] and Ogbn-arxiv [48]. The detailed statistics are listed in Appendix.

Baselines.

We consider seven state-of-the-art baselines: GCN [14], PairNorm [22], DropEdge [28], SGC [39], JKNet [26], APPNP [44], and GCNII [27]. They are implemented based on their open repositories. The detailed descriptions of these baselines are provided in Appendix.

Implementation.

We implement all the baselines using Pytorch Geometric [49] based on their official implementations. The model hyperparameters are reused according to the public papers or are fine-tuned by ourselves if the classification accuracy could be further improved. Specially, we apply max-pooling to obtain the final node representation at the last layer of JKNet. In Ogbn-arxiv, we additionally include batch normalization between the successive layers in all the considered GNN models except PairNorm. Although more tricks (e.g., label reusing and linear transformation as listed in leader board) could be applied to improve node classification in Ogbn-arxiv, we focus on comparing the original GNN models in enabling deep layer stacking. The training hyperparameters are carefully set by following the previous common setting and are listed in Appendix.

We implement our EGNN upon GCN, except for the components of weight initialization and regularization, lower-bounded residual connection and SReLU. We choose hyperparameters $c_{\max}$ , $c_{\min}$ , $\gamma$ and $b$ based on the validation set. For the weight initialization, we set $c_{\max}$ to be $1$ for all the datasets; that is, the trainable weights are initialized as identity matrices at all the graph convolutional layers. The loss hyperparameter $\gamma$ is $20$ in Cora, Pubmed and Coauthor-Physics to strictly regularize towards the orthogonal matrix; and it is $10^{-4}$ in Ogbn-arxiv to improve the model’s learning ability. For the lower-bounded residual connection, we choose residual strength $c_{\min}$ from range $[0.1,0.75]$ and list the details in Appendix. The trainable shift $b$ is initialized with $-10$ in Cora and Pubmed; it is initialized to $-5$ and $-1$ in Coauthor-Physics and Ogbn-arxiv, respectively. We also study these hyperparameters in the following experiments. All the experiment results are the averages of $10$ runs.

Table 1: Node classification accuracies in percentage with various depths:

2

16

32/64

. The highest accuracy at each column is in bold.

Datasets	Cora			Pubmed			Coauthors-Physics			Ogbn-arxiv
Layer Num	$2$	$16$	$64$	$2$	$16$	$64$	$2$	$16$	$32$	$2$	$16$	$32$
GCN	$82.5$	$22.0$	$21.9$	$\bf{79.7}$	$37.9$	$38.4$	$92.4$	$13.5$	$13.1$	$70.4$	$70.6$	$68.5$
PairNorm	$74.5$	$44.2$	$14.2$	$73.8$	$68.6$	$60.0$	$86.3$	$84.0$	$83.6$	$67.6$	$70.4$	$69.6$
DropEdge	$82.7$	$23.6$	$25.2$	$79.6$	$45.9$	$40.0$	$92.5$	$85.1$	$35.2$	$70.5$	$70.4$	$67.1$
SGC	$75.7$	$72.1$	$24.1$	$76.1$	$70.2$	$38.2$	$92.2$	$91.7$	$84.8$	$69.2$	$64.0$	$59.5$
JKNet	$80.8$	$74.5$	$70.0$	$77.2$	$70.0$	$66.1$	$\bf{92.7}$	$92.2$	$91.6$	$\bf{70.6}$	$71.8$	$71.4$
APPNP	$82.9$	$79.4$	$79.5$	$79.3$	$77.1$	$76.8$	$92.3$	$92.7$	$92.6$	$68.3$	$65.5$	$60.7$
GCNII	$82.4$	$84.6$	$85.4$	$77.5$	$79.8$	$79.9$	$92.5$	$92.9$	$92.9$	$70.1$	$71.5$	$70.5$
EGNN	$\bf{83.2}$	$\bf{85.4}$	$\bf{85.7}$	$79.2$	$\bf{80.0}$	$\bf{80.1}$	$92.6$	$\bf{93.1}$	$\bf{93.3}$	$68.4$	$\bf{72.7}$	$\bf{72.7}$

5.2 Experiment Results

Node classification results.

To answer research question Q1, Table 1 summarizes the test classification accuracies. Each accuracy is averaged over $10$ random trials. We report the results with $2/16/64$ layers for Cora and Pubmed, and $2/16/32$ layers for Coauthor-Physics and Ogbn-arxiv.

We observe that our EGNN generally outperforms all the baselines across the four datasets, especially in the deep cases ( $K\geq 16$ ). Notably, the node classification accuracy is consistently improved with the layer stacking in EGNN until $K=32$ or $64$ , which demonstrates the benefits of deep graph neural architecture to leverage neighbors multiple hops away. While the state-of-the-art models PairNorm, DropEdge, SGC, JKNet, and APPNP alleviate the over-smoothing issue to some extend, their performances still drop with the increasing of layers. Most of their $32$ / $64$ -layer models are even worse than their corresponding shallow versions. As the most competitive deep architecture in literature, GCNII augments the transformation matrix as $(1-\phi)I_{d}+\phi W^{(k)}$ , where $0<\phi<1$ is a hyperparameter to preserve the identity mapping and enhance the minimum singular value of the augmented weight. Instead of explicitly defining the strength of identity mapping, we propose the orthogonal weight initialization based on the upper limit of Dirichlet energy and apply the orthogonal weight regularization. Based on Eq. (7), EGNN automatically learns the optimal trade-off between identity mapping and task adaption. Furthermore, we use SReLU activation and the residual connection to theoretically control the lower limit of Dirichlet energy. The experimental results show that EGNN not only outperforms GCNII in the small graphs Cora, Pubmed and Coauthor-Physics, but also delivers significantly superior performance in the large graph Obgn-arxiv, achieving $3.1\%$ improvement over GCNII with 32 layers.

Refer to caption — Figure 1: Dirichelet energy variation with layers in Cora (Left) and Pubmed (Right). The upper and lower denotes the energy limits.

Dirichelet energy visualization.

To answer research question Q2, we show the Dirichlet energy at each layer of a $64$ -layer EGNN in Cora and Pubmed datasets in Figure 1. We only plot GCN and GCNII for better visualization purposes. For other methods, the Dirichlet energy is either close to zero or overly large due to the over-smoothing issue or over-separating issue of node embeddings, respectively.

It is shown that the Dirichlet energies of EGNN are strictly constrained within the range determined by the lower and upper limits of the constrained learning. Due to the over-smoothing issue in GCN, all the node embeddings converge to zero vectors. GCNII has comparable or smaller Dirichlet energy by carefully and explicitly designing both the initial connection and identity mapping strengths. In contrast, our EGNN only gives the appropriate limits of Dirichlet energy, and let the model learn the optimal energy at each layer for a specific task. The following hyperparameter studies will show that the values of $c_{\min}$ and $c_{\max}$ could be easily selected from a large appropriate range.

Table 2: Ablation studies on weight initialization, lower limit

c_{\min}

and activation function of EGNN.

Component	Type	Cora			Pubmed			Coauthors-Physics			Ogbn-arxiv
Component	Type	$2$	$16$	$64$	$2$	$16$	$64$	$2$	$16$	$32$	$2$	$16$	$32$
Weight	Glorot	$77.8$	$40.2$	$23.6$	$68.4$	$62.6$	$60.2$	$92.6$	$81.7$	$73.4$	$68.4$	$\bf{72.8}$	$72.7$
initialization	Orthogonal	$83.2$	$85.4$	$\bf{85.7}$	$\bf{79.2}$	$\bf{80.0}$	$\bf{80.1}$	$92.6$	$\bf{93.1}$	$\bf{93.3}$	$68.4$	$72.7$	$\bf{72.7}$
Lower	$0.$	$\bf{83.6}$	$68.6$	$12.9$	$78.9$	$77.1$	$44.1$	$\bf{92.8}$	$91.4$	$79.7$	$\bf{70.9}$	$69.4$	$62.4$
limit setting	$0.1\sim 0.75$	$83.2$	$85.4$	$\bf{85.7}$	$\bf{79.2}$	$\bf{80.0}$	$\bf{80.1}$	$92.6$	$\bf{93.1}$	$\bf{93.3}$	$68.4$	$72.7$	$\bf{72.7}$
$c_{\min}$	$0.95$	$65.4$	$72.0$	$71.5$	$74.0$	$75.3$	$75.7$	$89.4$	$90.4$	$90.5$	$56.5$	$66.8$	$69.5$
Activation	Linear	$83.1$	$\bf{85.6}$	$85.5$	$79.2$	$79.9$	$79.9$	$92.6$	$93.1$	$93.1$	$64.8$	$72.5$	$71.0$
	SReLU	$83.2$	$85.4$	$\bf{85.7}$	$\bf{79.2}$	$\bf{80.0}$	$\bf{80.1}$	$92.6$	$\bf{93.1}$	$\bf{93.3}$	$68.4$	$72.7$	$\bf{72.7}$
	ReLU	$83.1$	$85.2$	$85.0$	$79.1$	$79.7$	$79.9$	$92.6$	$93.1$	$93.1$	$68.6$	$72.4$	$72.4$

Ablation studies of EGNN components.

To demonstrate how each component affects the training of graph neural architecture and answer research question Q3, we perform the ablation experiments with EGNN on all the datasets. For the component of orthogonal weight initialization and regularization, we compare and replace them with the traditional Glorot initialization and Frobenius norm regularization as shown in Eq. (6). Considering the component of lower-bounded residual connection, we vary the lower limit hyperparameter $c_{\min}$ from $0$ , $0.1\sim 0.75$ and $0.95$ . Within the range of $0.1\sim 0.75$ , the adoption of specific values is specified for each dataset in Appendix. The component of the activation function is studied from candidates of linear identity activation, SReLU, and ReLU. Table 2 reports the results of the above ablation studies.

The orthogonal weight initialization and regularization are crucial to train the deep graph neural architecture. In Cora, Pubmed, and Coauthor-Physics, Glorot initialization and Frobenius norm regularization fail to control the singular values of trainable weights, which may lead to overly large or small Dirichlet energy and affect the node classification performance. In Ogbn-arxiv, the input node features are described by dense word embeddings of a paper [50], where the trainable weights in GNN are required to capture data statistics and optimize the classification task. EGNN applies a small loss hyperparameter $\gamma$ of $10^{-4}$ to let the model adapt to the given task, which is equivalent to the traditional regularization. Therefore, the two approaches have comparable performances.

An appropriate lower limit could enable the deep EGNN. While the Dirichlet energy may approach zero without the residual connection, the overwhelming residual information with $c_{\min}=0.95$ prevents the higher layer from learning the new neighborhood information. Within the large and appropriate range of $[0.1,0.75]$ , $c_{\min}$ could be easily selected to achieve superior performance.

Activation SReLU performs slightly better than the linear identity activation and ReLU. This is because SReLU could automatically learn the trade-off between linear and non-linear activations, which prevents the significant dropping of Dirichlet energy and ensures the model learning ability.

Hyperparameter analysis.

To understand the hyperparameter impacts on a $64$ -layer EGNN and answer research question Q4, we conduct experiments with different values of initial shift $b$ , loss factor $\gamma$ , lower limit factor $c_{\min}$ and upper one $c_{\max}$ . We present the hyperparameter study in Figure 2 for Cora, and show the others with similar tendencies in Appendix.

We observe that our method is not sensitive to the choices of $b$ , $\gamma$ , $c_{\min}$ and $c_{\max}$ in a wide range. The initial shift value should be $b\leq 0$ , in order to avoid the overly nonlinear mapping and Dirichlet energy damage. The loss factor within the range $\gamma\geq 1$ is applied to regularize the trainable weight around the orthogonal matrix to avoid the explosion or vanishing of Dirichlet energy. $c_{\min}$ within the appropriate range $[0.1,0.75]$ allows the model to expand neighborhood size and preserve residual information to avoid the over-smoothing. As shown in Figure 1, since energy $E(X^{(k)})$ at the hidden layer is much smaller than $E(X^{(0)})$ from the input layer, we could easily satisfy the upper limit with $c_{\max}$ in a large range $[0.2,1]$ . Given these large hyperparameter ranges, EGNN could be easily trained with deep layers.

6 Conclusions

In this paper, we propose a Dirichlet energy constrained learning principle to show the importance of regularizing the Dirichlet energy at each layer within reasonable lower and upper limits. Such energy constraint is theoretically proved to help avoid the over-smoothing and over-separating issues. We then design EGNN based on our theoretical results and empirically demonstrate that the constrained learning plays a key role in guiding the design and training of deep graph neural architecture. The detailed analysis is presented to illustrate how our principle connects and combines the previous deep methods. The experiments on benchmarks show that EGNN could be easily trained to achieve superior node classification performances with deep layer stacking. We believe that the constrained learning principle will help discover deeper and more powerful GNNs in the future.

References

[1] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[2] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.
[3] Xiao Huang, Qingquan Song, Yuening Li, and Xia Hu. Graph recurrent networks with attributed random walks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 732–740, 2019.
[4] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The World Wide Web Conference, pages 417–426, 2019.
[5] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1416–1424, 2018.
[6] Hongyang Gao and Shuiwang Ji. Graph u-nets. In international conference on machine learning, pages 2083–2092. PMLR, 2019.
[7] Kaixiong Zhou, Qingquan Song, Xiao Huang, and Xia Hu. Auto-gnn: Neural architecture search of graph neural networks. arXiv preprint arXiv:1909.03184, 2019.
[8] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017.
[9] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
[10] Federico Monti, Michael M Bronstein, and Xavier Bresson. Geometric matrix completion with recurrent multi-graph neural networks. arXiv preprint arXiv:1704.06803, 2017.
[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeuIPS, pages 1024–1034, 2017.
[12] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv, 1(2), 2017.
[13] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[14] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
[15] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[16] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
[17] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.
[18] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211, 2019.
[19] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020.
[20] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations. https://openreview. net/forum, 2021.
[21] Wenbing Huang, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. Tackling over-smoothing for general graph convolutional networks. arXiv e-prints, pages arXiv–2008, 2020.
[22] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223, 2019.
[23] Kaixiong Zhou, Xiao Huang, Yuening Li, Daochen Zha, Rui Chen, and Xia Hu. Towards deeper graph neural networks with differentiable group normalization. Advances in Neural Information Processing Systems, 33, 2020.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[25] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9267–9276, 2019.
[26] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5453–5462. PMLR, 2018.
[27] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. arXiv preprint arXiv:2007.02133, 2020.
[28] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations. https://openreview. net/forum, 2020.
[29] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in Neural Information Processing Systems, 33, 2020.
[30] Arman Hasanzadeh, Ehsan Hajiramezanali, Shahin Boluki, Mingyuan Zhou, Nick Duffield, Krishna Narayanan, and Xiaoning Qian. Bayesian graph neural networks with adaptive connection sampling. In International Conference on Machine Learning, pages 4094–4104. PMLR, 2020.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[32] Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020.
[33] Kaixiong Zhou, Qingquan Song, Xiao Huang, Daochen Zha, Na Zou, and Xia Hu. Multi-channel graph neural networks. arXiv preprint arXiv:1912.08306, 2019.
[34] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. arXiv preprint arXiv:1802.09691, 2018.
[35] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[36] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[37] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
[38] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. arXiv preprint arXiv:2006.07739, 2020.
[39] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
[40] Sitao Xiang and Hao Li. On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971, 2017.
[41] Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, and Jitendra Malik. Deep isometric learning for visual recognition. In International Conference on Machine Learning, pages 7824–7835. PMLR, 2020.
[42] Lei Chen, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 27–34, 2020.
[43] Kuangqi Zhou, Yanfei Dong, Kaixin Wang, Wee Sun Lee, Bryan Hooi, Huan Xu, and Jiashi Feng. Understanding and resolving performance degradation in graph convolutional networks, 2020.
[44] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
[45] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 338–348, 2020.
[46] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
[47] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
[48] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
[49] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop, 2019.
[50] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.

Appendix A Appendix

A.1 Dataset Statistics

We conduct experiments on four benchmark graph datasets, including Cora, Pubmed, Coauthor-Physics and Ogbn-arxiv. They are widely used to study the over-smoothing issue and test the performance of deep GNNs. We use the public train/validation/test split in Cora and Pubmed, and randomly split Coauthor-Physics by following the previous practice. Their data statistics are summarized in Table 3.

Table 3: Data statistics.

Datasets	# Nodes	# Edges	# Classes	# Features	# Train/Validation/Test nodes	Setting
Cora	2,708	5,429	7	1,433	140/500/1,000	Transductive (one graph)
Pubmed	19,717	44,338	3	500	60/500/1,000	Transductive (one graph)
Coauthor-Physics	34,493	247,962	5	8,415	100/150/34,243	Transductive (one graph)
Ogbn-arxiv	169,343	1,166,243	40	128	90,941/29,799/48,603	Transductive (one graph)

A.2 Baselines

To validate the effectiveness of the Dirichlet energy constrained learning principle and our EGNN on the node classification problem, we consider baseline GCN and other state-of-the-art deep GNNs based upon GCN. They are summarized as follows:

•

GCN [14]. It is mathematically defined in Eq. (1), which learns the node embeddind by simply propagating messages over the normalized adjacency matrix.
•

PairNorm [22]. Based upon GCN, PairNorm is applied between the successive graph convolutional layers to normalize node embeddings and to alleviate the over-smoothing issue.
•

DropEdge [28]. It randomly removes a certain number of edges from the input graph at each training epoch, which reduces the convergence speed of over-smoothing.
•

SGC [39]. It simplies the vanilla GCN by removing all the hidden weights and activation functions, which could avoid the over-fitting issue in GCN.
•

Jumping knowledge network (JKNet) [26]. Based upon GCN, all the hidden node embeddings are combined at the last layer to adapt the effective neighborhood size for each node. Herein we apply max-pooling to combine the the series of node embeddings from the hidden layers.
•

Approximate personalized propagation of neural predictions (APPNP) [44]. It applies personalized PageRank to improve the message propagation scheme in vanilla GCN. Furthermore, APPNP simplifies model by removing the hidden weight and activation function and preserving a small fraction of initial embedding at each layer.
•

Graph convolutional network via initial residual and identity mapping (GCNII) [27]. It is an extension of the vanilla GCN model with two simple techniques at each layer: an initial connection to the input feature and an identity mapping added to the trainable weight.

A.3 Implementation Details

For each experiment, we train with a maximum of $1500$ epochs using the Adam optimizer and early stopping. Following the previous common settings in the considered benchmarks, we list the key training hyperparameters for each of them in Table 4. All the experiment results are reported by the averages of $10$ independent runs.

Table 4: The training hyperparameter settings in benchmarks.

Dataset	Dropout rate	Weight decay (L2)	Learning rate	# Training epoch
Cora	$0.6$	$5\cdot 10^{-4}$	$5\cdot 10^{-3}$	$1500$
Pubmed	$0.5$	$5\cdot 10^{-4}$	$1\cdot 10^{-2}$	$1500$
Coauthor-Physics	$0.6$	$5\cdot 10^{-5}$	$5\cdot 10^{-3}$	$1500$
Ogbn-arxiv	$0.1$	$0$	$3\cdot 10^{-3}$	$1000$

A.4 Lower Limit Setting

We carefully choose the lower limit hyperparameter $c_{\min}$ from range $[0.1,0.75]$ for each dataset based on the classification performance and Dirichlet energy on the validation set. Note that we have the residual connection strengths $\alpha$ and $\beta$ which satisfy constraint: $\alpha+\beta=c_{\min}$ . Specially, we use $c_{\min}$ of $0.2$ (layer number $K<32$ ) and $0.15$ ( $K\geq 32$ ) in Cora, where $\alpha,\beta=0.1$ in all the layer cases. We use $c_{\min}$ of $0.12$ ( $K<32$ ) and $0.11$ ( $K\geq 32$ ) in Pubmed, where $\beta=c_{\min}$ and $\alpha=0$ . We apply $c_{\min}$ of $0.12$ in Coauthor-Physics, where $\beta=0.1$ and $\alpha=0.02$ . We apply $c_{\min}$ and $\beta$ of $0.6$ and $0.1$ when $K<32$ , respectively; we make use of $c_{\min}$ and $\beta$ of $0.75$ and $0.25$ if $K\geq 32$ , respectively. In these two cases, the residual connection strength $\alpha$ to the last layer is $0.5$ .

A.5 Proof for Lemma 1

Lemma 1.

The Dirichlet energy at the $k$ -th layer is bounded as follows:

(1-\lambda_{1})^{2}s^{(k)}_{\min}E(X^{(k-1)})\leq E(X^{(k)})\leq(1-\lambda_{0})^{2}s^{(k)}_{\max}E(X^{(k-1)}).

Proof.

By ignoring the activation function, we obtain the upper bound as below.

\begin{array}[]{rl}E(X^{(k)})&=E(\tilde{P}X^{(k-1)}W^{(k)})\\ &=\text{tr}((\tilde{P}X^{(k-1)}W^{(k)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}W^{(k)}))\\ &=\text{tr}((\tilde{P}X^{(k-1)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)})W^{(k)}(W^{(k)})^{\top})\\ &\leq\text{tr}((\tilde{P}X^{(k-1)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}))\sigma_{\max}(W^{(k)}(W^{(k)})^{\top})\\ &=\text{tr}((\tilde{P}X^{(k-1)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}))s^{(k)}_{\max}\\ &=\text{tr}([(I_{n}-\tilde{\Delta})X^{(k-1)}]^{\top}\tilde{\Delta}[(I_{n}-\tilde{\Delta})X^{(k-1)}])s^{(k)}_{\max}\\ &=\text{tr}((X^{(k-1)})^{\top}\tilde{\Delta}(I_{n}-\tilde{\Delta})^{2}X^{(k-1)})s^{(k)}_{\max}\\ &\leq(1-\lambda_{0})^{2}\text{tr}((X^{(k-1)})^{\top}\tilde{\Delta}X^{(k-1)})s^{(k)}_{\max}\\ &=(1-\lambda_{0})^{2}s^{(k)}_{\max}E(X^{(k-1)}).\end{array}

$\sigma_{\max}(\cdot)$ denotes the maximum eigenvalue of a matrix, and $\tilde{P}=I_{n}-\tilde{\Delta}$ . Since $\text{tr}(X^{\top}\tilde{\Delta}X)\geq 0$ , where $X\in\mathbb{R}^{n\times d}$ is a feature matrix, we can obtain the inequality relationship: $\text{tr}(X^{\top}\tilde{\Delta}XW^{(k)}(W^{(k)})^{\top})\leq\text{tr}(X^{\top}\tilde{\Delta}X)\sigma_{\max}(W^{(k)}(W^{(k)})^{\top})$ . In a similar way, we can also get the upper bound of $(1-\lambda_{0})^{2}s^{(k)}_{\max}E(X^{(k-1)})$ .

Similarly, we derive the lower bound as below.

\begin{array}[]{rl}E(X^{(k)})&=E(\tilde{P}X^{(k-1)}W^{(k)})\\ &=\text{tr}((\tilde{P}X^{(k-1)}W^{(k)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}W^{(k)}))\\ &\geq\text{tr}((\tilde{P}X^{(k-1)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}))\sigma_{\min}(W^{(k)}(W^{(k)})^{\top})\\ &=\text{tr}((\tilde{P}X^{(k-1)})^{\top}\tilde{\Delta}(\tilde{P}X^{(k-1)}))s^{(k)}_{\min}\\ &=\text{tr}((X^{(k-1)})^{\top}\tilde{\Delta}(I_{n}-\tilde{\Delta})^{2}X^{(k-1)})s^{(k)}_{\min}\\ &\geq(1-\lambda_{1})^{2}\text{tr}((X^{(k-1)})^{\top}\tilde{\Delta}X^{(k-1)})s^{(k)}_{\min}\\ &=(1-\lambda_{1})^{2}s^{(k)}_{\min}E(X^{(k-1)}).\end{array}

A.6 Derivation of Proposition 2

All the trainable weights at the graph convolutional layers of EGNN are initialized as the orthogonal diagonal matrices. At the first layer, the upper bound of Dirichlet energy is given by $s^{(1)}_{\max}E(X^{(0)})$ . Given constraint $s^{(1)}_{\max}E(X^{(0)})=c_{\max}E(X^{(0)})$ , we can obtain $W^{(1)}=\sqrt{c_{\max}}\cdot I_{d}$ . The square singular values are then restricted as: $s^{(1)}_{\min}=s^{(1)}_{\max}=c_{\max}$ . For layer $k>1$ , we further relax the upper bound as: $s^{(k)}_{\max}E(X^{(k-1)})\leq\prod_{j=1}^{k}s^{(j)}_{\max}E(X^{(0)})$ . Note that $s^{(1)}_{\max}=c_{\max}$ at the first layer. Given constraint $\prod_{j=1}^{k}s^{(j)}_{\max}E(X^{(0)})=c_{\max}E(X^{(0)})$ , we can obtain weight $W^{(k)}=I_{d}$ . The square singular values are restricted as: $s^{(k)}_{\min}=s^{(k)}_{\max}=1$ , and $\prod_{j=1}^{k}s^{(j)}_{\max}=c_{\max}$ .

A.7 Proof for Lemma 3

Lemma 3.

Based on the above orthogonal initialization, at the starting point of training, the Dirichlet energy of EGNN satisfies the upper limit at each layer $k$ : $E(X^{(k)})\leq c_{\max}E(X^{(0)})$ .

Proof.

According to Lemma 2, the Dirichlet energy at layer $k$ is limited as:

\begin{array}[]{rl}E(X^{(k)})&\leq s^{(k)}_{\max}E(X^{(k-1)})\\ &\leq\prod_{j=1}^{k}s^{(j)}_{\max}E(X^{(0)})\\ &=c_{\max}E(X^{(0)}).\end{array}

A.8 Proof for Lemma 4

Lemma 4.

Proof.

To obtain the Dirichlet energy relationship between $E(X^{(k)})$ and $E(X^{(k-1)})$ , we first expand node embedding $X^{(k)}$ as the series summation in terms of the initial node embedding $X^{(0)}$ . We then re-express the graph convolution at layer $k$ , which is simplified to depend only on node embedding $X^{(k-1)}$ . As a result, we can easily derive Lemma 4. The detailed proofs are provided in the following.

According to Eq. (8), by ignoring the activation function $\sigma$ , we obtain the residual graph convolution at layer $k$ as:

\begin{array}[]{rl}X^{(k)}&=[(1-c_{\min})\tilde{P}X^{(k-1)}+\alpha X^{(k-1)}+\beta X^{(0)}]W^{(k)}.\\ &=[(1-c_{\min})\tilde{P}+\alpha I_{n}]X^{(k-1)}W^{(k)}+\beta X^{(0)}W^{(k)},\end{array}

where $I_{n}$ is an identity matrix with dimension $n$ , and $\alpha+\beta=c_{\min}$ . We define $Q\triangleq(1-c_{\min})\tilde{P}+\alpha I_{n}$ , and then simply the above graph convolution as:

X^{(k)}=QX^{(k-1)}W^{(k)}+\beta X^{(0)}W^{(k)}.

(10)

To facilitate the proof, we further expand the above graph convolution as the series summation in terms of the initial node embedding $X^{(0)}$ as:

X^{(k)}=Q^{k}X^{(0)}\prod_{j=1}^{k}W^{(j)}+\beta\sum_{i=0}^{k-1}(Q^{i}X^{(0)}\prod_{j=k-i}^{k}W^{(j)}),

where the weight matrix product is defined as: $\prod_{j=1}^{k}W^{(j)}\triangleq W^{(1)}W^{(2)}\cdots W^{(k)}$ , and $Q^{0}\triangleq I_{n}$ . Notably, in our EGNN, the trainable weight $W^{(1)}$ at the first layer is orthogonally initialized as diagonal matrix of $\sqrt{c_{\max}}\cdot I_{d}$ , while $W^{(j)}$ at layer $j>1$ is initialized as identity matrix $I_{d}$ . Therefore, the series expansion of $X^{(k)}$ could be simplified as:

\begin{array}[]{rl}X^{(k)}&=(\sqrt{c_{\max}}Q^{k}+\beta\sqrt{c_{\max}}Q^{k-1}+\beta\sum_{i=0}^{k-2}Q^{i})X^{(0)}\\ &\triangleq Z^{(k)}X^{(0)}.\end{array}

$Z^{(1)}\triangleq\sqrt{c_{\max}}Q+\beta\sqrt{c_{\max}}I_{n}$ at the case $k=1$ . Note that $Z^{(k)}$ is invertible if all the eigenvalues of matrix $Q$ are not equal to zero, which could be achieved by selecting an appropriate $\alpha$ depending on the downstream task. Let $\tilde{Z}^{(k)}=[Z^{(k)}]^{-1}$ . We then represent the initial node embedding $X^{(0)}$ as: $X^{(0)}=\tilde{Z}^{(k)}X^{(k)}$ . Similarly, $X^{(0)}=\tilde{Z}^{(k-1)}X^{(k-1)}$ at layer $k-1$ . Therefore, we can re-express the graph convolution at layer $k$ in Eq. (10) as:

X^{(k)}=(Q+\beta\tilde{Z}^{(k-1)})X^{(k-1)}W^{(k)}

According to Lemma 1, the lower bound of Dirichlet energy at layer $k$ is given by:

\begin{array}[]{rl}E(X^{(k)})&\geq\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(k-1)})s^{(k)}_{\min}E(X^{(k-1)}).\end{array}

$\lambda^{2}_{\min}(\cdot)$ denotes the minimum square eigenvalue of a matrix. To get the minimum square eigenvalue, we represent the eigenvalue decomposition of matrix $Q$ as: $Q=V\Lambda V^{-1}$ , where $V\in\mathbb{R}^{n\times n}$ is the eigenvector matrix and $\Lambda\in\mathbb{R}^{n\times n}$ is the diagonal eigenvalue matrix. We then decompose $(Q+\beta\tilde{Z}^{(k-1)})$ as:

\begin{array}[]{rl}Q+\beta\tilde{Z}^{(k-1)}&=Q+\beta(\sqrt{c_{\max}}Q^{k-1}+\beta\sqrt{c_{\max}}Q^{k-2}+\beta\sum_{i=0}^{k-3}Q^{i})^{-1}\\ &=V\Lambda V^{-1}+\beta V(\sqrt{c_{\max}}\Lambda^{k-1}+\beta\sqrt{c_{\max}}\Lambda^{k-2}+\beta\sum_{i=0}^{k-3}\Lambda^{i})^{-1}V^{-1}.\end{array}

Let $\lambda_{Q}$ denote the eigenvalue of matrix $Q$ . Recalling $Q\triangleq(1-c_{\min})\tilde{P}+\alpha I_{n}$ and $\tilde{P}\triangleq I_{n}-\tilde{\Delta}$ . Since the eigenvalues of $\tilde{P}$ are within $(-1,1]$ , we have $-(1-c_{\min})+\alpha<\lambda_{Q}\leq(1-c_{\min})+\alpha<1$ . To ensure that $Q$ is invertible, we could apply a larger value of $\alpha$ to have $-(1-c_{\min})+\alpha>0$ . The square eigenvalue of matrix $(Q+\beta\tilde{Z}^{(k-1)})$ is:

\lambda^{2}(Q+\beta\tilde{Z}^{(k-1)})=(\lambda_{Q}+\beta(\sqrt{c_{max}}\lambda_{Q}^{k-1}+\beta\sqrt{c_{max}}\lambda_{Q}^{k-2}+\beta\frac{1-\lambda_{Q}^{k-2}}{1-\lambda_{Q}})^{-1})^{2}

It could be easily validated that $\frac{\partial\log(\lambda^{2}(Q+\beta\tilde{Z}^{(k-1)}))}{\partial k}\geq 0$ . That means the square eigenvalue increases with the layer $k$ . Considering the extreme case of $k\rightarrow\infty$ , we obtain $\lambda^{2}(Q+\beta\tilde{Z}^{(k-1)})\rightarrow 1$ . Since $s^{(k)}_{\min}=1$ at layer $k>1$ , we thus obtain $E(X^{(k)})\geq E(X^{(k-1)})\geq c_{\min}E(X^{(k-1)})$ when $k\rightarrow\infty$ . In practice, since $\lambda^{2}(Q+\beta\tilde{Z}^{(k-1)})$ approximates to one with the increasing of layer $k$ , the Dirichlet energy will be maintained as a constant at the higher layers of EGNN, which is empirically validated in Figure 1.

The minimum square eigenvalue is achieved when $k=1$ , i.e., $\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(0)})$ , where $\tilde{Z}^{(0)}=I_{n}$ and $\lambda_{Q}$ is close to $-(1-c_{\min})+\alpha$ . In this case, we obtain $\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(0)})=(2c_{\min}-1)^{2}$ . At layer $k=1$ , we have $s^{(1)}_{\min}=c_{\max}$ . Since $E(X^{(1)})\geq\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(0)})c_{\max}E(X^{(0)})$ , to make sure $E(X^{(1)})\geq c_{\min}E(X^{(0)})$ at the first layer, we only need to satisfy the following condition:

\begin{array}[]{cc}&\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(0)})c_{\max}\geq c_{\min}\\ \Rightarrow&c_{\max}\geq c_{\min}/(2c_{\min}-1)^{2}.\end{array}

Note that the square eigenvalue is increasing with $k$ , and $s^{(k)}_{\min}=1\geq c_{\max}$ for layer $k>1$ . At the higher layer $k>1$ , we have $\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(k-1)})s^{(k)}_{\min}\geq\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(0)})c_{\max}$ . Therefore, once the condition of $c_{\max}\geq c_{\min}/(2c_{\min}-1)^{2}$ is satisfied, we can obtain $\lambda^{2}_{\min}(Q+\beta\tilde{Z}^{(k-1)})s^{(k)}_{\min}\geq c_{\min}$ and $E(X^{(k)})\geq c_{\min}E(X^{(k-1)})$ for all the layers $k$ in EGNN.

A.9 Proof for Lemma 5

Lemma 5.

Proof.

According to the proof of Lemma 4, we have $X^{(k)}=Z^{(k)}X^{(0)}$ . Based on Lemma 1, the upper bound of Dirichlet energy at layer $k$ is given by:

E(X^{(k)})\leq\lambda^{2}_{\max}(Z^{(k)})E(X^{(0)}),

where $\lambda^{2}_{\max}(\cdot)$ is the maximum square eigenvalue of a matrix. According to the definition of $Z^{(k)}$ and the eigenvalue decomposition of $Q$ in the proof of Lemma 4, we decompose $Z^{(k)}$ as:

\begin{array}[]{rl}Z^{(k)}&=(\sqrt{c_{\max}}Q^{k}+\beta\sqrt{c_{\max}}Q^{k-1}+\beta\sum_{i=0}^{k-2}Q^{i})\\ &=V(\sqrt{c_{\max}}\Lambda^{k}+\beta\sqrt{c_{\max}}\Lambda^{k-1}+\beta\sum_{i=0}^{k-2}\Lambda^{i})V^{-1}.\end{array}

Therefore, the square eigenvalue of $Z^{(k)}$ is given by:

\lambda^{2}(Z^{(k)})=(\sqrt{c_{max}}\lambda_{Q}^{k}+\beta\sqrt{c_{max}}\lambda_{Q}^{k-1}+\beta\frac{1-\lambda_{Q}^{k-1}}{1-\lambda_{Q}})^{2},

where $\lambda_{Q}$ denotes the eigenvalue of matrix $Q$ . Recalling $Q\triangleq(1-c_{\min})\tilde{P}+\alpha I_{n}$ and $\tilde{P}\triangleq I_{n}-\tilde{\Delta}$ . The maximum square eigenvalue $\lambda^{2}_{\max}(Z^{(k)})$ is achieved when $\lambda_{Q}$ takes the largest value, i.e., $\lambda_{Q}=\theta_{0}=(1-c_{\min})(1-\lambda_{0})+\alpha$ , where $\lambda_{0}$ is the non-zero eigenvalue of matrix $\tilde{\Delta}$ that is most close to value $0$ . Therefore, we have $\lambda^{2}_{\max}(Z^{(k)})=c_{\max}(\theta^{k}_{0}+\beta\theta^{k-1}_{0}+\frac{\beta(1-\theta^{k-1}_{0})}{\sqrt{c_{\max}}(1-\theta_{0})})^{2}$ . To ensure that $E(X^{(k)})\leq c_{\max}E(X^{(0)})$ for all the layers, we have to satisfy the condition of $\lambda^{2}_{\max}(Z^{(k)})\leq c_{\max}$ . Since $\theta_{0}>0$ , we simplify this condition in the followings:

\begin{array}[]{cc}&\lambda^{2}_{\max}(Z^{(k)})\leq c_{\max}\\ \Rightarrow&\theta^{k}_{0}+\beta\theta^{k-1}_{0}+\frac{\beta(1-\theta^{k-1}_{0})}{\sqrt{c_{\max}}(1-\theta_{0})}\leq 1\\ \Rightarrow&\frac{\beta(1-\theta^{k-1}_{0})}{(1-\theta^{k}_{0}-\beta\theta^{k-1}_{0})(1-\theta_{0})}\leq\sqrt{c_{\max}}\\ \Rightarrow&\frac{\beta(1-\theta^{k-1}_{0})}{(1-\theta^{k-1}_{0}(\beta+\theta_{0}))(1-\theta_{0})}\leq\sqrt{c_{\max}}\\ \Rightarrow&\frac{\beta(1-\theta^{k-1}_{0})}{(1-\theta^{k-1}_{0}(1-(1-c_{\min})\lambda_{0}))(1-\theta_{0})}\leq\sqrt{c_{\max}}\end{array}

Note that $0<(1-(1-c_{\min})\lambda_{0})<1$ and $1-\theta^{k-1}_{0}<1-\theta^{k-1}_{0}(1-(1-c_{\min})\lambda_{0})$ . The above condition can be satisfied if $\frac{\beta}{1-\theta_{0}}\leq\sqrt{c_{\max}}$ . Note that $1-\theta_{0}=(1-c_{\min})\lambda_{0}+\beta$ . Therefore, if $\sqrt{c_{\max}}\geq\frac{\beta}{(1-c_{\min})\lambda_{0}+\beta}$ , we obtain $E(X^{(k)})\leq c_{\max}E(X^{(0)})$ for all the layers in EGNN. Such condition can be easily satisfied by adopting $c_{\max}=1$ .

A.10 Proof for Lemma 6

Lemma 6.

We have $E(\sigma(X^{(k)}))\leq E(X^{(k)})$ if activation function $\sigma$ is ReLU or Leaky-ReLU [32].

Proof.

Herein we directly adopt the proof from [32] to support the self-containing in this paper. Let $c_{1}$ , $c_{2}\in\mathbb{R}_{+}$ , and let $a$ , $b\in\mathbb{R}$ . We have the following relationships:

\begin{array}[]{cc}|c_{1}a-c_{2}b|&\geq|\sigma(c_{1}a)-\sigma(c_{2}b)|\\ &=|c_{1}\sigma(a)-c_{2}\sigma(b)|.\end{array}

The first inequality holds for activation function $\sigma$ whose Lipschitz constant is smaller than $1$ , including ReLU and Leaky-ReLU. The second equality holds because $\sigma(cx)=c\sigma(x)$ , $\forall c\in\mathbb{R}_{+}$ and $x\in\mathbb{R}$ . Recalling the Dirichlet energy definition in Eq. (2): $E(X^{(k)})=\frac{1}{2}\sum a_{ij}||\frac{x^{(k)}_{i}}{\sqrt{1+d_{i}}}-\frac{x^{(k)}_{j}}{\sqrt{1+d_{j}}}||_{2}^{2}$ . By extending to the vector space and replacing $c_{1}$ , $c_{2}$ , $a$ , and $b$ with $\frac{1}{\sqrt{1+d_{i}}}$ , $\frac{1}{\sqrt{1+d_{j}}}$ , $x^{(k)}_{i}$ , and $x^{(k)}_{j}$ , respectively, we can obtain $E(\sigma(X^{(k)}))\leq E(X^{(k)})$ .

A.11 Hyperparameter Analysis

To further understand the hyperparameter impacts on EGNN and answer research question Q4, we conduct more experiments and show in Figures 3, 4 and 5 for Pubmed, Coauthor-Physics and Ogbn-arxiv, respectively.

Similar to the hyperparameter study on Cora, we observe that our method is consistently not sensitive to the choices of $b$ , $\gamma$ , $c_{\min}$ and $c_{\max}$ within wide value ranges for all the datasets. The appropriate ranges of $b$ , $\gamma$ , $c_{\min}$ and $c_{\max}$ are $(-\infty,0]$ , $[1,\infty]$ , $[0.1,0.75]$ and $[0.2,1]$ , respectively. Specially, in the large graph of Ogbn-arxiv, our model could even has a large initialization range for $b$ . Given these wide hyperparameter ranges, EGNN could be easily constructed and trained with deep layers.