
Deep Graph Attention Networks

Jun Kato
Graduate School of Science and Technology, Kwansei Gakuin University
Sanda, Japan
[email protected]

Airi Mita, Keita Gobara, Akihiro Inokuchi
School of Science and Technology, Kwansei Gakuin University
Sanda, Japan
[email protected]
Abstract

Graphs are useful for representing various real-world objects. However, graph neural networks (GNNs) tend to suffer from over-smoothing, where the representations of nodes of different classes become similar as the number of layers increases, leading to performance degradation. A method that does not require protracted tuning of the number of layers is needed to effectively construct a graph attention network (GAT), a type of GNN. Therefore, we introduce a method called "DeepGAT" for predicting the class to which nodes belong in a deep GAT. It avoids over-smoothing in a GAT by ensuring that nodes in different classes are not similar at each layer. Using DeepGAT to predict class labels, a 15-layer network is constructed without the need to tune the number of layers. DeepGAT prevented over-smoothing and achieved a 15-layer GAT with performance similar to that of a 2-layer GAT, as indicated by the similar attention coefficients. DeepGAT enables a deep network to be trained so that it acquires attention coefficients similar to those of a network with few layers. It avoids the over-smoothing problem and obviates the need to tune the number of layers, thus saving time and enhancing GNN performance.

Index Terms:
Graph Neural Network, Graph Attention Network, Reproductive Property of Probability Distribution

I Introduction

Graphs are a useful data structure for representing various objects in the real world. For example, accounts and links in social networks correspond to nodes and edges in graphs, respectively, so the relationships in such networks are represented as graphs. Atoms and chemical bonds in molecules also correspond to nodes and edges in graphs, respectively, so molecules are likewise represented as graphs. Various other objects, such as hyperlink structures and function calls in computer programs, can also be represented by graphs. Because various objects can be stored as graph data, technologies for analyzing such data have been attracting a great deal of attention in recent years [9, 23, 19].

The graph neural network (GNN) [23] is a representative deep learning method for graph data. GNNs convolve features of the neighboring nodes of a node $v$ into the representation of $v$. A GNN with $L$ layers repeats this convolution $L$ times and obtains the representation $\bm{h}_v$ of $v$. Once each node is represented in vector form, various conventional machine learning methods, such as classification, clustering, and regression, can be applied to the graph data, including for node classification, node clustering, and link prediction.

Representative GNNs include the graph convolutional network (GCN) [9] and the graph attention network (GAT) [19]. Figure 1 shows the difference between the convolutions of GCN and GAT; the features of the nodes on the darker blue background strongly affect the representation of $v$. For each node $v$, GCN uniformly convolves the features of nodes within $l$ steps of $v$ into the representation of $v$, whereas GAT convolves the features non-uniformly using attention coefficients. Thus, GCN convolves the features of the red nodes in Fig. 1 into the representation of $v$, whereas GAT does not always convolve those features, which results in GCNs and GATs learning different representations of the structure around $v$.

Figure 1: Convolutions in graph neural networks.

Most GNNs, including GCNs and GATs, share a common drawback called over-smoothing [13, 15, 2, 29], a phenomenon in which the representations of nodes belonging to different classes become similar to each other as the number of layers increases. This degrades GNN performance. For this reason, shallow GCNs or GATs with 2-3 layers are often used in practice. However, for some graph datasets, even a deep GAT can provide high performance, so the number of layers $L$ must be tuned by starting with a shallow GAT and gradually increasing $L$. This tuning requires considerable time. Therefore, a methodology that can construct a deep GAT for many datasets without spending time tuning $L$ would be useful.

To achieve this, this paper [8] proposes a method for predicting, at each layer of a deep GAT, which class each node belongs to; the prediction error at each layer is incorporated into the loss function. On the basis of the prediction, if a node $v$ has neighbors that belong to the same class as $v$, only the features of those neighbors are convolved into the representation of $v$. Unlike the case of over-smoothing, the likelihood that features of nodes belonging to a different class from $v$ are convolved into the representation of $v$ is small, so the representations of nodes belonging to different classes do not become similar to each other; over-smoothing can thus be avoided.

The contributions of this paper are summarized as follows:

  • First, our proposed method, called DeepGAT, achieves a 15-layer GAT with the same level of performance as a 2-layer GAT. Instead of starting with a small $L$ and gradually increasing its value, a 15-layer DeepGAT can be constructed without tuning $L$.

  • Second, we mathematically show why DeepGAT can avoid over-smoothing, on the basis of the reproductive property of the probability distribution (see Lemma 2). We also show mathematically that the overlap in distributions of representations of nodes belonging to different classes is small, and we demonstrate that this is also true for benchmark datasets.

II Preliminaries

A graph whose nodes have features is represented as $G=(V,E,X)$, where $V=\{1,2,\ldots,n\}$ is a set of nodes, $E\subseteq V\times V$ is a set of edges, and $X=\{\bm{x}_1,\bm{x}_2,\ldots,\bm{x}_n\}\in\mathbb{R}^{d\times n}$ is a set of feature vectors. We define the neighborhood of a vertex $v$ as $N_v=\{u\mid(v,u)\in E\}\cup\{v\}$. Each node belongs to one of two classes $C=\{0,1\}$ (in this paper, we address binary node classification, although the proposed method can be extended to multi-class node classification, multi-label node classification, and graph classification), and we denote the set of nodes belonging to class $c\in C$ as $V_c\subset V$. In addition, we assume that the feature $\bm{x}_v$ of a node $v\in V_c$ follows the normal distribution $\mathcal{N}(\bm{\mu}_c,\Sigma_c)$.

We call a sequence of edges $\langle(v_1,v'_1),(v_2,v'_2),\ldots,(v_l,v'_l)\rangle$ satisfying $v'_i=v_{i+1}$ a path, and $l$ the length of the path. Paths in this paper are not necessarily simple, and we allow a path to satisfy $v_i=v'_i$. We denote the set of paths of length $l$ from node $v$ to node $u$ as $P_{v\sim u}^{l}$. In addition, we denote the vector of the numbers of paths of length $l$ from $v$ as $\bm{\pi}_v^{l}=\sum_{u\in V}|P_{v\sim u}^{l}|\bm{1}_{n,u}=(|P_{v\sim 1}^{l}|,|P_{v\sim 2}^{l}|,\ldots,|P_{v\sim n}^{l}|)^{T}$, where $\bm{1}_{n,u}$ is the $n$-dimensional one-hot vector whose $u$-th element is 1. Using these notations, we define $\bm{\pi}_{v,c}^{l}=\sum_{u\in V_c}|P_{v\sim u}^{l}|\bm{1}_{n,u}$ and $P_v^{l}=\bigcup_{c\in C}\bigcup_{u\in V_c}P_{v\sim u}^{l}=\bigcup_{c\in C}P_{v,c}^{l}$.
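As an illustration of these path counts (not part of the paper), $|P_{v\sim u}^{l}|$ can be obtained from powers of the adjacency matrix with self-loops, because a path here may satisfy $v_i=v'_i$; the toy graph and class assignment below are hypothetical.

import numpy as np

# Toy path graph with n = 4 nodes and edges (0,1), (1,2), (2,3); self-loops
# are added because the definition allows v_i = v'_i within a path.
n = 4
A = np.zeros((n, n), dtype=int)
for v, u in [(0, 1), (1, 2), (2, 3)]:
    A[v, u] = A[u, v] = 1
A_hat = A + np.eye(n, dtype=int)          # adjacency with self-loops

l = 2
walks = np.linalg.matrix_power(A_hat, l)  # walks[v, u] = |P_{v~u}^l|
pi_v = walks[0]                           # vector pi_v^l for v = 0

# Splitting the counts by class gives pi_{v,c}^l; the class assignment
# below is purely illustrative.
classes = np.array([0, 0, 1, 1])
pi_v_c0 = np.where(classes == 0, pi_v, 0)
pi_v_c1 = np.where(classes == 1, pi_v, 0)
print(pi_v, pi_v_c0, pi_v_c1)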

Given a graph $G$ and the ground-truth class labels $C'$ for nodes $V'\subset V$ as input, the problem addressed in this paper is to accurately predict the class labels of the nodes $V\setminus V'$ by learning node representations for $V$ from $G$ and $C'$.

III Related Works and their Drawbacks

We will now discuss GAT, a type of GNN in which node representations are obtained from a graph. For each node $v$ in a graph, GAT convolves the features of the nodes $N_v$ into the feature of $v$. The convolution in GAT consists of Aggregate and Combine steps, as shown below:

$$\bm{a}_v^{l} = {\it Aggregate}^{l}(\{\bm{h}_u^{l-1} \mid u\in N_v\}) = \sum_{u\in N_v}\alpha_{vu}\bm{h}_u^{l-1}, \qquad (1)$$
$$\bm{h}_v^{l} = {\it Combine}^{l}(\bm{h}_v^{l-1},\bm{a}_v^{l}) = \sigma(W^{l}\bm{a}_v^{l}), \qquad (2)$$

where $W^{l}$ is a learnable matrix and $\sigma$ is the activation function. In addition, $\alpha_{vu}$ is the attention coefficient computed from $\bm{h}_v^{l-1}$ and $\bm{h}_u^{l-1}$. Starting from $\bm{h}_v^{0}=\bm{x}_v$, GAT repeats this convolution $L$ times and then outputs $\bm{h}_v^{L}$ as the representation of node $v$. (Although GATs usually use multi-head attention, we discuss GATs with single-head attention until Section IV of this paper; the software used in Section V is implemented with $K$-head attention.)
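As an illustration of Eqs. (1) and (2), a single convolution can be written as follows. This is a minimal dense PyTorch sketch, not the authors' implementation; ELU is assumed for $\sigma$, and the attention coefficients are taken as given.

import torch
import torch.nn.functional as F

def gat_layer(H, alpha, W):
    """One GAT convolution: Aggregate (Eq. (1)) then Combine (Eq. (2)).

    H:     (n, d_in)     node representations h_u^{l-1}
    alpha: (n, n)        attention coefficients; alpha[v, u] = 0 if u not in N_v
    W:     (d_out, d_in) learnable weight matrix W^l
    """
    A_agg = alpha @ H            # a_v^l = sum_{u in N_v} alpha_vu h_u^{l-1}
    return F.elu(A_agg @ W.T)    # h_v^l = sigma(W^l a_v^l)

# Tiny usage example with random data.
n, d_in, d_out = 5, 8, 16
H = torch.randn(n, d_in)
alpha = torch.softmax(torch.randn(n, n), dim=1)   # placeholder coefficients
W = torch.randn(d_out, d_in)
print(gat_layer(H, alpha, W).shape)               # torch.Size([5, 16])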

GAT is an extension of GCN. In GCN, $\alpha_{vu}^{l}$ does not depend on $\bm{h}_v^{l}$ and $\bm{h}_u^{l}$; it is fixed to $\frac{1}{|N_v|}$. In contrast, GAT aggregates the features of neighboring nodes whose features are similar to those of node $v$ into the representation of $v$. Thus, it does not convolve the features of nodes within $l$ steps of $v$ uniformly, but instead selects features of nodes in particular directions, as shown in Fig. 1. The following functions are used for computing attention coefficients between two vectors $\bm{q}$ and $\bm{k}$:

$${\it AD}(\bm{q},\bm{k}) = \langle\bm{w},\bm{q}\,||\,\bm{k}\rangle \quad \mbox{(additive attention [1, 19, 21])}$$
$${\it DP}(\bm{q},\bm{k}) = \langle\bm{q},\bm{k}\rangle \quad \mbox{(dot-product attention [18, 14])}$$
$${\it SD}(\bm{q},\bm{k}) = \frac{\langle\bm{q},\bm{k}\rangle}{dim(\bm{q})} \quad \mbox{(scaled dot-product attention)}$$

where $\bm{w}$ is a learnable vector, $dim$ is a function that returns the number of dimensions of a vector, $\langle\cdot,\cdot\rangle$ is the dot product, and $||$ is the concatenation of two vectors. Using one of these attention functions, $\alpha_{vu}$ is computed by

$$\overline{\alpha}_{vu}^{l} = attention(W^{l}\bm{h}_v^{l},W^{l}\bm{h}_u^{l})$$
$$\alpha_{vu}^{l} = \frac{\exp(\overline{\alpha}_{vu}^{l})}{\sum_{w\in N_v}\exp(\overline{\alpha}_{vw}^{l})} \quad \mbox{(Softmax)}.$$
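The three attention functions and the softmax normalization can be sketched as follows (a minimal PyTorch illustration with arbitrary vectors q and k and a learnable vector w; single-head attention is assumed).

import torch

def additive_attention(q, k, w):
    # AD(q, k) = <w, q || k>
    return torch.dot(w, torch.cat([q, k]))

def dot_product_attention(q, k):
    # DP(q, k) = <q, k>
    return torch.dot(q, k)

def scaled_dot_product_attention(q, k):
    # SD(q, k) = <q, k> / dim(q)
    return torch.dot(q, k) / q.numel()

def normalize(raw_scores):
    # Softmax over the neighborhood N_v turns raw scores into alpha_vu.
    return torch.softmax(raw_scores, dim=0)

q, k = torch.randn(4), torch.randn(4)
w = torch.randn(8)
raw = torch.stack([additive_attention(q, k, w),
                   dot_product_attention(q, k),
                   scaled_dot_product_attention(q, k)])
print(normalize(raw))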

GAT has been actively applied in various methods. For example, the gated attention network [26] calculates importance values for $K$ attention heads. The heterogeneous graph attention network [21] learns heterogeneous graphs with GAT, and the signed GAT [7] combines signed directed networks and GAT. The relational GAT [27] adds attention networks that learn the dependencies between emotional polarity and opinion words, called aspects, in sentences.

One of the major challenges for GNNs, including GATs, is over-smoothing: a phenomenon in which the node representations $\bm{h}_v^{L}=\bm{h}_v$ become similar to each other as the number of layers $L$ in the GNN is increased, resulting in a decrease in classification accuracy. Yajima and Inokuchi proved, with the following lemma, that the distribution followed by $\bm{h}_v^{l}$ in a GCN is normal, using the reproductive property of probability distributions.

Lemma 1 ([24])

We assume that $\sigma$ in a GCN is the identity function. The representations $\bm{h}_v^{l}$ in the GCN follow the normal distribution defined below:

$$\mathcal{N}\left(\frac{|P_{v,0}^{l}|}{|P_v^{l}|}W_{1\sim l}\bm{\mu}_0+\frac{|P_{v,1}^{l}|}{|P_v^{l}|}W_{1\sim l}\bm{\mu}_1,\ \frac{\|\bm{\pi}_{v,0}^{l}\|_2^2}{|P_v^{l}|^2}W_{1\sim l}^{T}\Sigma_0 W_{1\sim l}+\frac{\|\bm{\pi}_{v,1}^{l}\|_2^2}{|P_v^{l}|^2}W_{1\sim l}^{T}\Sigma_1 W_{1\sim l}\right),$$

where $W_{1\sim l}=W^{l}W^{l-1}\cdots W^{1}$. $\blacksquare$

According to Lemma 1, the mean $\bm{\mu}_L$ of the distribution followed by $\bm{h}_v=\bm{h}_v^{L}$, which is used for classification, is $\frac{|P_{v,0}^{L}|\bm{\mu}_0+|P_{v,1}^{L}|\bm{\mu}_1}{|P_{v,0}^{L}|+|P_{v,1}^{L}|}W_{1\sim L}$. For large $L$, because the ratio $\frac{|P_{v,0}^{L}|}{|P_{v,1}^{L}|}$ approaches the class ratio $\frac{|V_0|}{|V_1|}$, $\bm{\mu}_L$ internally divides $\bm{\mu}_0$ and $\bm{\mu}_1$ in that ratio; for example, if the two classes are balanced ($|V_0|=|V_1|$), $\bm{\mu}_L$ approaches $\frac{1}{2}(\bm{\mu}_0+\bm{\mu}_1)W_{1\sim L}$ for every node $v$. Because the representations of nodes belonging to the two classes follow distributions with an identical mean $\bm{\mu}_L$ for large $L$, they are similar to each other. This also holds for GATs. Therefore, GCNs and GATs often have only two or three layers.

The objective of this paper is to extend GAT, i.e., to derive a multi-layer GAT that achieves high classification accuracy while preventing over-smoothing, by taking advantage of the properties of the representations shown in Lemma 1.

According to the survey [11], there are three main approaches to reducing over-smoothing: normalization, residual connections, and changing the GNN dynamics. Typical methods of the first approach are DropEdge [16] and PairNorm [28]. DropEdge randomly removes a certain number of edges from the input graph at each training epoch. PairNorm transforms the input features in preprocessing as follows:

$$\hat{\bm{x}}_v = \bm{x}_v-\frac{1}{|V|}\sum_{u\in V}\bm{x}_u,$$
$$\bm{x}_v = \frac{s\hat{\bm{x}}_v}{\sqrt{\frac{1}{|V|}\sum_{u\in V}\|\hat{\bm{x}}_u\|_2^2}},$$

where $s>0$ is a hyperparameter. In the second approach, a residual connection [6, 12] allows the input to bypass one or more layers and be added directly to the output of a later layer. One implementation is $\bm{x}_v^{l}=\bm{x}_v^{l-1}+\sigma(W^{l}\bm{a}_v^{l})$. This technique helps mitigate the vanishing gradient problem, enabling the training of deeper networks. GCNII [3], which uses residual connections, demonstrated that over-smoothing is reduced even in 64-layer GNNs. The third approach qualitatively changes the dynamics of message-passing propagation. For example, Gradient Gating (G$^2$) [10] harnesses local gradients due to differences between features to further modulate the convolution as follows:

$$\hat{\bm{\tau}}_v^{l} = \sigma(W^{l}\bm{a}_v^{l}),$$
$$\tau_{vk}^{l} = \tanh\left(\sum_{u\in N_v}|\hat{\tau}_{vk}^{l}-\hat{\tau}_{uk}^{l}|^{p}\right),$$
$$\bm{x}_v^{l} = (1-\bm{\tau}_v^{l})\odot\bm{x}_v^{l-1}+\bm{\tau}_v^{l}\odot\sigma(W^{l}\bm{a}_v^{l}),$$

where $p\geq 0$ is a hyperparameter and $\odot$ is the Hadamard product.
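As an illustration of the first approach, the PairNorm transform above can be sketched in a few lines (a minimal numpy version; the feature matrix X and scale s are placeholders):

import numpy as np

def pairnorm(X, s=1.0):
    """Center features and rescale them to a common average norm (PairNorm)."""
    X_hat = X - X.mean(axis=0, keepdims=True)          # x_hat_v = x_v - mean_u x_u
    scale = np.sqrt((X_hat ** 2).sum(axis=1).mean())   # sqrt((1/|V|) sum_u ||x_hat_u||_2^2)
    return s * X_hat / scale

X = np.random.randn(6, 4)
print(pairnorm(X).shape)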

IV DeepGAT

IV-A GAT with an Oracle Predicting $\overline{\alpha}_{vu}$

In this subsection, we assume that there exists the following oracle $o$, which outputs exactly whether two nodes $v$ and $u\in N_v$ belong to the same class:

$$o(v,u)=\left\{\begin{array}{ll}1&(\mbox{if nodes $v$ and $u$ belong to the same class})\\ 0&(\mbox{otherwise}).\end{array}\right.$$

For a node $v$, only the features of nodes belonging to the same class as $v$ are convolved into the representation of $v$ when we use $\overline{\alpha}_{vu}^{l}=o(v,u)$. In this case, we obtain the following lemma.
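For illustration, the oracle-controlled aggregation of the first layer can be sketched as follows (assuming the oracle acts as a hard mask so that only same-class neighbors contribute, as in Lemma 2; the toy graph and labels below are hypothetical).

import numpy as np

def oracle(y, v, u):
    # o(v, u) = 1 iff nodes v and u belong to the same class.
    return int(y[v] == y[u])

def oracle_aggregate(X, neighbors, y, v):
    """Average the features of v's neighbors that share v's class.

    X:         (n, d) input features
    neighbors: dict mapping v to N_v (which includes v itself)
    y:         (n,)   ground-truth class labels
    """
    same = [u for u in neighbors[v] if oracle(y, v, u)]
    return X[same].mean(axis=0)      # a_v^1 with uniform weights over N_v ∩ V_c

# Toy example: node 0's neighborhood is {0, 1, 2}; node 2 is in the other class.
X = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 1.0]])
neighbors = {0: [0, 1, 2]}
y = np.array([0, 0, 1])
print(oracle_aggregate(X, neighbors, y, 0))   # only nodes 0 and 1 contribute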

Lemma 2

Let $\check{P}_{v\sim u}^{l}$ be the paths in $P_{v\sim u}^{l}$ consisting only of nodes belonging to the same class as $v$. We define $\check{\bm{\pi}}_{v,c}^{l}$ as $\sum_{u\in V_c}|\check{P}_{v\sim u}^{l}|\bm{1}_{n,u}$. When the activation functions are the identity function, the representation $\bm{h}_v^{l}$ of a node $v$ belonging to class $c$ in the GAT with the oracle follows

$$\mathcal{N}\left(W_{1\sim l}\bm{\mu}_c,\ \frac{\|\check{\bm{\pi}}_{v,c}^{l}\|_2^2}{\|\check{\bm{\pi}}_{v,c}^{l}\|_1^2}W_{1\sim l}^{T}\Sigma_c W_{1\sim l}\right).~\blacksquare \qquad (3)$$
Proof 1

The linear sum $a\bm{q}+b\bm{q}'$ of random variables $\bm{q}$ and $\bm{q}'$ that follow the identical distribution $\mathcal{N}(\bm{\mu},\Sigma)$ also follows $\mathcal{N}((a+b)\bm{\mu},(a^2+b^2)\Sigma)$. Therefore, because Eq. (1) aggregates $|N_v\cap V_c|$ features that follow an identical distribution in the first layer of the GAT in which the oracle controls the attention coefficients, $\bm{a}_v^{1}$ follows $\mathcal{N}(\bm{\mu}_c,\frac{1}{|N_v\cap V_c|}\Sigma_c)$. Because a linear transformation of random variables that follow a normal distribution yields random variables that follow another normal distribution, $\bm{h}_v^{1}$ follows $\mathcal{N}(W^{1}\bm{\mu}_c,\frac{1}{|N_v\cap V_c|}{W^{1}}^{T}\Sigma_c W^{1})$. In the $l$-layer GAT with the oracle, the number of features convolved into the representation of $v$ is consistent with the number of paths $|\check{P}_v^{l}|=|\bigcup_{u\in V}\check{P}_{v\sim u}^{l}|=\|\check{\bm{\pi}}_{v,c}^{l}\|_1$. Therefore, the distribution that $\bm{h}_v^{l}$ follows is expressed as Eq. (3). $\Box$

The mean of the distribution followed by the representations of nodes belonging to class $c$ never depends on the mean $\bm{\mu}_{\overline{c}}$ of the other class $\overline{c}$. When $W^{l}$ is orthogonal, the variance in Eq. (3) becomes $\frac{\|\check{\bm{\pi}}_{v,c}^{l}\|_2^2}{\|\check{\bm{\pi}}_{v,c}^{l}\|_1^2}\Sigma_c$. Since $0<\frac{\|\check{\bm{\pi}}_{v,c}^{l}\|_2^2}{\|\check{\bm{\pi}}_{v,c}^{l}\|_1^2}\leq 1$ holds, the variance decreases as $l$ increases. Because the representations $\bm{h}_v^{l}$ of nodes $v$ belonging to class $c$ follow a different distribution from the one followed by the representations $\bm{h}_u^{l}$ of nodes $u$ belonging to $\overline{c}$, and because the variances of the distributions decrease as $l$ increases, increasing $L$ in turn increases the likelihood that the GAT with the oracle classifies correctly.
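As a quick numerical illustration of this variance reduction (a simulation sketch, not taken from the paper), averaging $k$ i.i.d. same-class features keeps the class mean while shrinking the standard deviation by roughly $1/\sqrt{k}$, so the empirical overlap between the two classes shrinks as more same-class features are convolved:

import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, sigma = -1.0, 1.0, 2.0     # 1-D class means and common std dev

for k in [1, 4, 16]:
    # Each "representation" averages k i.i.d. same-class features,
    # mimicking one oracle-controlled aggregation with k neighbors.
    h0 = rng.normal(mu0, sigma, size=(10000, k)).mean(axis=1)
    h1 = rng.normal(mu1, sigma, size=(10000, k)).mean(axis=1)
    # Empirical error rate of the midpoint classifier, a proxy for the overlap.
    overlap = 0.5 * (np.mean(h0 > 0) + np.mean(h1 < 0))
    print(k, h0.std(), h1.std(), overlap)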

Fig. 2 shows the distribution of the input feature vectors $\bm{x}_v=\bm{h}_v^{0}$. The two bell-shaped distributions correspond to class 0 and class 1, and the gray areas are the overlap of those distributions. The overlap is related to the Bayes error rate, which is the lowest possible error rate of any classifier and represents the limit of classification accuracy given the inherent noise in the data. It is the probability that a randomly chosen data point is misclassified by an optimal classifier that knows the true underlying distributions. In the presence of the oracle, the variances of the distributions of $\bm{h}_v^{l}$ are reduced by convolving $l$ times, and the overlap of the distributions is correspondingly reduced. Therefore, if we can correctly predict at each layer the class to which each node belongs, we can expect to reduce over-smoothing.

Figure 2: The overlaps of two distributions.

IV-B Implementation of the Oracle in DeepGAT

Input: $G=(V,E,X)$, $Y\in\mathbb{R}^{n\times|C|}$, and $L$
Output: Predictions of class labels
1:  $(\bm{h}_1^{0},\bm{h}_2^{0},\ldots,\bm{h}_n^{0})\leftarrow X$
2:  for $l\in\{1,2,\ldots,L\}$ do
3:      $Z\leftarrow rm\_diag(\hat{A}^{l-1})Y$
4:      $(\bm{z}_1,\bm{z}_2,\ldots,\bm{z}_n)\leftarrow Z^{T}$
5:      for $v\in V$ do
6:          $\bm{p}_v^{l-1}\leftarrow W_1^{l}(\bm{h}_v^{l-1}\,||\,\bm{z}_v)$
7:          $\hat{\bm{y}}_v^{l-1}\leftarrow softmax(\bm{p}_v^{l-1})$
8:      for $(v,u)\in E$ do
9:          $\overline{\alpha}_{vu}^{l}\leftarrow att(\hat{\bm{y}}_v^{l-1},\hat{\bm{y}}_u^{l-1})$
10:         $\overline{\alpha}_{uv}^{l}\leftarrow\overline{\alpha}_{vu}^{l}$
11:     for $v\in V$ do
12:         for $u\in N_v$ do
13:             $\alpha_{vu}^{l}\leftarrow\frac{\exp(\overline{\alpha}_{vu}^{l})}{\sum_{w\in N_v}\exp(\overline{\alpha}_{vw}^{l})}$
14:         $\bm{a}_v^{l}\leftarrow\sum_{u\in N_v}\alpha_{vu}^{l}\bm{h}_u^{l-1}$
15:         $\bm{h}_v^{l}\leftarrow\sigma(W_2^{l}\bm{a}_v^{l})$
16: for $v\in V$ do
17:     $\hat{\bm{y}}_v^{L}\leftarrow softmax(\bm{h}_v^{L})$
18: return $\{\hat{\bm{y}}_v^{l}\mid(v,l)\in V\times\{0,1,\ldots,L\}\}$
Algorithm 1: Feedforward procedure of DeepGAT
Input: $G=(V,E,X)$ and $L$
Output: Predictions of class labels
1:  $(\bm{h}_1^{0},\bm{h}_2^{0},\ldots,\bm{h}_n^{0})\leftarrow X$
2:  for $l\in\{1,2,\ldots,L\}$ do
3:      for $(v,u)\in E$ do
4:          $\overline{\alpha}_{vu}^{l}\leftarrow att(W^{l}\bm{h}_v^{l-1},W^{l}\bm{h}_u^{l-1})$
5:          $\overline{\alpha}_{uv}^{l}\leftarrow\overline{\alpha}_{vu}^{l}$
6:      for $v\in V$ do
7:          for $u\in N_v$ do
8:              $\alpha_{vu}^{l}\leftarrow\frac{\exp(\overline{\alpha}_{vu}^{l})}{\sum_{w\in N_v}\exp(\overline{\alpha}_{vw}^{l})}$
9:          $\bm{a}_v^{l}\leftarrow\sum_{u\in N_v}\alpha_{vu}^{l}\bm{h}_u^{l-1}$
10:         $\bm{h}_v^{l}\leftarrow\sigma(W^{l}\bm{a}_v^{l})$
11: for $v\in V$ do
12:     $\hat{\bm{y}}_v^{L}\leftarrow softmax(W\bm{h}_v^{L})$
13: return $\{\hat{\bm{y}}_v^{L}\mid v\in V\}$
Algorithm 2: Feedforward procedure of GAT

This subsection discusses how to implement the oracle. Because even a GAT with a few layers can predict class labels with reasonably high accuracy, the proposed DeepGAT predicts the class labels of nodes in each layer. To achieve this, DeepGAT linearly transforms the representation in each layer into a $|C|$-dimensional representation $\bm{p}_v^{l}$ and then computes $\hat{\bm{y}}_v^{l}=(\hat{y}_{v,0}^{l},\hat{y}_{v,1}^{l})^{T}\in\mathbb{R}^{|C|}=\mathbb{R}^{2}$ from $\bm{p}_v^{l}$ in each layer, where $\hat{y}_{v,c}^{l}$ is the probability that $v$ belongs to class $c$ as predicted in the $l$-th layer. If the prediction $\hat{\bm{y}}_v^{l}$ is completely correct, then $\langle\hat{\bm{y}}_v^{l},\hat{\bm{y}}_u^{l}\rangle$ coincides with $o(v,u)$. Therefore, we let $\overline{\alpha}_{vu}^{l}$ be $\langle\hat{\bm{y}}_v^{l},\hat{\bm{y}}_u^{l}\rangle$.

Algorithm 1 shows the pseudo-code of the feedforward procedure of DeepGAT. For comparison, the pseudo-code of the conventional GAT is shown in Algorithm 2. The correctness of DeepGAT's convolution depends on the accuracy of the prediction $\hat{\bm{y}}_v^{l}$ at each layer. Various previous studies have demonstrated that incorporating class label information $Y\in\mathbb{R}^{n\times|C|}$ as additional input can improve GNN performance. To improve the accuracy of this prediction, we combine DeepGAT with this label propagation [25, 20, 22, 17]. Here, $Y$ is the raw label matrix whose rows corresponding to the nodes $V'$ with ground-truth labels are one-hot vectors, while the rows for the nodes $V\setminus V'$ are filled with zeros. $\hat{A}$ in Line 3 of Algorithm 1 is the row-normalized adjacency matrix of $G$, $\hat{A}^{l}$ is the $l$-th power of $\hat{A}$, and the function $rm\_diag$ sets the diagonal elements of the matrix to zero to avoid label leakage [25]. In Line 6, the concatenation of the representation $\bm{h}_v^{l-1}$ and the label aggregation $\bm{z}_v$ is linearly transformed with the learnable parameter $W_1^{l}$, and the class labels of $v$ are then predicted in Line 7. In Lines 9 and 10, attention coefficients are computed from the predicted class labels. The convolutions using these attention coefficients are repeated for $L$ layers, and the class labels $\hat{\bm{y}}_v^{L}$ are predicted in the final layer.
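To make the per-layer prediction step concrete, the following is a minimal dense PyTorch sketch of the computation in Lines 5-15 of Algorithm 1, under simplifying assumptions (dense adjacency, single-head dot-product attention, ELU as $\sigma$); it is an illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def deepgat_layer(H, Z, adj, W1, W2):
    """One DeepGAT convolution (cf. Algorithm 1, Lines 5-15).

    H:   (n, d)    representations h_v^{l-1}
    Z:   (n, c)    aggregated label information z_v (zeros if unused)
    adj: (n, n)    0/1 adjacency with self-loops (u in N_v)
    W1:  (c, d+c)  prediction head; W2: (d', d) convolution weights
    Returns the per-layer class predictions and the new representations.
    """
    P = torch.cat([H, Z], dim=1) @ W1.T          # p_v^{l-1}
    Y_hat = torch.softmax(P, dim=1)              # y_hat_v^{l-1}
    raw = Y_hat @ Y_hat.T                        # alpha_bar_vu = <y_hat_v, y_hat_u>
    raw = raw.masked_fill(adj == 0, float('-inf'))
    alpha = torch.softmax(raw, dim=1)            # softmax over N_v
    H_new = F.elu((alpha @ H) @ W2.T)            # h_v^l = sigma(W2^l a_v^l)
    return Y_hat, H_new

# Usage with random tensors.
n, d, c, d_out = 6, 8, 2, 8
H, Z = torch.randn(n, d), torch.zeros(n, c)
adj = (torch.rand(n, n) > 0.5).float()
adj = ((adj + adj.T) > 0).float()    # symmetrize for an undirected graph
adj.fill_diagonal_(1.0)              # N_v contains v itself
W1, W2 = torch.randn(c, d + c), torch.randn(d_out, d)
Y_hat, H_new = deepgat_layer(H, Z, adj, W1, W2)
print(Y_hat.shape, H_new.shape)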

DeepGAT learns $W_1^{l}$ and $W_2^{l}$ from the ground-truth labels. According to Subsection IV-A, if DeepGAT correctly predicts the labels of nodes in each layer, the variance of the representations in the final layer becomes smaller. Since prediction errors in the layers close to the input layer propagate toward the output layer, the correctness of predictions in the layers close to the output layer depends on the correctness of predictions in the layers close to the input layer. Therefore, to suppress prediction errors in the layers close to the input layer, we use the following loss function:

$$-\sum_{l=1}^{L}\gamma(l)\underline{\sum_{v\in V'}\left[y_v\log\hat{y}_{v,0}^{l}+(1-y_v)\log(1-\hat{y}_{v,0}^{l})\right]}, \qquad (4)$$
$$\mbox{where }\gamma(l)=\frac{\delta}{l+\delta}+1.$$

The underlined part of Eq. (4) is the cross-entropy loss for the $l$-th layer, and it is multiplied by $\gamma(l)$, which decreases monotonically as $l$ increases and converges to 1, where $\delta>0$ is a hyperparameter. By using the loss function (4), the proposed DeepGAT aims to both

  • minimize prediction error by minimizing Eq. (4) and

  • reduce the variance of the distributions by Eq. (3).
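A sketch of how the layer-weighted loss in Eq. (4) could be computed from the per-layer predictions returned by Algorithm 1 (an illustrative PyTorch version; the tensor shapes and the train mask are assumptions):

import torch
import torch.nn.functional as F

def deepgat_loss(y_hats, y, train_mask, delta=1.0):
    """Layer-weighted cross-entropy loss of Eq. (4).

    y_hats:     list of (n, 2) per-layer class probabilities y_hat_v^l, l = 1..L
    y:          (n,) ground-truth class indices for all nodes
    train_mask: (n,) bool mask selecting the labeled nodes V'
    """
    loss = 0.0
    for l, y_hat in enumerate(y_hats, start=1):
        gamma = delta / (l + delta) + 1.0             # gamma(l), decreasing in l
        ce = F.nll_loss(torch.log(y_hat[train_mask] + 1e-12), y[train_mask])
        loss = loss + gamma * ce
    return loss

# Usage with random per-layer predictions for L = 3 layers and n = 10 nodes.
n, L = 10, 3
y = torch.randint(0, 2, (n,))
mask = torch.zeros(n, dtype=torch.bool)
mask[:6] = True                                       # labeled nodes V'
y_hats = [torch.softmax(torch.randn(n, 2), dim=1) for _ in range(L)]
print(deepgat_loss(y_hats, y, mask))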

V Experimental Evaluation

V-A Experimental Setting

TABLE I: Summary of benchmark datasets.

Dataset   | # of graphs | |V|    | |E|     | d     | |C| | avg. degree | max. degree | hub node rate | diameter | density 2|E|/(|V|(|V|-1)) | avg. cluster coefficient cce
Cora      | 1           | 2,708  | 10,556  | 1,433 | 7   | 9.8         | 338         | 2.18%         | 19       | 0.18%                     | 24.07%
CiteSeer  | 1           | 3,327  | 9,104   | 3,703 | 6   | 7.5         | 200         | 1.23%         | 28       | 0.11%                     | 14.15%
CS        | 1           | 18,333 | 163,788 | 6,805 | 15  | 19.9        | 274         | 18.04%        | 24       | 0.05%                     | 34.25%
Physics   | 1           | 34,493 | 495,924 | 8,415 | 3   | 30.8        | 766         | 36.40%        | 17       | 0.04%                     | 37.76%
Flickr    | 1           | 89,250 | 899,756 | 500   | 7   | 22.2        | 10,852      | 11.57%        | 8        | 0.01%                     | 3.30%
PPI       | 24          | 56,944 | 818,716 | 50    | 121 | 28.3        | 721         | 27.43%        | -        | -                         | 27.06%
Figure 3: Performance degradation of GAT.

This section demonstrates that DeepGAT prevents over-smoothing by comparing DeepGAT with the original GAT. In the experiments in this paper, the proposed method is compared only with GAT, not with state-of-the-art GNNs. By comparing only with GAT, we verify whether predicting which class each node belongs to at each layer, and convolving node representations based on that prediction, is effective. If it is effective, our idea can be used in various methods based on GAT. Comparisons with state-of-the-art GNNs and with GNNs that reach depths of over 100 layers are left for future work. We implemented DeepGAT with PyTorch and PyTorch Geometric. See config.yaml at https://github.com/JNKT215/DeepGAT_CANDAR for the hyperparameter settings of each model.

Table I summarizes the benchmark datasets used in our experimental evaluation. Cora, CiteSeer, CS, and Physics are article citation networks, PPI is a network of protein-protein interactions, and Flickr is a network of images. We treated these networks as undirected graphs. PPI is a dataset for multi-label node classification, and the other datasets are used for multi-class node classification, so the label propagation in Lines 3 and 4 of Algorithm 1 was not used for PPI. We treated nodes with degree greater than 30 as hub nodes, and defined the hub node rate as $\frac{|\{v\mid v\in V,|N_v|>30\}|}{|V|}$. Figure 3 shows the performance degradation of GAT, defined as the difference between the best micro-F1 score over various $L$ on each dataset and the micro-F1 score at the maximum $L$ that could be computed in an acceptable time. The performance degradation of GAT on CS and Physics is significant, while that on Cora and CiteSeer is not. The former datasets have higher hub node rates, and many of them also have higher cluster coefficients $cce$. When there are more hub nodes in the graph, the average distance between any two nodes is smaller, so fewer layers are needed for a feature of one node to propagate into the representation of another node. This makes over-smoothing more likely to occur. In the next subsection, we report whether the proposed method can mitigate over-smoothing for CS, Physics, Flickr, and PPI, whose hub node rates are more than 10%.

V-B Experimental Results

V-B1 Micro-F1 Scores with Various Numbers of Layers

First, we confirmed that DeepGAT prevents over-smoothing when the number of layers is increased. Figure 4 shows the micro-F1 scores of DeepGAT and GAT with various numbers of layers on the CS, Physics, Flickr, and PPI datasets. DP and SD indicate dot-product attention and scaled dot-product attention, respectively. To suppress prediction errors in the layers close to the input layer and to reduce computation time, label propagation in Lines 3 and 4 of Algorithm 1 was limited to the first through third layers.

When $L$ is increased, the micro-F1 scores of GAT decrease significantly due to over-smoothing. In contrast, DeepGAT with 15 layers for CS and Physics, and with 9 layers for Flickr, maintains a relatively high micro-F1 score compared with DeepGAT with 2 layers, which indicates that DeepGAT can prevent over-smoothing. The reason why the scores decrease slightly is that the same amount of data was used to train $W_1^{1},W_1^{2},\ldots,W_1^{L},W_2^{1},W_2^{2},\ldots,W_2^{L}$ for DeepGATs with 2 and 15 layers. Although the difference between Algorithms 1 and 2 is slight, the difference in their performance is very large. Table II shows the best micro-F1 score (denoted by "best") over various $L$ on each dataset and the micro-F1 score (denoted by "max.") at the maximum $L$ that could be computed in an acceptable time. The numbers in parentheses in Table II are the numbers of layers $L$ at which the scores were obtained, and the numbers in bold are the highest scores for each dataset. Table II shows that DeepGAT is superior to GAT and that DeepGAT allows the construction of deeper networks than GAT.

Figure 4: Micro-F1 scores with various numbers of layers $L$ for (a) CS, (b) Physics, (c) Flickr, and (d) PPI.
TABLE II: Best micro-F1 scores and micro-F1 scores at the maximum $L$.

Model        |      | CS        | Physics   | Flickr   | PPI
DeepGAT (DP) | best | 91.5 (2)  | 93.6 (4)  | 54.6 (2) | 98.9 (7)
             | max. | 83.1 (15) | 91.8 (15) | 53.2 (9) | 98.8 (9)
DeepGAT (SD) | best | 90.4 (2)  | 93.6 (5)  | 53.8 (4) | 99.2 (8,9)
             | max. | 82.7 (15) | 90.8 (15) | 47.2 (9) | 99.2 (9)
GAT (DP)     | best | 90.5 (2)  | 93.4 (2)  | 52.4 (3) | 92.5 (3)
             | max. | 9.3 (15)  | 28.1 (15) | 42.3 (9) | 44.0 (9)
GAT (SD)     | best | 90.6 (2)  | 92.5 (2)  | 52.9 (3) | 99.1 (6)
             | max. | 4.6 (15)  | 12.3 (15) | 42.3 (9) | 98.9 (9)
Figure 5: Comparison of $D_{KL}(\bm{\alpha}_v^{2}||\bm{\alpha}_v^{L_{max}})$ for (a) CS, (b) Physics, (c) Flickr, and (d) PPI.

V-B2 Comparison between Attention Coefficients at 2 Layers

Next, we confirmed that DeepGAT with a large number of layers acquires attention coefficients as appropriate as those of DeepGAT with a small number of layers. Concretely, we compute the KL divergence between the attention coefficients $\bm{\alpha}_v^{2}=(\alpha_{v1}^{2},\alpha_{v2}^{2},\ldots,\alpha_{v|V|}^{2})^{T}$ of DeepGAT with 2 layers and $\bm{\alpha}_v^{L_{max}}$ of DeepGAT with $L_{max}$ layers. If the impact of over-smoothing on DeepGAT is small, $\overline{\alpha}_{vu}^{l}$ should be almost the same as $o(v,u)$ for DeepGATs with both large and small numbers of layers. Therefore, by measuring the difference between $\bm{\alpha}_v^{2}$ and $\bm{\alpha}_v^{L_{max}}$ with Eq. (5), we can determine how accurately the features of the neighbors of $v$ that belong to the same class as $v$ are convolved into the representation of $v$ in DeepGAT.

$$D_{KL}(\bm{\alpha}_v^{L_1}||\bm{\alpha}_v^{L_2})=\sum_{u\in N_v,\,\alpha_{vu}^{L_2}\neq 0}\alpha_{vu}^{L_1}\log\frac{\alpha_{vu}^{L_1}}{\alpha_{vu}^{L_2}} \qquad (5)$$

Figure 5 shows box-and-whisker diagrams of $D_{KL}(\bm{\alpha}_v^{2}||\bm{\alpha}_v^{L_{max}})$ computed for the nodes $v\in V$. The figure shows that the variance of $D_{KL}(\bm{\alpha}_v^{2}||\bm{\alpha}_v^{L_{max}})$ for DeepGAT is smaller than that for GAT. Thus, DeepGAT with a large number of layers is trained to acquire attention coefficients similar to those of DeepGAT with a small number of layers. These results show that, for each node $v$, DeepGAT with more layers convolves only a small number of features of $v$'s neighbors belonging to a different class than $v$ into the representation of $v$, compared with the original GAT.
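For reference, Eq. (5) restricted to a node's neighborhood can be computed as follows (a small numpy sketch; the coefficient vectors below are made up for illustration).

import numpy as np

def attention_kl(alpha_a, alpha_b, eps=1e-12):
    """D_KL(alpha_a || alpha_b) over a node's neighborhood, as in Eq. (5).

    Terms where alpha_b is (numerically) zero are skipped, as in Eq. (5);
    terms with alpha_a = 0 contribute nothing and are skipped as well.
    """
    keep = (alpha_a > 0) & (alpha_b > eps)
    a, b = alpha_a[keep], alpha_b[keep]
    return float(np.sum(a * np.log(a / b)))

# Attention of a 2-layer model vs. a deep model over the same 4 neighbors.
alpha_2  = np.array([0.40, 0.35, 0.20, 0.05])
alpha_15 = np.array([0.38, 0.33, 0.22, 0.07])
print(attention_kl(alpha_2, alpha_15))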

V-B3 Error Rate $e_N$ Using the Nearest Neighbor Method

Figure 6: Error rate $e_N$ using the nearest neighbor method for (a) CS, (b) Physics, and (c) Flickr.

In the discussion immediately following the proof of Lemma 2, we noted that in the presence of the oracle, the variance of the distribution that the representation $\bm{h}_v^{l}$ follows decreases as $l$ increases. The smaller the overlap between the distribution $\mathcal{D}_0$ followed by the representation $\bm{h}_v^{L}$ of a node $v$ belonging to class 0 and the distribution $\mathcal{D}_1$ followed by the representation $\bm{h}_u^{L}$ of a node $u$ belonging to class 1, the more accurate we can expect the classification to be. The overlap corresponds to the Bayes error rate, which is the lowest possible error rate of any classifier for the given dataset. A low rate suggests that a representation useful for classification has been learned. In practice, however, it is difficult to know the exact probability distribution $\mathcal{D}_c$ from $\bm{h}_v^{L}$. Thus, we estimate the upper and lower bounds of the Bayes error rate $e_B$ from the error rate $e_N$ of the nearest neighbor method with a large number of prototypes in the low-dimensional space. According to previous studies [4, 5],

$$e_B\leq e_N\leq e_B(2-\pi e_B)\leq 2e_B,$$

where $\pi=\frac{|C|}{|C|-1}$. Therefore, the upper and lower bounds of the Bayes error rate $e_B$ are given by

$$\frac{1}{2}e_N\leq\frac{1-\sqrt{1-\pi e_N}}{\pi}\leq e_B\leq e_N$$

for $0\leq e_N\leq 0.5$. Figure 6 shows $e_N$ for the CS, Physics, and Flickr datasets when the number of layers $L$ is increased. Figure 6 does not include $e_N$ for PPI, because PPI is a dataset for multi-label node classification. In Fig. 6, the yellow broken line is obtained by replacing $att(\hat{\bm{y}}_v^{l-1},\hat{\bm{y}}_u^{l-1})$ in Line 9 of Algorithm 1 with the oracle $o(v,u)$. The error rate represented by this line gradually decreases as $L$ increases, which demonstrates that the variance of the distribution $\mathcal{D}_c$ for each class $c$ decreases, as discussed after the proof of Lemma 2. In contrast, when $\overline{\alpha}_{vu}$ is computed from the learned representations rather than from the oracle, $e_N$ increases slightly as $L$ increases from 1. This is because $\overline{\alpha}_{vu}$ computed from the learned representation does not exactly match $o(v,u)$. However, $e_N$ for DeepGAT is always smaller than that for GAT, which indicates that the overlap between the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ followed by the representations $\bm{h}_v^{L}$ learned by DeepGAT is small.
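The estimation of $e_N$ and the resulting bounds on $e_B$ can be sketched as follows (a self-contained numpy illustration on synthetic representations, not the paper's experimental pipeline; a leave-one-out 1-nearest-neighbor rule stands in for "a large number of prototypes").

import numpy as np

def nn_error_rate(H, y):
    """Leave-one-out 1-nearest-neighbor error rate e_N on representations H."""
    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # exclude the point itself
    nearest = D.argmin(axis=1)
    return float(np.mean(y[nearest] != y))

def bayes_error_bounds(e_n, num_classes=2):
    """Lower/upper bounds on the Bayes error e_B implied by e_N [4, 5]."""
    pi = num_classes / (num_classes - 1)
    lower = (1.0 - np.sqrt(max(1.0 - pi * e_n, 0.0))) / pi
    return lower, e_n                            # e_B lies in [lower, e_N]

# Synthetic 2-class representations.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(-1.0, 0.8, size=(200, 4)),
               rng.normal(+1.0, 0.8, size=(200, 4))])
y = np.repeat([0, 1], 200)
e_n = nn_error_rate(H, y)
print(e_n, bayes_error_bounds(e_n))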

VI Conclusion

GNNs tend to suffer from over-smoothing, which leads to performance degradation. Therefore, this paper introduced DeepGAT, which predicts the class to which each node belongs at each layer of a deep GAT. It avoids over-smoothing in a GAT by ensuring that nodes in different classes are not similar at each layer. Using DeepGAT to predict class labels, a 15-layer network could be constructed without the need to tune the number of layers. DeepGAT prevented over-smoothing and achieved a 15-layer GAT with performance similar to that of a 2-layer GAT, as indicated by the similar attention coefficients. DeepGAT enables a deep network to be trained so that it acquires attention coefficients similar to those of a network with few layers. It avoids the over-smoothing problem and obviates the need to tune the number of layers, thus saving time and enhancing GNN performance.

References

  • [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. Proc. of International Conference on Learning Representations, 2015.
  • [2] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun: Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View. Proc. of the AAAI Conference on Artificial Intelligence, 3438–3445, 2020.
  • [3] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li: Simple and Deep Graph Convolutional Networks. Proc. of International Conference on Machine Learning, 1725–1735, 2020.
  • [4] Thomas M. Cover and Peter E. Hart: Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27, 1967.
  • [5] Keinosuke Fukunaga and Donald M. Hummels: Bias of Nearest Neighbor Error Estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1), 103–112, 1987.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun: Deep Residual Learning for Image Recognition. Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016.
  • [7] Junjie Huang, Huawei Shen, Liang Hou, and Xueqi Cheng: Signed Graph Attention Networks. Proc. of ICANN Workshop and Special Sessions, 566–577, 2019.
  • [8] Jun Kato, Airi Mita, Keita Gobara, and Akihiro Inokuchi. Deep Graph Attention Networks. Proc. of International Workshop on GPU Computing and AI, 2024 (to appear).
  • [9] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. Proc. of International Conference on Learning Representations, 2017.
  • [10] T. Konstantin Rusch, Benjamin Paul Chamberlain, Michael W. Mahoney, Michael M. Bronstein, and Siddhartha Mishra: Gradient Gating for Deep Multi-Rate Learning on Graphs. Proc. of International Conference on Learning Representations, 2023.
  • [11] T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra: A Survey on Oversmoothing in Graph Neural Networks. arXiv preprint arXiv:2303.10993, 2023.
  • [12] Guohao Li, Matthias Müller, Ali K. Thabet, and Bernard Ghanem: DeepGCNs: Can GCNs Go As Deep As CNNs? Proc of IEEE/CVF International Conference on Computer Vision, 9266–9275, 2019.
  • [13] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning. Proc. of AAAI Conference on Artificial Intelligence, 3538–3545, 2018.
  • [14] Thang Luong, Hieu Pham, and Christopher D. Manning: Effective Approaches to Attention-based Neural Machine Translation. Proc. of Conference on Empirical Methods in Natural Language Processing, 1412–1421, 2015.
  • [15] Kenta Oono and Taiji Suzuki: Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. Proc. of International Conference on Learning Representations, 2020.
  • [16] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. Proc. of International Conference on Learning Representations, 2020.
  • [17] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjing Wang, and Yu Sun: Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification. Proc. of International Joint Conference on Artificial Intelligence, 1548–1554, 2021.
  • [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. Proc. of Annual Conference on Neural Information Processing Systems, 5998–6008, 2017.
  • [19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph Attention Networks. Proc. of International Conference on Learning Representations, 2018.
  • [20] Hongwei Wang and Jure Leskovec: Unifying Graph Convolutional Neural Networks and Label Propagation. arXiv preprint arXiv:2002.06755. 2020.
  • [21] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu: Heterogeneous Graph Attention Network. The World Wide Web Conference, 2022–2032, 2019.
  • [22] Yangkun Wang, Jiarui Jin, Weinan Zhang, Yong Yu, Zheng Zhang, David Wipf: Bag of Tricks of Semi-Supervised Classification with Graph Neural Networks. arXiv preprint arXiv:2103.13355, 2021.
  • [23] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? Proc. of International Conference on Learning Representations, 2019.
  • [24] Yuta Yajima and Akihiro Inokuchi. Why Deeper Graph Neural Network Performs Worse? Discussion and Improvement About Deep GNNs. Proc. of International Conference on Artificial Neural Networks (2). 731–743, 2022.
  • [25] Xiaocheng Yang, Mingyu Yan, Shirui Pan, Xiaochun Ye, and Dongrui Fan: Simple and Efficient Heterogeneous Graph Neural Network. Proc. of AAAI Conference on Artificial Intelligence, 10816–10824, 2023.
  • [26] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung: GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. Proc. of Conf. on Uncertainty in Artificial Intelligence, 339–349, 2018.
  • [27] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung: Relational Graph Attention Network for Aspect-based Sentiment Analysis. Proc. of Annual Meeting of Association for Computational Linguistics, 3229–3238, 2020.
  • [28] Lingxiao Zhao and Leman Akoglu: PairNorm: Tackling Oversmoothing in GNNs. Proc. of International Conference on Learning Representations, 2020.
  • [29] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun: Graph Neural Networks: A review of Methods and Applications. AI Open 1: 57–81, 2020.