Graph Attention Retrospective
Abstract
Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an “easy” regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the “hard” regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention’s robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.
1 Introduction
Graph learning has received a lot of attention recently due to breakthrough learning models [21, 42, 13, 18, 23, 5, 16, 22, 28] that are able to exploit multi-modal data that consist of nodes and their edges as well as the features of the nodes. One of the most important problems in graph learning is the problem of classification, where the goal is to classify the nodes or edges of a graph given the graph and the features of the nodes. Two of the most fundamental mechanisms for classification, and graph learning in general, are graph convolution and graph attention. Graph convolution, usually defined using its spatial version, corresponds to averaging the features of a node with the features of its neighbors [28]. (Although the model in [28] is related to spectral convolutions, it is mainly a spatial convolution since messages are propagated along graph edges. More broadly, graph convolution can refer to different variants arising from different (approximations of) graph spectral filters. We provide details in Section 2.) Graph attention [44] mechanisms augment this convolution by appropriately weighting the edges of a graph before spatially convolving the data. Graph attention is able to do this by using information from the given features for each node. Despite its wide adoption by practitioners [19, 46, 25] and its large academic impact, the number of works that rigorously study its effectiveness is quite limited.
One of the motivations for using a graph attention mechanism, as opposed to a simple convolution, is the expectation that the attention mechanism is able to distinguish inter-class edges from intra-class edges and consequently weight inter-class and intra-class edges differently before performing the convolution step. This ability essentially maintains the weights of important edges and significantly reduces the weights of unimportant edges, and thus it allows graph convolution to aggregate features from a subset of neighbor nodes that help node classification tasks. In this work, we explore the regimes in which this heuristic picture holds for simple node classification tasks, namely classifying the nodes in a contextual stochastic block model (CSBM) [8, 17]. The CSBM is a coupling of the stochastic block model (SBM) with a Gaussian mixture model, where the features of the nodes within a class are drawn from the same component of the mixture model. For a more precise definition, see Section 2. We focus on the case of two classes, where our analysis is sufficiently precise to understand the performance of graph attention and to build useful intuition about it. We briefly and informally summarize our contributions as follows:
1. In the "easy regime", i.e., when the distance between the means of the Gaussian mixture model is much larger than the standard deviation, we show that there exists a choice of attention architecture that distinguishes inter-class edges from intra-class edges with high probability (Theorem 3). In particular, we show that the attention coefficients for one class of edges are much higher than for the other class of edges (Corollary 4). Furthermore, we show that these attention coefficients lead to a perfect node classification result (Corollary 5). However, in the same regime, we also show that the graph is not needed to perfectly classify the data (Proposition 7).
2. In the "hard regime", i.e., when the distance between the means is small compared to the standard deviation, we show that every attention architecture is unable to distinguish inter-class from intra-class edges with high probability (Theorem 9). Moreover, we show that using the original Graph Attention Network (GAT) architecture [44], with high probability, most of the attention coefficients are going to have uniform weights, similar to those of simple graph convolution [28] (Theorem 10). We also consider a setting where graph attention is provided with independent and perfect edge information, and we show that even in this case, if the distance between the means is sufficiently small, then graph attention fails to (almost) perfectly classify the nodes with high probability (Theorem 13).
3. Beyond perfect node classification, we show that graph attention is able to assign a significantly higher weight to self-loops, irrespective of the parameters of the Gaussian mixture model that generates the node features or of the stochastic block model that generates the graph. Consequently, we show that with high probability, graph attention convolution is at least as good as the best linear classifier of node features (Theorem 14). In a high structural noise regime where the graph is not helpful for node classification, i.e., when the intra-class edge probability $p$ is close to the inter-class edge probability $q$, this means that graph attention is strictly better than simple graph convolution because it is able to essentially ignore the graph structural noise (Corollary 15). In addition, we obtain lower bounds on graph attention's performance for almost perfect node classification and partial node classification (Corollary 16).
4. We provide an extensive set of experiments, both on synthetic data and on popular real-world datasets, that validates our theoretical results.
1.1 Limitations of our theoretical setting
In this work, to study the benefits and the limitations of the graph attention mechanism, we focus on node classification tasks using the CSBM generative model with two classes and Gaussian node features. This theoretical setting has a few limitations. First, we note that many real-world networks are much more complex and may exhibit different structures than the ones obtainable from a stochastic block model. Furthermore, real-world node attributes such as categorical features may not even follow a continuous probability distribution. Clearly, there are gaps between the CSBM and the actual generative processes of real-world data. Nonetheless, we note that a good understanding of graph attention's behavior over the CSBM would already provide us with useful insights. For example, it has been shown empirically that synthetic graph datasets based on the CSBM constitute good benchmarks for evaluating various GNN architectures [40]. To that end, in Section 6, we summarize some practical implications of our theoretical results which could be useful for practitioners working with GNNs.
Second, there are different levels of granularity for machine learning problems on graphs. Besides node-level tasks such as node classification, other important problems include edge-level tasks such as link prediction and graph-level tasks such as graph classification. While, empirically, variants of graph attention networks have been successfully applied to solve problems at all levels of granularity, our theoretical results mostly pertain to the effects of graph attention for node classification. This is a limitation on the scope of our study. On the other hand, we note that edge-level and graph-level tasks are typically solved by adding a pooling layer which combines node representations from previous graph (attention) convolution layers. Consequently, the quality of graph attention convolution’s output for a node not only affects node classification but also has a major influence on link prediction and graph classification. Therefore, our results may be extended to study the effects of graph attention on link prediction and graph classification under the CSBM generative model. For example, link prediction is closely related to edge classification which we extensively study in Section 3. For graph classification, one may consider the problem of classifying graphs generated from different parameter settings of the CSBM. In this case, our results on graph attention’s impact on node representations may be used to establish results on graph attention’s impact on graph representations after certain pooling layers. In particular, if graph attention convolution generates good node representations that are indicative of node class memberships, then one can show that the graph representation obtained from appropriately aggregating node representations would be indicative of the clustering structure of the graph, and hence the graph representation would be useful for graph classification tasks.
1.2 Relevant work
Recently the concept of attention for neural networks [6, 43] was transferred to graph neural networks [32, 10, 44, 30, 41]. A few papers have attempted to understand the attention mechanism in [44]. One work relevant to ours is [11]. In this paper, the authors show that a node may fail to assign large edge weight to its most important neighbors due to a global ranking of nodes generated by the attention mechanism in [44]. Another related work is [29], which presents an empirical study of the ability of graph attention to generalize on larger, complex, and noisy graphs. In addition, in [24] the authors propose a different metric to generate the attention coefficients and show empirically that it has an advantage over the original GAT architecture.
Other work related to ours, which does not focus on graph attention, comes from the field of statistical learning on random data models. Random graphs and the stochastic block model have been traditionally used in clustering and community detection [1, 4, 39]. Moreover, the works by [8, 17], which also rely on the CSBM, focus on the fundamental limits of unsupervised learning. Of particular relevance is the work by [7], which studies the performance of graph convolution on the CSBM as a semi-supervised learning problem. Within the context of random graphs, [27] studies the approximation power of graph neural networks on random graphs. In [37] the authors derive the generalization error of graph neural networks for graph classification and regression tasks. In our paper we are interested in understanding graph attention's capability for edge/node classification with respect to different parameter regimes of the CSBM.
Finally, there are a few related theoretical works on understanding the generalization and representation power of graph neural networks [14, 15, 49, 47, 20, 34, 35]. For a recent survey in this direction see [26]. Our work takes a statistical perspective which allows us to characterize the precise performance of graph attention compared to graph convolution and no convolution on the CSBM, with the goal of answering the particular questions that we posed above.
2 Preliminaries
In this section, we describe the Contextual Stochastic Block Model (CSBM) [17] which serves as our data model, and the Graph Attention mechanism [44]. We also define notations and terms that are frequently used throughout the paper.
For a vector $x$, $\|x\|$ denotes the Euclidean norm of $x$, i.e. $\|x\| = \sqrt{\sum_i x_i^2}$. We write $[n] = \{1, \dots, n\}$. We use $\mathrm{Ber}(p)$ to denote the Bernoulli distribution, so $\varepsilon \sim \mathrm{Ber}(p)$ means the random variable $\varepsilon$ takes value 1 with probability $p$ and 0 with probability $1-p$. Let $n$ be the number of nodes, $d$ the feature dimension, and let $\varepsilon_1, \dots, \varepsilon_n \sim \mathrm{Ber}(1/2)$ be independent. Define two classes as $C_b = \{ i \in [n] : \varepsilon_i = b \}$ for $b \in \{0, 1\}$. For each index $i \in [n]$, we set the feature vector as $X_i \sim N((2\varepsilon_i - 1)\mu, \sigma^2 I)$, where $\mu \in \mathbb{R}^d$, $\sigma > 0$, and $I$ is the $d \times d$ identity matrix. (The means of the mixture of Gaussians are $\pm\mu$. Our results can be easily generalized to general means; the current setting makes our analysis simpler without loss of generality.) For a given pair $(p, q) \in [0,1]^2$ we consider the stochastic adjacency matrix $A$ defined as follows. For $i, j$ in the same class (i.e., an intra-class edge), we set $a_{ij} \sim \mathrm{Ber}(p)$, and if $i, j$ are in different classes (i.e., an inter-class edge), we set $a_{ij} \sim \mathrm{Ber}(q)$. We denote by $(A, X) \sim \mathrm{CSBM}(n, p, q, \mu, \sigma^2)$ a sample obtained according to the above random process.
An advantage of the CSBM is that it allows us to control the noise by controlling the parameters of the distributions of the model. In particular, the CSBM allows us to control the distance between the means and the variance of the Gaussians, which are important for controlling the separability of the Gaussians. For example, fixing the variance, the closer the means are, the more difficult the separation of the Gaussian node features becomes. Another notable advantage of the CSBM is that it allows us to control the structural noise and the homophily level of the graph. The level of noise in the graph is controlled by increasing or reducing the gap between the intra-class edge probability $p$ and the inter-class edge probability $q$, and the level of homophily in the graph is controlled by the relative magnitude of $p$ and $q$. For example, when $p$ is much larger than $q$, then a node is more likely to be connected with a node from the same class, and hence we obtain a homophilous graph; when $q$ is much larger than $p$, then a node is more likely to be connected with a node from a different class, and hence we obtain a heterophilous graph. There are several recent works exploring the behavior of GNNs in heterophilous graphs [33, 48, 9, 36]. Interestingly, we note that our results for graph attention's behavior over the CSBM data model do not depend on whether the graph is homophilous or heterophilous.
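To make the data model concrete, the following sketch samples from the two-class CSBM described above. The helper name sample_csbm and its interface are ours and are meant only for illustration, not as a reference implementation.

```python
import numpy as np

def sample_csbm(n, mu, sigma, p, q, rng=None):
    """Sample (A, X, eps) from the two-class CSBM described above (a sketch).

    mu is the mean of class 1 (class 0 uses -mu), sigma the common standard
    deviation, and p / q the intra-/inter-class edge probabilities.
    """
    rng = np.random.default_rng(rng)
    eps = rng.integers(0, 2, size=n)                      # class labels in {0, 1}
    X = (2 * eps - 1)[:, None] * mu[None, :] + sigma * rng.standard_normal((n, mu.size))
    same_class = eps[:, None] == eps[None, :]
    upper = np.triu(rng.random((n, n)) < np.where(same_class, p, q), k=1)
    A = (upper | upper.T).astype(float)                   # symmetric, no self-loops
    return A, X, eps

# Example: a small homophilous sample (p > q).
A, X, eps = sample_csbm(n=200, mu=np.array([1.0, 1.0]), sigma=0.5, p=0.5, q=0.1, rng=0)
```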
Given node representations $X_j \in \mathbb{R}^d$ for $j \in [n]$, a (spatial) graph convolution for node $i$ has output
$$h_i = \frac{1}{|N_i|} \sum_{j \in N_i} W X_j,$$
where $W$ is a learnable matrix and $N_i$ denotes the neighborhood of node $i$ (which, as defined below, also includes $i$ itself). Throughout this paper, we will refer to this operation as simple graph convolution or standard graph convolution. Our definition of graph convolution is essentially the mean aggregation step in a general GNN layer [22]. The normalization constant in our definition is closely related to the symmetric normalization used in the original Graph Convolutional Network (GCN) introduced by [28]. Our definition does not affect the discussions or implications we have for GCN with symmetric normalization. More broadly, there are other forms of graph convolution in the literature [12, 16, 31], which we do not compare against in this work.
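As a concrete reference point for the definition above, here is a minimal numpy sketch of this mean-aggregation convolution. The learnable matrix is left out (equivalently, set to the identity), and including the node itself in the average follows the neighborhood convention used throughout.

```python
import numpy as np

def simple_graph_convolution(A, X, include_self=True):
    """Mean-aggregation graph convolution (a sketch; W omitted / set to the identity)."""
    if include_self:
        A = A + np.eye(A.shape[0])                        # include the node itself
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / np.clip(deg, 1.0, None)              # average of neighbor features
```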
A single-head graph attention applies some weight function on the edges based on their node features (or a mapping thereof). Given two representations $X_i, X_j \in \mathbb{R}^d$ for two nodes $i, j$, the attention model/mechanism is defined as the mapping $\Psi(X_i, X_j)$,
where $\Psi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ may depend on a learnable matrix $W$. The attention coefficient for a node $i$ and its neighbor $j$ is defined as
$$\gamma_{ij} = \frac{\exp\big(\Psi(X_i, X_j)\big)}{\sum_{k \in N_i} \exp\big(\Psi(X_i, X_k)\big)}, \qquad (1)$$
where $N_i$ is the set of neighbors of node $i$ that also includes node $i$ itself. Let $\phi$ be some element-wise activation function (which is usually nonlinear); the graph attention convolution output for a node $i$ is given by
$$h_i' = \phi\Big( \sum_{j \in N_i} \gamma_{ij}\, W X_j \Big). \qquad (2)$$
A multi-head graph attention [44] uses several weight matrices $W_1, \dots, W_K$ and averages their individual (single-head) outputs. We consider the most simplified case of a single graph attention layer with a single head, where $\Psi$ is realized by an MLP using the LeakyReLU activation function. The LeakyReLU activation function is defined as $\mathrm{LeakyReLU}(x) = x$ if $x \ge 0$ and $\mathrm{LeakyReLU}(x) = \beta x$ for some constant $\beta \in (0,1)$ if $x < 0$.
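The following sketch puts (1) and (2) together for a single attention head: it evaluates an arbitrary attention function psi on all pairs of node features, applies the softmax of (1) over each neighborhood (which includes the self-loop), and aggregates as in (2). The outer nonlinearity is omitted, the matrix W is optional, and all names are ours, so treat this as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax of `scores`, restricted to entries where `mask` is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(scores) * mask
    return expd / expd.sum(axis=1, keepdims=True)

def graph_attention_convolution(A, X, psi, W=None):
    """Single-head graph attention convolution in the spirit of (1)-(2) (a sketch).

    psi(Xi, Xj) returns the raw attention score Psi(X_i, X_j); the coefficients
    gamma_ij are its softmax over the neighborhood N_i, which includes i itself.
    The outer element-wise nonlinearity of (2) is omitted for clarity.
    """
    n = A.shape[0]
    neighborhood = (A + np.eye(n)) > 0                    # N_i includes the self-loop
    scores = np.array([[psi(X[i], X[j]) for j in range(n)] for i in range(n)])
    gamma = masked_softmax(scores, neighborhood)
    H = X if W is None else X @ W
    return gamma @ H                                      # h_i = sum_j gamma_ij (W X_j)
```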
The CSBM induces node features which are correlated through the graph, represented by the adjacency matrix $A$. A natural requirement for attention architectures is to maintain important edges in the graph and ignore unimportant edges. For example, important edges could be the set of intra-class edges and unimportant edges could be the set of inter-class edges. In this case, if graph attention maintains all intra-class edges and ignores all inter-class edges, then a node from a class will be connected only to nodes from its own class. More specifically, a node $i$ will be connected to neighbor nodes whose associated node features come from the same distribution as the node features of $i$. Given two sets and , we denote and .
3 Separation of edges and its implications for graph attention’s ability for perfect node classification
In this section, we study the ability of graph attention to separate important edges from unimportant edges. In addition, we study the implications of graph attention’s behavior for node classification. In particular, we are interested in whether or not the graph attention convolution is able to perfectly classify the nodes in the graph. Depending on the parameter regimes of the CSBM, we separate the results into two parts. In Section 3.1, we study graph attention’s behavior when the distance between the means of the node features is large. We construct a specific attention architecture and demonstrate its ability to separate intra-class edges from inter-class edges. Consequently, we show that the corresponding graph attention convolution is able to perfectly classify the nodes. In Section 3.2, we study graph attention’s behavior when the distance between the means of the node features is small. We show that with high probability no attention architecture is able to separate intra-class edges from inter-class edges. Consequently, we conjecture that no graph attention is able to perfectly classify the nodes. We provide evidence of this conjecture in Section 3.2.1, where we assume the existence of a strong attention mechanism that is able to perfectly classify the edges independently from node features. We show that using this “idealistic” attention mechanism still fails to (almost) perfectly classify the nodes, provided that the distance between the means of the node features is sufficiently small.
The two parameter regimes of the CSBM that we consider in Section 3.1 and Section 3.2 are as follows. The first ("easy regime") is where $\|\mu\| = \omega(\sigma\sqrt{\log n})$, and the second ("hard regime") is where $\|\mu\| \le K\sigma$ for some constant $K > 0$. We start by defining edge separability and node separability.
Definition 1 (Edge separability).
Given an attention model $\Psi$, we say that the model separates the edges if the outputs satisfy $\Psi(X_i, X_j) > \Psi(X_k, X_\ell)$
whenever $(i,j)$ is an intra-class edge, i.e. $\varepsilon_i = \varepsilon_j$, and $(k,\ell)$ is an inter-class edge, i.e. $\varepsilon_k \neq \varepsilon_\ell$.
Definition 2 (Node separability).
Given a classification model which outputs a scalar representation $h_i'$ for node $i$, we say that the model separates the nodes if $h_i' > 0$ whenever $i \in C_1$ and $h_i' < 0$ whenever $i \in C_0$.
Some of our results in this section rely on a mild assumption that lower bounds the sparsity of the graph generated by the CSBM model. This assumption says that the expected degree of a node in the graph is $\Omega(\log^2 n)$. It is required to obtain a concentration of node degrees property used in the proofs. While this assumption may limit a direct application of our results to very sparse graphs, for example, graphs whose node degrees are bounded by a constant, it is mainly an artifact of the proof techniques that we use. We expect that similar results still hold for sparser graphs, but to rigorously remove the assumption on graph density one may have to adopt a different proof technique.
Assumption 1.
$p, q = \Omega(\log^2 n / n)$.
3.1 “Easy Regime”
In this regime we show that a two-layer MLP attention is able to separate the edges with high probability. Consequently, we show that the corresponding graph attention convolution is able to separate the nodes with high probability. At a high level, we transform the problem of classifying an edge $(i,j)$ into the problem of classifying the point $(\tilde w^\top X_i, \tilde w^\top X_j)$ in $\mathbb{R}^2$, where $\tilde w$ is a unit vector that maximizes the total pairwise distances among the four means given below. When we consider the set of points $(\tilde w^\top X_i, \tilde w^\top X_j)$ for all edges $(i,j)$, we can think of each point as a two-dimensional Gaussian vector whose mean is one of the following: $(\|\mu\|, \|\mu\|)$, $(-\|\mu\|, -\|\mu\|)$, $(\|\mu\|, -\|\mu\|)$, $(-\|\mu\|, \|\mu\|)$. The set of intra-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $(\|\mu\|, \|\mu\|)$ or $(-\|\mu\|, -\|\mu\|)$, while the set of inter-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $(\|\mu\|, -\|\mu\|)$ or $(-\|\mu\|, \|\mu\|)$. Therefore, in order to correctly classify the edges, we need to correctly classify the data corresponding to the first two means as positive, and classify the data corresponding to the other two means as negative. This problem is known in the literature as the "XOR problem" [38]. To achieve this we consider a two-layer MLP architecture which separates the first and the third quadrants of the two-dimensional space from the second and the fourth quadrants. In particular, we consider the following specification of $\Psi$,
(3) |
where
(4) |
where is an arbitrary scaling parameter. The particular function has been chosen such that it is able to classify the means of the XOR problem correctly, that is,
To see this, we can take a look at the following simplified expression of $\Psi$, which is obtained by substituting the specifications from (4) into (3),
(5) |
Then, by substituting into (5), one easily verifies that
(6) |
On the other hand, our assumption that $\|\mu\| = \omega(\sigma\sqrt{\log n})$ guarantees that the distance between the means of the XOR problem is much larger than the standard deviation of the Gaussians, and thus with high probability there is no overlap between the distributions. More specifically, one can show that with high probability,
(7) |
This property guarantees that with high probability,
which implies perfect separability of the edges. We state this result below in Theorem 3 and provide detailed arguments in the appendix.
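To make the construction above concrete, the sketch below implements one plausible two-layer LeakyReLU parameterization that solves the XOR problem on the projected features and checks its sign pattern on the four two-dimensional means. The exact weights and constants of (3) and (4) are not reproduced here, so psi_mlp, its scaling parameter r, and the slope beta should be read as illustrative stand-ins rather than the paper's exact specification.

```python
import numpy as np

def leaky_relu(s, beta=0.2):
    return np.where(s >= 0, s, beta * s)

def psi_mlp(Xi, Xj, w_tilde, r=1.0, beta=0.2):
    """Two-layer LeakyReLU attention for the XOR problem (an illustrative sketch).

    The output is positive when w_tilde^T Xi and w_tilde^T Xj have the same sign
    and negative otherwise; r plays the role of the scaling parameter.
    """
    a, b = float(w_tilde @ Xi), float(w_tilde @ Xj)
    return r * (leaky_relu(a + b, beta) + leaky_relu(-a - b, beta)
                - leaky_relu(a - b, beta) - leaky_relu(-a + b, beta))

# Sign pattern on the four 2D means of the XOR problem (cf. the discussion around (6)).
mu = np.array([1.0, 1.0])
w_tilde = mu / np.linalg.norm(mu)
for Xi, Xj in [(mu, mu), (-mu, -mu), (mu, -mu), (-mu, mu)]:
    print(np.sign(psi_mlp(Xi, Xj, w_tilde)))              # +1, +1, -1, -1
```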
Theorem 3.
Our analysis of edge separability by the attention architecture has two important implications. First, edge separability by implies a nice concentration result for the attention coefficients defined in (1). In particular, one can show that the attention coefficients have the desirable property to maintain important edges and drop unimportant edges. Second, the nice distinguishing power of the attention coefficients in turn implies the exact recoverability of the node memberships using the graph attention convolution. We state these two results below in Corollary 4 and Corollary 5, respectively. Denote
that is, if and if , where $\Psi$ is the two-layer MLP attention architecture given in (3) and (4). As will become clear later, the attention architecture allows the attention coefficients to assign more weight to either intra-class or inter-class edges, depending on which of the two cases below holds, and this will help us obtain a perfect separation of the nodes.
Corollary 4.
Suppose that $\|\mu\| = \omega(\sigma\sqrt{\log n})$ and that Assumption 1 holds. Then with probability at least over the data $(A, X)$, the attention architecture $\Psi$ yields attention coefficients $\gamma_{ij}$ such that
1. If , then if is an intra-class edge and otherwise;
2. If , then if is an inter-class edge and otherwise.
Corollary 5.
Corollary 4 shows the desired behavior of the attention mechanism, namely it is able to assign significantly large weights to important edges while it drops unimportant edges. In the first case, the attention mechanism maintains intra-class edges and essentially ignores all inter-class edges; in the second case, it maintains inter-class edges and essentially ignores all intra-class edges. We now explain the high-level idea of the proof and leave detailed arguments to Appendix A.2. Corollary 4 follows from the concentration of $\Psi$ given by (7). Assume for a moment that we are in the first case. Then (7) implies that the value of $\Psi(X_i, X_j)$ when $(i,j)$ is an intra-class edge is exponentially larger than its value when $(i,j)$ is an inter-class edge. Therefore, by the definition of the attention coefficients in (1), the denominator of $\gamma_{ij}$ is dominated by the terms corresponding to neighbors in the same class as $i$. Moreover, using the concentration of node degrees guaranteed by Assumption 1, each node is connected to many intra-class nodes. By appropriately setting the scaling parameter in (4), the values of $\Psi$ for all intra-class edges can be made within a small multiplicative factor of each other. Therefore $\gamma_{ij}$ is of the order of the inverse number of intra-class neighbors of $i$ when $(i,j)$ is an intra-class edge. A similar reasoning applies to inter-class edges and yields the vanishing value of $\gamma_{ij}$ when $(i,j)$ is an inter-class edge. Finally, the argument for the second case follows analogously.
The concentration result for the attention coefficients in Corollary 4 implies the node classification result in Corollary 5, which holds for any $p, q$ satisfying Assumption 1. That is, even when the graph structure is noisy, e.g., when $p = q$, it is still possible to exactly recover the node class memberships. Essentially, by applying Corollary 4 and carrying out some straightforward algebraic simplifications, one can show that with high probability for all $i \in [n]$,
and hence the model separates the nodes with high probability. We provide details in the appendix.
While Corollary 5 provides a positive result for graph attention, it can be shown that a simple linear classifier which does not use the graph at all achieves perfect node separability with high probability. In particular, the Bayes optimal classifier for the node features without the graph is able to separate the nodes with high probability. This means that in this regime, using the additional graph information is unnecessary, as it does not provide additional power compared to a simple linear classifier for the node classification task.
Lemma 6 (Section 6.4 in [3]).
Let $X \sim N((2\varepsilon - 1)\mu, \sigma^2 I)$ where $\varepsilon \sim \mathrm{Ber}(1/2)$. Then the optimal Bayes classifier for $X$ is realized by the linear classifier
$$\hat{\varepsilon}(x) = \mathbb{1}\big[\langle x, \mu \rangle \ge 0\big]. \qquad (8)$$
Proposition 7.
Suppose $\|\mu\| = \omega(\sigma\sqrt{\log n})$. Then with probability at least $1 - o(1)$ over the data $(A, X)$, the linear classifier given in (8) separates the nodes.
The proof of Proposition 7 is elementary. To see the claim, one may show that the probability that the classifier in (8) misclassifies a given node is $o(1/n)$. To do this, let us fix $i \in [n]$ and write $X_i = (2\varepsilon_i - 1)\mu + \sigma g$ where $g \sim N(0, I)$. Assume for a moment that $\varepsilon_i = 1$. Then the probability of misclassification is
$$\Pr\big[\langle X_i, \mu \rangle \le 0\big] = \Pr\big[\sigma \langle g, \mu \rangle \le -\|\mu\|^2\big] = \Phi\big(-\|\mu\|/\sigma\big),$$
where $\Phi$ is the cumulative distribution function of $N(0,1)$ and the last equality follows from the fact that $\langle g, \mu \rangle / \|\mu\| \sim N(0,1)$. The assumption that $\|\mu\| = \omega(\sigma\sqrt{\log n})$ implies $\|\mu\|^2/(2\sigma^2) = \omega(\log n)$ for large enough $n$. Therefore, using standard tail bounds for the normal distribution [45] we have that
$$\Phi\big(-\|\mu\|/\sigma\big) \le \exp\big(-\|\mu\|^2/(2\sigma^2)\big) = n^{-\omega(1)} = o(1/n).$$
This means that the probability that there exists $i \in C_1$ which is misclassified is at most $n \cdot o(1/n) = o(1)$. A similar argument can be applied to the case where $\varepsilon_i = 0$, and an application of a union bound on the events that there is some $i \in [n]$ which is misclassified finishes the proof of Proposition 7.
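The misclassification probability computed above is easy to verify numerically. The following sketch uses illustrative values of mu and sigma (they are not tied to the assumption of Proposition 7) and compares the empirical error of the rule sign(<x, mu>) with Phi(-||mu||/sigma).

```python
import math
import numpy as np

mu, sigma = np.array([1.0, 1.0]), 0.5
rng = np.random.default_rng(0)
X = mu + sigma * rng.standard_normal((200_000, 2))        # samples from the +mu class
empirical = np.mean(X @ mu < 0)                           # fraction misclassified
theoretical = 0.5 * math.erfc((np.linalg.norm(mu) / sigma) / math.sqrt(2))  # Phi(-||mu||/sigma)
print(empirical, theoretical)                             # both around 0.0023
```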
3.2 “Hard Regime”
In this regime ($\|\mu\| \le K\sigma$ for some constant $K > 0$), we show that every attention architecture fails to separate the edges unless the inter-class edge probability $q$ is sufficiently small, and we conjecture that no graph attention convolution is able to separate the nodes. The conjecture is based on our node separability result in Section 3.2.1, which says that, even if we assume that there is an attention architecture which is able to separate the edges independently from node features, the corresponding graph attention convolution still fails to (almost) perfectly classify the nodes with high probability.
The goal of the attention mechanism is to decide whether an edge is an inter-class edge or an intra-class edge based on the node feature vectors and . Let denote the vector obtained from concatenating and , that is,
(9) |
We would like to analyze every classifier which takes as input and tries to separate inter-class edges and intra-class edges. An ideal classifier would have the property
(10) |
To understand the limitations of all such classifiers in this regime, it suffices to consider the Bayes optimal classifier for this data model, whose probability of misclassifying an arbitrary edge lower bounds that of every attention architecture which takes the concatenated features as input. Consequently, by deriving a misclassification rate for the Bayes classifier, we obtain a lower bound on the misclassification rate of every attention mechanism for classifying intra-class and inter-class edges. The following Lemma 8 describes the Bayes optimal classifier for this classification task.
Lemma 8.
Let and let be defined as in (9). The Bayes optimal classifier for is realized by the following function,
(11) |
where and .
Using Lemma 8, we can lower bound the rate of misclassification of edges that every attention mechanism exhibits. Below we define , where is the cumulative distribution function of .
Theorem 9.
Suppose that $\|\mu\| \le K\sigma$ for some constant $K > 0$ and let $\Psi$ be any attention mechanism. Then,
1. With probability at least , $\Psi$ fails to correctly classify at least a fraction of the inter-class edges;
2. For any , if , then with probability at least , $\Psi$ misclassifies at least one inter-class edge.
Part 1 of Theorem 9 implies that if $\|\mu\|$ is linear in the standard deviation $\sigma$, that is if $\|\mu\| = O(\sigma)$, then with overwhelming probability the attention mechanism fails to distinguish a constant fraction of inter-class edges from intra-class edges. Furthermore, part 2 of Theorem 9 characterizes a regime for the inter-class edge probability $q$ where the attention mechanism fails to distinguish at least one inter-class edge. It provides a lower bound on $q$ in terms of the scale at which the distance between the means grows compared to the standard deviation $\sigma$. This aligns with the intuition that as we increase the distance between the means, it gets easier for the attention mechanism to correctly distinguish inter-class and intra-class edges. However, if $q$ is also increased in the right proportion, in other words, if the noise in the graph is increased, then the attention mechanism will still fail to correctly distinguish at least one inter-class edge. For instance, for and , we get that if , then with probability at least , $\Psi$ misclassifies at least one inter-class edge.
The proof of Theorem 9 relies on analyzing the behavior of the Bayes optimal classifier in (11). We compute an upper bound on the probability with which the optimal classifier correctly classifies an arbitrary inter-class edge. Then the proof of part 1 of Theorem 9 follows from a concentration argument for the fraction of inter-class edges that are misclassified by the optimal classifier. For part 2, we use a similar concentration argument to choose a suitable threshold for that forces the optimal classifier to fail on at least one inter-class edge. We provide formal arguments in the appendix.
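The constant-fraction failure in part 1 can also be seen with a short Monte Carlo computation on the projected features. The sketch below assumes equal priors on intra-class and inter-class edges (roughly the case p = q); under that assumption the likelihood-ratio rule reduces to comparing |a + b| with |a - b| for the projections a, b of the two endpoints onto the mean direction. This reduction is our own derivation from the Gaussian mixture densities and is not quoted from Lemma 8.

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu_norm, sigma = 200_000, 1.0, 1.0                     # hard regime: ||mu|| comparable to sigma

s = rng.choice([-1.0, 1.0], size=m)                       # hidden class sign of the first endpoint
noise = lambda: sigma * rng.standard_normal(m)
a_intra, b_intra = s * mu_norm + noise(), s * mu_norm + noise()    # intra-class edge projections
a_inter, b_inter = s * mu_norm + noise(), -s * mu_norm + noise()   # inter-class edge projections

# Equal-prior Bayes rule: declare an edge "intra-class" iff |a + b| >= |a - b|
err_intra = np.mean(np.abs(a_intra + b_intra) < np.abs(a_intra - b_intra))
err_inter = np.mean(np.abs(a_inter + b_inter) >= np.abs(a_inter - b_inter))
print(err_intra, err_inter)                               # both are constant fractions (~0.27 here)
```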
As a motivating example of how the attention mechanism fails and how exactly the attention coefficients behave in this regime, we focus on one of the most popular attention architectures [44], where $\Psi$ is a single-layer neural network with LeakyReLU activation. Namely, the attention coefficients are defined by
(12) |
We show that, as a consequence of the inability of the attention mechanism to distinguish intra-class and inter-class edges, with overwhelming probability most of the attention coefficients given by (12) are going to be close to uniform. In particular, Theorem 10 says that for the vast majority of nodes in the graph, the attention coefficients on most edges are uniform irrespective of whether the edge is inter-class or intra-class. As a result, the attention mechanism is unable to assign higher weights to important edges and lower weights to unimportant edges.
Theorem 10.
Assume that and for some absolute constants and . Moreover, assume that the parameters are bounded. Then, with probability at least over the data , there exists a subset with cardinality at least such that for all the following hold:
1. There is a subset with cardinality at least , such that for all .
2. There is a subset with cardinality at least , such that for all .
Theorem 10 is proved by carefully computing the numerator and the denominator in (12). In this regime, $\|\mu\|$ is not much larger than $\sigma$, that is, the signal does not dominate the noise, so the numerator in (12) is not indicative of the class memberships of nodes but rather acts like Gaussian noise. On the other hand, the denominator in (12) is the same for all neighbors $j \in N_i$ of a fixed node $i$. Using concentration arguments for this denominator finishes the proof. We provide details in the appendix.
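The near-uniform behavior of the coefficients can be observed directly in a small simulation. The sketch below uses ||mu|| = sigma with a small sigma and a bounded random parameter vector, one setting in the spirit of the hard regime; the precise assumptions of Theorem 10 involve additional constants that are omitted here.

```python
import numpy as np

def leaky_relu(x, beta=0.2):
    return np.where(x >= 0, x, beta * x)

rng = np.random.default_rng(1)
n, d, p, q, sigma = 2000, 8, 0.20, 0.15, 0.02
mu = np.full(d, sigma / np.sqrt(d))                       # ||mu|| = sigma (hard regime)

eps = rng.integers(0, 2, size=n)
X = (2 * eps - 1)[:, None] * mu + sigma * rng.standard_normal((n, d))
same = eps[:, None] == eps[None, :]
A = np.triu(rng.random((n, n)) < np.where(same, p, q), 1)
A = A | A.T | np.eye(n, dtype=bool)                       # neighborhoods include self-loops

w = rng.standard_normal(2 * d)
w /= np.linalg.norm(w)                                    # bounded single-layer parameters
scores = leaky_relu((X @ w[:d])[:, None] + (X @ w[d:])[None, :])   # LeakyReLU(w^T [X_i ; X_j])
expd = np.where(A, np.exp(scores), 0.0)
gamma = expd / expd.sum(axis=1, keepdims=True)            # attention coefficients as in (12)

ratio = gamma * A.sum(axis=1, keepdims=True)              # gamma_ij relative to uniform 1/|N_i|
print(np.percentile(ratio[A], [1, 50, 99]))               # all values close to 1: near-uniform
```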
Compared to the easy regime, it is difficult to obtain a separation result for the nodes without additional assumptions. In the easy regime, the distance between the means was much larger than the standard deviation, which made the “signal” (the expectation of the convolved data) dominate the “noise” (i.e., the variance of the convolved data). In the hard regime the “noise” dominates the “signal”. Thus, we conjecture the following.
Conjecture 11.
There is an absolute constant such that, whenever , every graph attention model fails to perfectly classify the nodes with high probability.
The above conjecture means that in the hard regime, the performance of the graph attention model depends on as opposed to the easy regime, where in Corollary 5 we show that it doesn’t. This property is verified by our synthetic experiments in Section 5. The quantity in the threshold comes from our conjecture that the expected maximum “noise” of the graph attention convolved data over the nodes is at least for some constant . The quantity in the threshold comes from our conjecture that the distance between the means (i.e. “signal”) of the graph attention convolved data is reduced to at most of the original distance. Proving Conjecture 11 would require delicate treatment of the correlations between the attention coefficients and the node features for .
3.2.1 Are good attention coefficients helpful in the “hard regime”?
In this subsection we are interested in understanding the implications of edge separability for node separability in the hard regime, when $\Psi$ is restricted to a specific class of functions. In particular, we show that Conjecture 11 is true under the additional assumption that $\Psi$ does not depend on the node features. In addition, we show that even if we were allowed to use an "extremely good" attention function which separates the edges with an arbitrarily large margin, with high probability the graph attention convolution (2) will still misclassify at least one node as long as $\|\mu\|$ is sufficiently small.
We consider the class of functions which can be expressed in the following form:
(13) |
for some . The particular class of functions in (13) is motivated by the property of the ideal edge classifier in (10) and the behavior of in (6) when it is applied to the means of the Gaussians. There are a few possible ways to obtain a function which satisfies (13). For example, in the presence of good edge features which reflect the class memberships of the edges, we can make take as input the edge features. Moreover, if , one such may be easily realized from the eigenvectors of the graph adjacency matrix. By the exact spectral recovery result in Lemma 12, we know that there exists a classifier which separates the nodes. Therefore, we can set if and otherwise.
Lemma 12 (Exact recovery in [1]).
Suppose that and . Then there exists a classifier that takes the graph as input and perfectly classifies the nodes with probability at least .
Theorem 13.
Suppose that satisfy Assumption 1 and that are bounded away from 1. For every , there are absolute constants such that, with probability at least over the data , using the graph attention convolution in (2) and the attention architecture in (13), the model misclassifies at least nodes for any such that , if
1. and ;
2. and .
Theorem 13 warrants some discussion. We start with the role of in the attention function (13). One may think of as the multiplicative margin of separation between intra-class and inter-class edges. When , the margin of separation is at most a constant. This includes the special case when for all , i.e., the margin of separation is 0. In this case, the graph attention convolution in (2) reduces to the standard graph convolution with uniform averaging among the neighbors. Therefore, part 1 of Theorem 13 also applies to the standard graph convolution. On the other hand, when , the margin of separation is not only bounded away from but also grows with .
Next, we discuss the additional assumption that $p$ and $q$ are bounded away from 1. This assumption is used to obtain a concentration result required for the proof of Theorem 13. It is also intuitive in the following sense. If both $p$ and $q$ are arbitrarily close to 1, then after the convolution the convolved node feature vectors collapse to approximately a single point, and thus this becomes a trivial case where no classifier is able to separate the nodes; on the other hand, if $p$ is arbitrarily close to 1 and $q$ is very small, then after the convolution the convolved node feature vectors collapse to approximately one of two points according to which class the node comes from, and in this case the nodes can be easily separated by a linear classifier.
We now focus on the threshold for $\|\mu\|$ under which the model is going to misclassify at least one node with high probability. In part 1 of Theorem 13, the margin of separation is at most a constant, i.e., the attention mechanism is either unable to separate the edges or unable to separate the edges with a large enough margin. In this case, one can show that all attention coefficients are of the same order. Consequently, the gap between $p$ and $q$ appears in the denominator of the threshold for $\|\mu\|$ in part 1 of Theorem 13. Because of that, if $p$ and $q$ are arbitrarily close, then the model is not able to separate the nodes irrespective of how large $\|\mu\|$ is. For example, treating the factors involving $1-p$ and $1-q$ as constants, since $p$ and $q$ are bounded away from 1 by assumption, we have that
This means that if $p$ and $q$ are close enough, every attention function in the form of (13) with a constant margin cannot help classify all nodes correctly, no matter how large $\|\mu\|$ is. On the contrary, recall that in the easy regime where $\|\mu\| = \omega(\sigma\sqrt{\log n})$, the attention mechanism given in (3) and (4) helps separate the nodes with high probability. This illustrates the limitation of every attention mechanism in the form of (13) which has an insignificant margin of separation. According to Theorem 10, the vast majority of attention coefficients are uniform, and thus in Conjecture 11 we expect that graph attention in general shares similar limitations in the hard regime.
In part 2 of Theorem 13, the attention mechanism separates the edges with a large margin. In this case, one can show that the attention coefficients on important edges (e.g. intra-class edges) are exponentially larger than those on unimportant edges (e.g. inter-class edges). Consequently, the gap between $p$ and $q$ no longer appears in the threshold for $\|\mu\|$ in part 2 of Theorem 13. However, at the same time, the threshold also implies that, even when we have a perfect attention mechanism that is able to separate the edges with a large margin, as long as $\|\mu\|$ is small enough, the model is going to misclassify at least one node with high probability.
Finally, notice that in Theorem 13 the parameter captures a natural tradeoff between the threshold for $\|\mu\|$ and the lower bound on the number of misclassified nodes. Namely, the smaller this parameter is, the smaller the threshold for $\|\mu\|$ becomes, and hence the less signal there is in the node features, the more mistakes the model is going to make. This is precisely demonstrated by the scaling of the threshold and of the misclassification bound with respect to this parameter. We leave the proof of Theorem 13 to the appendix.
4 Robustness to structural noise and its implications beyond perfect node classification
In this section, we provide a positive result on the capacity of graph attention convolution for node classification beyond the perfect classification regime. In particular, we show that independent of the parameters of the CSBM, i.e., independent of $p$, $q$, $\mu$, and $\sigma$, the two-layer MLP attention architecture from Section 3.1 is able to achieve the classification performance obtainable by the Bayes optimal classifier for the node features. This is proved by showing that there is a parameter setting for the attention architecture under which the attention coefficients on self-loops can be made exponentially large. Consequently, the corresponding graph attention convolution behaves like a linear function of the node features. We provide two corollaries of this result. The first corollary provides a ranking of graph attention convolution, simple graph convolution, and a linear function in terms of their ability to classify nodes in the CSBM. More specifically, by noticing that simple graph convolution is also realized by a specific parameter setting of the attention architecture, our result implies that the performance of graph attention convolution for node classification in the CSBM is lower bounded by the best possible performance between a linear classifier and simple graph convolution. In particular, graph attention is strictly more powerful than simple graph convolution when the graph is noisy (e.g. when $p$ is close to $q$), and it is strictly more powerful than a linear classifier when the graph is less noisy (e.g. when $p$ and $q$ are significantly different). The second corollary provides perfect classification, almost perfect classification, and partial classification results for graph attention convolution. It follows immediately from the reduction of graph attention convolution to a linear function under the specification of the attention architecture that we will discuss. In what follows we start with high-level ideas, then we present formal statements of the results, and we discuss the implications in detail.
Recall that the two-layer MLP attention architecture from (3) and (4) is equivalently written in (5) as
We make the following observations, assuming that the scaling parameter in (4) is positive:
• If $\tilde w^\top X_i > 0$, then the function assigns more weight to an edge $(i,j)$ such that $\tilde w^\top X_j > 0$ than to an edge $(i,j')$ such that $\tilde w^\top X_{j'} < 0$;
• If $\tilde w^\top X_i < 0$, then the function assigns more weight to an edge $(i,j)$ such that $\tilde w^\top X_j < 0$ than to an edge $(i,j')$ such that $\tilde w^\top X_{j'} > 0$;
• If $\tilde w^\top X_i = 0$, then the function assigns uniform weight to every edge $(i,j)$.
This means that the behavior of the attention function $\Psi(X_i, \cdot)$ depends on which side of the hyperplane $\{x : \tilde w^\top x = 0\}$ the feature $X_i$ comes from. In other words, for a fixed node $i$, whether the attention function will up-weight or down-weight an edge $(i,j)$ depends entirely on the classification of $X_j$ by the linear classifier $x \mapsto \mathrm{sign}(\tilde w^\top x)$. Moreover, note that the attention function value satisfies
Therefore, out of all unit norm vectors, our choice of $\tilde w$ maximizes the range of values that $\Psi$ can output. Recall from Lemma 6 that $\tilde w$ also happens to characterize the Bayes optimal classifier for the node features. Finally, the attention function achieves its minimum/maximum at self-attention, i.e.
A consequence of these observations is that, by setting the scaling parameter sufficiently large, one can make exponentially larger than for any such that and . According to the definition of attention coefficients in (1), this means that the attention coefficients where are going to be exponentially larger than the attention coefficients where . Therefore, one could expect that in this case, if the linear classifier correctly classifies for some , e.g., for this means that , then graph attention convolution output also satisfies , due to sufficiently larger attention coefficients for which . We state the result below in Theorem 14 and leave detailed arguments in the appendix.
Theorem 14.
Theorem 14 means that there is a parameter setting for the attention architecture such that the performance of graph attention convolution matches the performance of the Bayes optimal classifier for the node features. This shows the ability of graph attention to "ignore" the graph structure, which can be beneficial when the graph is noisy. For example, if $p = q$, then it is easy to see that simple graph convolution completely mixes up the node features, making it impossible to achieve any meaningful node classification result. On the other hand, as long as there is some signal in the original node features, i.e. $\mu \neq 0$, graph attention will be able to pick that up and classify the nodes at least as well as the best classifier for the node features alone. It is also worth noting that by setting the scaling parameter in (4) to zero, the attention function has a constant value, and hence graph attention convolution reduces to the standard graph convolution, which has been shown to be useful in the regime where the original node features are not very strong but the graph has a nice structure [7]. For example, when there is a significant gap between $p$ and $q$, then setting the scaling parameter to zero could significantly improve the node separability threshold over the best linear classifier [7]. This shows the robustness of graph attention against noise in one of the two sources of information, namely node features and edge connectivities. Unlike the Bayes optimal classifier for node features, which is sensitive to feature noise, or the simple graph convolution, which is sensitive to structural noise, one can expect graph attention to work as long as at least one of the two sources of information has a good signal. Therefore, we obtain a ranking of node classification models among graph attention convolution, simple graph convolution, and a linear classifier. We state this below in Corollary 15.
Corollary 15.
The node classification performance obtainable by graph attention convolution is lower bounded by the best possible node classification performance between a simple graph convolution and a linear classifier.
By a straightforward characterization of the performance of the linear classifier in (8), we immediately obtain the following classification results in Corollary 16. Recall that we denote by $\Phi$ the cumulative distribution function of the standard normal distribution. Write
Corollary 16.
With probability at least over the data , using the two-layer MLP attention architecture given in (3) and (4) with , one has that
• (Perfect classification) If then all nodes are correctly classified;
• (Almost perfect classification) If then at least fraction of all nodes are correctly classified;
• (Partial classification) If then at least fraction of all nodes are correctly classified.
Interestingly, we note that the perfect classification result in Corollary 16 is nearly equivalent (up to a small order in the threshold ) to the perfect classification result in Corollary 5 from Section 3.1. They are obtained from different behaviors of the attention coefficients. This shows that there could be more than one type of “good” attention coefficients that are able to deliver good node classification performance.
5 Experiments
In this section, we demonstrate empirically our results on synthetic and real data. The parameters of the models that we experiment with are set by using an ansatz based on our theorems. The particular details are given in Section 5.1. For real datasets, we use the default splits which come from PyTorch Geometric [19] and OGB [25]. In all our experiments we use MLP-GAT, where the attention mechanism is set to be the two-layer network in (3) and (4). For synthetic experiments using the CSBM with known parameters, we use the variant of the ansatz that takes them into account. In Figure 2(b) and Figure 3(b) we additionally consider the original GAT architecture of [44] to demonstrate Theorem 10.
5.1 Ansatz for GAT, MLP-GAT and GCN
For the original GAT architecture we fix and define the first head as and ; the second head is defined as and . We now discuss the choice of such an ansatz. The parameter is picked based on the optimal Bayes classifier without a graph, and the attention is set such that the first head maintains intra-class edges in and the second head maintains intra-class edges in . Note that for the original GAT [44], due to the fact that the attention mechanism consists of just one layer (i.e. a nonlinear activation applied on a linear transformation, see (12)), it is not possible for the original GAT to keep only the attention coefficients which correspond to intra-class edges. More specifically, one may use the same techniques as in the proof of Theorem 3 and Corollaries 4 and 5 to prove node separability results for the original GAT. In this particular case, the result will depend on the graph parameters, in contrast to the result we get for MLP-GAT, where no such dependence was needed. For MLP-GAT we use the ansatz where the attention is given by (3) and (4) with . This choice of two-layer network allows us to bypass the "XOR problem" [38] and separate inter-class from intra-class edges as shown in Theorem 3. Note that no single-layer architecture is able to separate the edges, due to the "XOR problem". For GCN we used the ansatz from [7], which is also .
5.2 Synthetic data
We use the CSBM to generate the data. In a recent work [40], a simple variant of the CSBM was also chosen as the default generative model for benchmarking GNN performance on node classification tasks. We present two sets of experiments. In the first set, we fix the distance between the means and vary $q$, and in the second set, we fix $p$ and $q$ and vary the distance. We set , , and . Results are averaged over trials.
5.2.1 Fixing the distance between the means and varying $q$
We consider the two regimes separately, where for the "easy regime" we fix the mean to be a vector where each coordinate is equal to . This guarantees that the distance between the means is . In the "hard regime" we fix the mean to be a vector where each coordinate is equal to , and this guarantees that the distance is . We fix $p$ and vary $q$ from to .
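For concreteness, the following sketch reproduces the flavor of this experiment, reusing the hypothetical helpers sample_csbm, psi_mlp, and graph_attention_convolution sketched in Sections 2 and 3.1. The values of n, d, p, the scaling r, and the grid of q below are illustrative and are not the paper's exact settings.

```python
import numpy as np

n, d, sigma, p = 300, 4, 1.0, 0.5
mu = np.full(d, 10 * sigma * np.sqrt(np.log(n) / d))      # "easy" distance between the means
w_tilde = mu / np.linalg.norm(mu)

for q in [0.1, 0.3, 0.5]:
    A, X, eps = sample_csbm(n, mu, sigma, p, q, rng=0)
    psi = lambda xi, xj: psi_mlp(xi, xj, w_tilde, r=10.0)
    H = graph_attention_convolution(A, X, psi)
    pred = (H @ w_tilde > 0).astype(int)                  # classify by the sign of the projection
    print(q, np.mean(pred == eps))                        # accuracy stays ~1, independent of q
```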
In Figure 1 we illustrate Theorem 3 and Corollaries 4 and 5 for the easy regime, and in Figure 2 we illustrate Theorem 9 and Theorem 10 for the hard regime. In particular, in Figure 1(a) we show Theorem 3: MLP-GAT is able to classify intra-class and inter-class edges perfectly. In Figure 1(b) we show that in the easy regime, as claimed in Corollary 4 for MLP-GAT, in one case the attention coefficients $\gamma$ that correspond to intra-class edges concentrate around a common value, while the $\gamma$ for the inter-class edges concentrate to tiny values; in the other case we see the opposite, that is, the $\gamma$ that correspond to intra-class edges concentrate to tiny values, while the $\gamma$ for the inter-class edges concentrate around a common value. In Figure 1(c) we observe that the performance of MLP-GAT for node classification is independent of $q$ in the easy regime, as claimed in Corollary 5. However, in this plot, we observe that not using the graph also achieves perfect node classification, a result which is proved in Proposition 7. In the same plot, we also show the performance of simple graph convolution, whose performance depends on $p$ and $q$ (see [7]). In Figure 2(a) we show Theorem 9: MLP-GAT misclassifies a constant fraction of the intra-class and inter-class edges, as proved in Theorem 9. In Figure 2(b) we show Theorem 10, where the attention coefficients $\gamma$ in the hard regime concentrate around the uniform (GCN) coefficients for both MLP-GAT and GAT. In Figure 2(c) we illustrate that node classification accuracy is a function of $q$ for MLP-GAT, as conjectured in Conjecture 11. On the other hand, note that the performance of MLP-GAT is lower bounded by the performance of not using the graph, as proved in Theorem 14.
[Figure 1: easy regime. (a) Edge classification by MLP-GAT; (b) attention coefficients for intra-class and inter-class edges; (c) node classification accuracy of MLP-GAT, simple graph convolution, and the classifier that does not use the graph, as functions of $q$.]
[Figure 2: hard regime. (a) Edge classification by MLP-GAT; (b) attention coefficients of MLP-GAT and GAT against the uniform (GCN) coefficients; (c) node classification accuracy, as functions of $q$.]
5.2.2 Fixing $p$ and $q$ and varying the distance between the means
We consider the case where . In Figure 3 we show how the attention coefficients of MLP-GAT and GAT and the node and edge classification depend on the distance between the means. We also add a vertical line to approximately separate the easy (left of the line) and hard (right of the line) regimes. Figure 3(c) illustrates Theorems 3 and 9 in the easy and hard regimes, respectively. In particular, we observe that in the hard regime, MLP-GAT fails to distinguish intra-class from inter-class edges, while in the easy regime it is able to do so perfectly for a large enough distance between the means.
In Figure 3(a) we observe that in the hard regime the attention coefficients $\gamma$ concentrate around the uniform (GCN) coefficients, while in the easy regime MLP-GAT is able to maintain the $\gamma$ for intra-class edges, while it sets the $\gamma$ to tiny values for inter-class edges. In Figure 3(b), we observe that in the hard regime the $\gamma$ of GAT concentrate around the uniform coefficients (proved in Theorem 10), while in the easy regime, although the $\gamma$ concentrate, GAT is not able to distinguish intra-class from inter-class edges. This makes sense since the separation of edges can't be done by simple linear classifiers such as the one used by GAT; see the discussion below Theorem 10. Finally, in Figure 3(d) we show node classification results for MLP-GAT. In the easy regime, we observe perfect classification as proved in Corollary 5. However, as the distance between the means decreases, we observe that MLP-GAT starts to misclassify nodes.
[Figure 3: fixed $p$ and $q$. (a) Attention coefficients of MLP-GAT; (b) attention coefficients of GAT; (c) edge classification; (d) node classification, as functions of the distance between the means.]
5.3 Real data
For the experiment on real data, we illustrate the attention coefficients, node classification, and edge classification for MLP-GAT as functions of the distance between the means. We use the popular real-world graph datasets Cora, PubMed, and CiteSeer collected by PyTorch Geometric [19] and ogbn-arxiv from the Open Graph Benchmark [25]. The datasets come with multiple classes; however, for each of our experiments, we do a one-vs-all classification for a single class. This is a semi-supervised problem: only a fraction of the nodes have labels and are used for training, and the rest of the nodes are used for measuring prediction accuracy. To control the distance between the means we use the true labels to determine the class of each node and then we compute the empirical mean for each class. We subtract the empirical means from their corresponding classes and we then add the means $\mu$ and $-\mu$ to the two classes, respectively. This modification can be thought of as translating the mean of the distribution of the data for each class.
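A minimal sketch of this translation step is given below. The helper name and signature are ours, and the assignment of +mu to the positive class and -mu to the remaining nodes mirrors the CSBM setup and is an assumption made only for illustration.

```python
import numpy as np

def translate_class_means(X, y, mu):
    """Recenter each class of a one-vs-all split at +mu / -mu (an illustrative sketch).

    X: node features, y: binary one-vs-all labels, mu: target mean for class 1.
    """
    X = np.asarray(X, dtype=float).copy()
    for cls, target in [(1, mu), (0, -mu)]:
        idx = (y == cls)
        X[idx] = X[idx] - X[idx].mean(axis=0) + target    # subtract empirical mean, add target
    return X
```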
The results of this experiment are shown in Figure 4. For Cora, PubMed, and CiteSeer we show results for class 0, since these are small datasets, each dataset contains at most 7 classes, and the classes have similar sizes. In our experiments on other classes, we observed similar results. For ogbn-arxiv we show results for the largest class (i.e. class 16), since this is a larger dataset which has 40 imbalanced classes, and picking a large class makes the one-vs-all classification task more balanced. In our experiments on other classes of similar size, we obtained similar results. We note that on the real data we observe behavior of MLP-GAT in the easy and hard regimes similar to that on the synthetic data. In particular, for all datasets, as the distance between the means increases, MLP-GAT is able to accurately classify intra-class and inter-class edges, see Figures 4(a), 4(d) and 4(g). Moreover, as the distance between the means increases, the average intra-class $\gamma$ becomes much larger than the average inter-class $\gamma$, see Figures 4(b), 4(e), 4(h), and 4(k), and the model is able to classify the nodes accurately, see Figures 4(c), 4(f), 4(i), and 4(l). On the contrary, in the same figures, we observe that as the distance between the means decreases, MLP-GAT is not able to separate intra-class from inter-class edges, the averaged $\gamma$ are very close to the uniform coefficients, and the model can't classify the nodes accurately.
[Figure 4: real data. For each of Cora, PubMed, CiteSeer, and ogbn-arxiv: edge classification, average attention coefficients for intra-class and inter-class edges, and node classification accuracy, as functions of the distance between the means.]
Note that Figure 4 does not show the standard deviation of the attention coefficients $\gamma$. We show the standard deviation of $\gamma$ in Figure 5. We observe that the standard deviation is higher than what we observed on the synthetic data; in particular, it can be more than half of the averaged $\gamma$. This is to be expected since for the real data the degrees of the nodes do not concentrate as well. In Figure 5 we show that the standard deviation of the uniform coefficients is also high. For Cora, PubMed, and CiteSeer, the standard deviation for the intra-class $\gamma$ is similar to that of the uniform coefficients, while the deviation for the inter-class $\gamma$ is large for a small distance between the means, but it gets much smaller as the distance increases. For ogbn-arxiv, the standard deviation of the uniform coefficients is particularly high. This implies that the degree distribution of the nodes of ogbn-arxiv has a heavy tail, which could potentially result from the graph structure being noisier than in other datasets, and it could also explain GCN's relatively much worse performance when the distance between the means is large.
[Figure 5: standard deviation of the attention coefficients $\gamma$ and of the uniform coefficients on Cora, PubMed, CiteSeer, and ogbn-arxiv.]
6 Summary of implications for practitioners
While this work focuses on theoretical understanding of graph attention’s capability for edge and node classification using the CSBM generative model, our findings yield a series of potential suggestions for GNN practitioners. In this section we provide some interesting practical implications of our results.
6.1 Why graph attention? Benefits of graph attention’s robustness to structural noise
When the graph is very noisy, e.g. when a node has a similar number of neighbors from the same class and from different classes, simple graph convolution will mix up node features and thus make nodes from different classes indistinguishable. In this case, simple graph convolution can do more harm than good. However, in practice, it is difficult to determine how noisy the graph is, or if the graph is even useful at all. This could pose a challenge in choosing an architecture. Our results in Theorem 14 and Corollary 15 imply that graph attention has the ability to dramatically reduce the impact of a noisy graph, in a way such that the output is at least as good as the output from the best linear classifier (on the input) which does not rely on the graph. This shows that graph attention convolution is more robust against structural noise in the graph, and hence on noisy graphs, it is strictly better than simple graph convolution.
6.2 Which attention architecture? Benefits of multi-layer attention architecture
In this work, we are able to obtain positive results for graph attention by using the two-layer MLP attention architecture in (3). This is different from the original GAT which uses a single-layer attention architecture [44]. In our analyses and empirical experiments, we found that the original single-layer attention does not have the important properties required for obtaining positive results (e.g. Theorem 3, Corollary 5, Theorem 14, Corollary 16) for graph attention convolution. Coincidentally, this aligns with the findings of [11], where the authors study limitations of the original GAT architecture from a different perspective than ours, and they propose a new architecture termed GATv2. The two-layer MLP attention architecture that we consider in (3) encompasses GATv2 as a special case. Therefore, the two-layer MLP attention architecture can be a good candidate to consider when practitioners search for a suitable graph attention architecture for their specific downstream tasks. On the other hand, our results in Section 3.2 imply that even the two-layer MLP attention architecture (and hence GATv2) has limitations when the node features are noisy. To fix that, a potential solution is to incorporate additional information such as edge features, which we discuss next.
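To make the architectural distinction concrete, here is a minimal PyTorch sketch (module and variable names are ours; the exact two-layer MLP parametrization in (3) is not reproduced) contrasting the single-layer scoring of the original GAT [44] with a two-layer MLP scoring of which GATv2 [11] is a special case. In both architectures the scores are subsequently normalized with a softmax over each node's neighborhood to produce the attention coefficients.

import torch
import torch.nn as nn

class SingleLayerAttention(nn.Module):
    # Scoring in the spirit of the original GAT [44]: e_ij = LeakyReLU(a^T [W h_i || W h_j]).
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, hid_dim, bias=False)
        self.a = nn.Linear(2 * hid_dim, 1, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, h_i, h_j):
        return self.act(self.a(torch.cat([self.W(h_i), self.W(h_j)], dim=-1)))

class TwoLayerMLPAttention(nn.Module):
    # Two-layer MLP scoring: e_ij = a^T sigma(W [h_i || h_j]);
    # GATv2 [11] corresponds to sigma = LeakyReLU.
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, hid_dim, bias=False)
        self.a = nn.Linear(hid_dim, 1, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, h_i, h_j):
        return self.a(self.act(self.W(torch.cat([h_i, h_j], dim=-1))))

# Toy usage: attention scores for a single pair of 8-dimensional node features.
h_i, h_j = torch.randn(8), torch.randn(8)
print(SingleLayerAttention(8, 16)(h_i, h_j), TwoLayerMLPAttention(8, 16)(h_i, h_j))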
6.3 Will additional information help? Benefits of incorporating good edge features
Even though we do not consider edge features in our analyses, our results in Theorem 13 imply that good attention functions which can classify the edges independently of the node features can be very helpful, as they reduce the threshold under which graph attention convolution would otherwise fail to separate the nodes. One potential way to obtain good attention functions that behave like the one given in (13) is to incorporate good edge features. Furthermore, our result in Theorem 9, which says that graph attention based on noisy node features cannot perfectly classify the edges, makes incorporating informative edge features (when they are available in practice) into the attention mechanism all the more important.
7 Conclusion and future work
In this work, we study the impact of graph attention on edges and its implications for node classification. We show that graph attention improves robustness to noise in the graph structure. We also show that graph attention may not be very useful in a “hard” regime where the node features are noisy. Our work shows that single-layer graph attention convolution has limited power at distinguishing intra-class from inter-class edges. Given the empirical successes of graph attention and its many variants, a promising direction for future work is to study the power of multi-layer graph attention convolutions for distinguishing intra-class and inter-class edges. Moreover, our negative results in Section 3.2 for edge/node classification pertain to perfect and almost perfect classification. In practice, misclassifying a small constant fraction of nodes/edges is often inevitable, yet the resulting classifier can still be useful. Therefore, an interesting line of future work is to characterize the threshold under which graph attention misclassifies a certain proportion of nodes. Finally, variants of graph attention networks have been successfully used in tasks other than node classification, such as link prediction and graph classification. These tasks are typically solved by architectures that add a final aggregation layer which combines the node representations generated by graph attention convolution. It is an interesting future direction to develop a good understanding of the benefits and limitations of the graph attention mechanism for those tasks.
Acknowledgement
K. Fountoulakis would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2019-04067, DGECR-2019-00147].
A. Jagannath acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2020-04597, DGECR-2020-00199].
Appendix A Proofs
We define the following high-probability events which will be used in some proofs. Each of these events holds with probability at least , which follows from straightforward applications of Chernoff bound and union bound, e.g., see [7].
Definition 17.
Define the following events over the randomness of and and ,
• is the event that and .
• is the event that for each , .
• is the event that for each , and .
• is the event that for each , .
• is the intersection of the above 4 events.
Lemma 18 ([7]).
With probability at least event holds.
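To give one representative instance of how such an event is established (a sketch in our own notation, under the assumption that the two classes are generated by i.i.d. balanced assignments, so that the class size \(|C_1|\) is a sum of \(n\) independent Bernoulli(\(1/2\)) variables with mean \(n/2\)): the multiplicative Chernoff bound gives, for any \(\delta \in (0,1)\),
\[
\Pr\Big( \big|\,|C_1| - \tfrac{n}{2}\,\big| \ge \delta\,\tfrac{n}{2} \Big) \;\le\; 2\exp\!\big(-\delta^2 n/6\big),
\]
and a union bound over the finitely many events in Definition 17 then gives a statement of the form of Lemma 18.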
Some of our proofs also utilize the following simple observation on the mutual independence among when are i.i.d. Gaussian random vectors.
Observation 19.
Fix in and let be i.i.d. drawn from . Then are independent.
Proof: Note that since , it suffices to prove that the covariance for all . By definition, for ,
where the last equality follows from independence between and .
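For reference, the standard ingredients behind covariance computations of this type are the following (stated in our own notation; the precise random variables are the ones defined in Observation 19): for jointly Gaussian random variables, zero covariance is equivalent to independence, and for fixed vectors \(w, w'\),
\[
\operatorname{Cov}\big(\langle w, X\rangle, \langle w', X\rangle\big) = \sigma^2 \langle w, w'\rangle \quad\text{for } X \sim N(\mu, \sigma^2 I),
\qquad
\operatorname{Cov}\big(\langle w, X_i\rangle, \langle w', X_j\rangle\big) = 0 \quad\text{for independent } X_i, X_j.
\]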
A.1 Proof of Theorem 3
We restate Theorem 3 for convenience.
Theorem.
Recall from (5) that
We will condition on the event that for all , which holds with probability at least following a union bound and the Gaussian tail probability. Under this event, because , for all we have
and hence
which implies that for . Similarly, we get that
(14)
Therefore, with probability at least , we have that
which means perfect separability of edges, and the proof is complete.
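The conditioning step above is an instance of the standard Gaussian tail plus union bound argument; in generic form (our notation, for centered Gaussian variables \(Z_1,\dots,Z_m\) with variance at most \(\sigma^2\)),
\[
\Pr\Big(\max_{k \le m} |Z_k| > t\Big) \;\le\; \sum_{k=1}^{m} \Pr\big(|Z_k| > t\big) \;\le\; 2m\,e^{-t^2/(2\sigma^2)},
\]
so choosing \(t = \sigma\sqrt{2\log(m/\delta)}\) makes the failure probability at most \(2\delta\).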
A.2 Proof of Corollary 4
We restate Corollary 4 for convenience.
Corollary.
Suppose that and that Assumption 1 holds. Then with probability at least over the data , the attention architecture yields attention coefficients such that
1. If , then if is an intra-class edge and otherwise;
2. If , then if is an inter-class edge and otherwise.
The proof is straightforward by considering the cases and separately. When , we have . Using the specification of in (3) and (4), the definition of attention coefficients in (1), the high probability event in Lemma 18, the expression of in (14), and picking such that and , we obtain the claimed results. The result when is obtained in the same way.
A.3 Proof of Corollary 5
We restate Corollary 5 for convenience.
Corollary.
We prove the case and the case follows analogously. Consider the attention architecture , where is given in (3) and (4). Pick such that and . Assume that , and denote the graph attention convolution output as
We will condition on the event , which holds with probability at least . By using Corollary 4 we have
Similarly, we have that
This means that for . Applying the same reasoning we get that for . Therefore, with probability at least , the graph attention convolution separates the nodes.
A.4 Proof of Lemma 8
We restate Lemma 8 for convenience.
Lemma.
Let and let be defined as in (9). The Bayes optimal classifier for is realized by the following function,
where and .
Proof: Note that is a mixture of -dimensional Gaussian distributions,
The optimal classifier is then given by
Note that and . Thus, by Bayes rule, we obtain that
Suppose that such that . Then if and only if . Hence, for we require that
Similarly, we obtain the reverse condition for .
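The recipe used in this proof can be illustrated on the simplest symmetric case (a sketch only; this need not be the exact mixture appearing in Lemma 8): for the mixture \(\tfrac12 N(\boldsymbol{\mu},\sigma^2 I) + \tfrac12 N(-\boldsymbol{\mu},\sigma^2 I)\), the log-likelihood ratio is
\[
\log\frac{\Pr(x \mid +)}{\Pr(x \mid -)}
= \frac{\|x+\boldsymbol{\mu}\|^2 - \|x-\boldsymbol{\mu}\|^2}{2\sigma^2}
= \frac{2\langle x, \boldsymbol{\mu}\rangle}{\sigma^2},
\]
so with equal priors the Bayes optimal classifier reduces to \(h^*(x) = \operatorname{sign}(\langle x, \boldsymbol{\mu}\rangle)\).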
A.5 Proof of Theorem 9
We restate Theorem 9 for convenience.
Theorem.
Suppose for some and let be any attention mechanism. Then,
1. With probability at least , fails to correctly classify at least a fraction of inter-class edges;
2. For any , if , then with probability at least , misclassifies at least one inter-class edge.
We will write if node and node are in the same class and otherwise. From Lemma 8, we observe that for successful classification by the optimal classifier, we need
We will split the analysis into two cases. First, note that when we have for that
In the first implication, we used that , while the second implication follows from the fact that for all . Similarly, for we have for that
Therefore, in each of the above cases, we can upper bound the probability that is correctly classified, for either or , by the probability of the event , or equivalently . We focus on the former, as the latter is symmetric. Writing and , we have that for and ,
where we denote . In the second-to-last step above, we used the triangle inequality to pull outside the absolute value, while in the last equation we use .
We now denote for all . Then the above probability is , where are independent random variables. Note that we have
(15)
To see how we obtain the last equation, observe that if then we have
by triangle inequality
by reverse triangle inequality
hence, . On the other hand, for , we condition on the events and and analyze the four cases determined by the signs of and . We denote by the event that , and analyze the cases in detail. First, consider the case :
The sum of the four probabilities in the above is . Similarly, we analyze the other case, :
The sum of the four probabilities above is . Therefore, we obtain that
which justifies (15).
We will now construct sets of pairs with mutually independent elements, such that the union of those sets covers all inter-class edges. This will enable us to use a concentration argument that computes the fraction of the inter-class edges that are misclassified. Since the graph operations are permutation invariant, let us assume for simplicity that and for an even number of nodes . Also, define the function
We now construct the following sequence of sets for all :
Fix and observe that the pairs in the set are mutually independent. Define a Bernoulli random variable, , to be the indicator that is misclassified. We have that . Note that the fraction of pairs in the set that are misclassified is , which is a sum of independent Bernoulli random variables. Hence, by the additive Chernoff bound, we obtain
Since , we have by the Chernoff bound and a union bound that with probability at least , for all . We now choose to obtain that on the event where , we have the following for any large :
Following a union bound over all , we conclude that for any ,
Thus, out of all the pairs with , with probability at least , we have that at least a fraction of the pairs are misclassified by the attention mechanism. This concludes part 1 of the theorem.
For part 2, note that by the additive Chernoff bound we have for any ,
Since with probability at least , we choose to obtain
Now note that if then we have , which implies that
Hence, in this regime of ,
and the proof is complete.
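Both parts of the proof rely on the additive Chernoff (Hoeffding) bound, which for independent random variables \(Z_1,\dots,Z_m\) taking values in \([0,1]\) with \(p = \tfrac1m\sum_k \mathbb{E}[Z_k]\) reads (our statement of the standard bound):
\[
\Pr\Big( \Big|\tfrac{1}{m}\sum_{k=1}^m Z_k - p\Big| \ge t \Big) \;\le\; 2\exp\big(-2mt^2\big), \qquad t > 0.
\]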
A.6 Proof of Theorem 10
We restate Theorem 10 for convenience.
Theorem.
Assume that and for some absolute constants and . Moreover, assume that the parameters are bounded. Then, with probability at least over the data , there exists a subset with cardinality at least such that for all the following hold:
1. There is a subset with cardinality at least , such that for all .
2. There is a subset with cardinality at least , such that for all .
For let us write where , if and if . Moreover, since the parameters are bounded, we can write and such that and and are some constants. We define the following sets which will become useful later in our computation of ’s. Define
For define
where .
We start with a few claims about the sizes of these sets.
Claim 20.
With probability at least , we have that .
Proof: Because we know that is a superset of where
We give a lower bound for and hence prove the result. First of all, note that if , then and we easily get that with probability at least , for all , and thus . Therefore let us assume without loss of generality that . Consider the following sum of indicator random variables
By the multiplicative Chernoff bound, for any we have
where . Moreover, by the standard upper bound on the Gaussian tail probability (Proposition 2.1.2, [45]) we know that . Let us set
Then by the upper bound on and the assumption that we know that
It follows that
Therefore, with probability at least we have that
Applying the concentration result for node degrees, we get that with probability at least ,
Therefore we have
Claim 21.
With probability at least , we have that for all ,
Proof: We prove the result for ; the result for follows analogously. First, fix . For each we have that
Denote . We have that
Applying Chernoff’s inequality (Theorem 2.3.4 in [45]), we have
Applying the union bound, we get
The second inequality follows because under the event (cf. Definition 17) for all . The last equality is due to our assumption that .
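For reference, the standard bounds invoked in the proofs of Claims 20 and 21 are the Gaussian tail bound (the proof above cites Proposition 2.1.2 of [45]) and the lower-tail multiplicative Chernoff bound, which in their usual forms read
\[
\Pr(Z > t) \le \frac{1}{t\sqrt{2\pi}}\,e^{-t^2/2} \quad (Z \sim N(0,1),\ t>0),
\qquad
\Pr\Big(\sum_{k=1}^{m} B_k \le (1-\delta)\,mp\Big) \le \exp\!\Big(-\frac{\delta^2 mp}{2}\Big),
\]
for independent \(B_k \sim \mathrm{Bernoulli}(p)\) and \(\delta \in (0,1)\).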
Claim 22.
With probability at least , we have that for all and for all ,
Proof: We prove the result for , and the result for follows analogously. First fix and . By the additive Chernoff inequality, we have
Taking a union bound over all and we get
where the last equality follows from Assumption 1 that , and hence
for some absolute constant . Moreover, we have used degree concentration, which introduced the additional additive term in the probability upper bound. Therefore we have
We start by defining an event which is the intersection of the following events over the randomness of and and ,
• is the event that for each , and .
• is the event that .
• is the event that and for all .
• is the event that and for all and for all .
By Claims 20, 21, 22, we get that with probability at least , the event holds. We will show that under event , for all , for all where , we have . This will prove Theorem 10.
Fix and some . Let us consider
where for , we denote
and . We will show that
and hence conclude that . First of all, note that since for some absolute constant , we know that
Let us assume that and consider the following two cases regarding the magnitude of .
Case 1. If , then
where is the slope of for . Here, the second equality follows from and . The third equality follows from
• We have and hence ;
• We have , so ; moreover, because , we get that ;
• We have .
Case 2. If , then
To see the last (in)equality in the above, consider the following cases:
1. If , then there are two cases depending on the sign of .
   • If , then we have that
   • If , then because and , we know that and . Therefore it follows that
   where the last equality is due to the fact that so .
2. If , then there are two cases depending on the sign of .
   • If , then we have that
   • If , then we have that
   where is the slope of .
Combining the two cases regarding the magnitude of and our assumption that , so far we have shown that, for any such that , for all , we have
(16)
By following a similar argument, one can show that Equation 16 holds for any such that .
Let us now compute
for some . Let us focus on first. We will show that .
First of all, we have that
(17)
where is an absolute constant (possibly negative). On the other hand, consider the following partition of :
It is easy to see that
(18)
where is an absolute constant. Moreover, because we have that . It follows that
(19)
where is an absolute constant. We can upper bound the above quantity as follows. Under the Event , we have that
where
It follows that
(20)
where is an absolute constant. The third inequality in the above follows from
• The series converges absolutely for any constant ;
• The sum because
for any . In particular, by picking we see that , and hence we get .
Combining Equations 19 and 20 we get
(21)
and combining Equations 18 and 21 we get
(22)
Now, by Equations 17 and 22 we get
(23)
It turns out that repeating the same argument for yields
(24)
Finally, Equations 23 and 24 give us
which readily implies
as required. We have shown that for all and for all , . Repeating the same argument, we get that the same result holds for all and for all , too. Hence, by Claims 20 and 21 about the cardinalities of , and , we have proved Theorem 10.
A.7 Proof of Proposition 13
We restate Proposition 13 for convenience.
Proposition 23.
Suppose that satisfy Assumption 1 and that are bounded away from 1. For every , there are absolute constants such that, with probability at least over the data , using the graph attention convolution in (2) and the attention architecture in (13), the model misclassifies at least nodes for any such that , if
1. and ;
2. and .
We start with part 1 of the proposition. Let us assume that . The result when follows analogously. We will condition on the events defined in Definition 17. These events are concerned with the concentration of class sizes and and the concentration of the number of intra-class and inter-class edges, i.e. and for all . By Lemma 18, the probability that these events hold simultaneously is at least . Fix any such that . Without loss of generality, assume that . Because , by the definition of in (13) and the attention coefficients in (1) we have that
(25)
for some positive constants and . Let us write where , if and if . Let us consider the classification of nodes in based on the model output for node . Note that, depending on whether or , the expectation of the model output is symmetric around 0, and therefore we consider the decision boundary at 0. Let and fix any partition of into a set of disjoint subsets,
such that and for every . In what follows we consider the classification of nodes in each separately. We will show that, with probability at least , the model misclassifies at least one node in each of more than half of the subsets where ; consequently, the model misclassifies at least nodes, giving the required bound on the misclassification rate.
Fix any . Using (25) we get that, for large enough , the event that the model correctly classifies all nodes in satisfies
for some absolute constant , and hence the probability that the model correctly classifies all nodes in satisfies, for large enough ,
where the last inequality follows from our assumption on and we denote . Now we will use Sudakov’s minoration inequality [45] to obtain a lower bound on the expected maximum, and then apply Borell’s inequality to upper bound the above probability. In order to apply Sudakov’s result we will need to define a metric over the node index set . Let . For , , consider the canonical metric given by
where is the symmetric difference of the neighbors of and , is an absolute constant, and the inequality is due to (25). We lower bound as follows. For , , and a node , the probability that is a neighbor of exactly one of and is if and if . Therefore we have . It follows from the multiplicative Chernoff bound that for any ,
Choose
where the last equality follows from , which is in turn due to the assumptions that and are bounded away from 1. Applying a union bound over all , we get that with probability at least , the size of satisfies
(26)
Therefore it follows that, for large enough ,
We condition on the event that the inequality (26) holds for all , which happens with probability at least . Applying Sudakov’s minoration with metric , we get that for large enough , for all ,
(27)
for some absolute constants . The last inequality in the above follows from . In addition, note that since by assumption is independent from the node features, using (25) we have that is Gaussian with variance . We use Borell’s inequality ([2] Chapter 2) to get that for any and large enough ,
for some absolute constant . By the lower bound of the expectation (27), we get that
Now, let be any constant that also satisfies . Recall that we defined , and hence . Set
(28)
then, combining with the events we have conditioned on so far, we get
Recall that the above probability is the probability of correctly classifying all nodes in . Since our choice of was arbitrary, this applies to every . Let denote the indicator variable of the event that there is at least one node in that is misclassified, then
Applying the reverse Markov inequality, we get
Therefore, with probability at least , we have . This implies the required result that the model misclassifies at least nodes.
The proof of part 2 is similar to that of part 1. Let us assume that , since the result when can be proved analogously. We condition on the events defined in Definition 17, which simultaneously hold with probability at least by Lemma 18. Fix any such that . Because , by the definition of in (13) and the attention coefficients in (1) we have that
(29)
Write where , if and if . We consider the classification of nodes in based on the model output for node and the decision boundary at 0. As before, let and fix any partition of into a set of disjoint subsets such that and for every . We proceed to show that, with high probability, the model misclassifies at least one node in each of more than half of the subsets where , and consequently misclassifies at least nodes as required. Fix any . Using (29) we get that, for large enough , the event that the model correctly classifies all nodes in satisfies
for some absolute constant , and hence the probability that the model classifies all nodes in correctly satisfies, for large enough ,
where the last inequality follows from our assumption on and we denote . The rest of the proof of part 2 proceeds as the proof of part 1.
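For completeness, the two Gaussian-process tools used in this proof take the following standard forms (our statement; constants are not matched to the absolute constants used above). For a centered Gaussian process \((X_t)_{t\in T}\) with canonical metric \(d\), Sudakov's minoration and Borell's inequality state
\[
\mathbb{E}\sup_{t\in T} X_t \;\ge\; c\,\varepsilon\sqrt{\log \mathcal{N}(T,d,\varepsilon)} \quad \text{for every } \varepsilon>0,
\qquad
\Pr\Big(\sup_{t\in T} X_t \le \mathbb{E}\sup_{t\in T} X_t - u\Big) \le e^{-u^2/(2\sigma_T^2)},
\]
where \(\mathcal{N}(T,d,\varepsilon)\) is the \(\varepsilon\)-covering number of \((T,d)\), \(c>0\) is an absolute constant, and \(\sigma_T^2 = \sup_{t\in T}\operatorname{Var}(X_t)\).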
A.8 Proof of Theorem 14
We restate Theorem 14 for convenience. Recall that we write for some .
Theorem.
We will condition on the following two events concerning the values of for . Both events hold with probability at least . First, for consider the event
Since and , each follows a normal distribution with mean and variance (recall that generates the node memberships in CSBM), we have that
Let us pick such that
This gives that
Second, consider the event that . By Lemma 18 we know that it holds with probability at least . Under these two events, we have that
Recall from (5) that
Consider an arbitrary node where . We will show that . By the definition of attention coefficients in (1), we know that
Using the above expression for , we have that
(30)
where the second-to-last inequality follows from the event that for all , and , and the last inequality follows from the fact that the function
is increasing with respect to for . To see that , consider the following cases separately.
•
•
This shows that whenever . Similarly, one easily gets that whenever . Therefore the proof is complete.
A.9 Proof of Corollary 16
We restate Corollary 16 for convenience. Recall that we write for some .
Corollary.
With probability at least over the data , using the two-layer MLP attention architecture given in (3) and (4) with , one has that
• (Perfect classification) If then all nodes are correctly classified;
• (Almost perfect classification) If then at least fraction of all nodes are correctly classified;
• (Partial classification) If then at least fraction of all nodes are correctly classified.
Fix and write where , where we recall that defines the class membership of node . We have that follows a normal distribution with mean and standard deviation . Part 1 of the corollary is exactly Proposition 7, whose proof is given in the main text. We consider the other cases for . For both classes and , the probability of correct classification is (using 0 as the decision boundary). Therefore, by applying the additive Chernoff bound, we have that
The proof is complete by noticing that when and when .
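To spell out the per-node probability used above (a sketch with placeholder mean \(m>0\) and standard deviation \(s\) for the output of a node whose correct sign is positive): with the decision boundary at \(0\),
\[
\Pr(\text{correct classification}) = \Pr\big(N(m, s^2) > 0\big) = \Phi\big(m/s\big),
\]
and the additive Chernoff bound, applied as in the proof, shows that the fraction of correctly classified nodes concentrates around this value.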
References
- [1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1–86, 2018.
- [2] Robert J Adler, Jonathan E Taylor, et al. Random fields and geometry, volume 80. Springer, 2007.
- [3] T.W. Anderson. An introduction to multivariate statistical analysis. John Wiley & Sons, 2003.
- [4] A. Athreya, D. E. Fishkind, M. Tang, C. E. Priebe, Y. Park, J. T. Vogelstein, K. Levin, V. Lyzinski, Y. Qin, and D. L. Sussman. Statistical inference on random dot product graphs: A survey. Journal of Machine Learning Research, 18(226):1–92, 2018.
- [5] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), page 2001–2009, 2016.
- [6] D. Bahdanau, K. H. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
- [7] A. Baranwal, K. Fountoulakis, and A. Jagannath. Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pages 684–693, 2021.
- [8] N. Binkiewicz, J. T. Vogelstein, and K. Rohe. Covariate-assisted spectral clustering. Biometrika, 104:361–377, 2017.
- [9] Cristian Bodnar, Francesco Di Giovanni, Benjamin Chamberlain, Pietro Lio, and Michael Bronstein. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in gnns. Advances in Neural Information Processing Systems (NeurIPS), 35:18527–18541, 2022.
- [10] X. Bresson and T. Laurent. Residual gated graph convnets. In arXiv:1711.07553, 2018.
- [11] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? In International Conference on Learning Representations (ICLR), 2022.
- [12] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
- [13] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
- [14] Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations (ICLR), 2019.
- [15] E. Chien, J. Peng, P. Li, and O. Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations (ICLR), 2021.
- [16] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), page 3844–3852, 2016.
- [17] Y. Deshpande, A. Montanari, S. Sen, and E. Mossel. Contextual stochastic block models. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- [18] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS), volume 45, page 2224–2232, 2015.
- [19] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- [20] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 119, pages 3419–3430, 2020.
- [21] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks (IJCNN), 2005.
- [22] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1025–1035, 2017.
- [23] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. In arXiv:1506.05163, 2015.
- [24] Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. B. Ma, H. Chen, and M.-C. Yang. Measuring and improving the use of graph information in graph neural networks. In International Conference on Learning Representations (ICLR), 2019.
- [25] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [26] S. Jegelka. Theory of graph neural networks: Representation and learning. In arXiv:2204.07697, 2022.
- [27] N. Keriven, A. Bietti, and S. Vaiter. On the universality of graph neural networks on large random graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [28] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
- [29] B. Knyazev, G. W. Taylor, and M. Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 4202–4212, 2019.
- [30] B. J. Lee, R. A. Rossi, S. Kim, K. N. Ahmed, and E. Koh. Attention models in graphs: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 2019.
- [31] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97–109, 2018.
- [32] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.
- [33] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 20887–20902, 2021.
- [34] A. Loukas. How hard is to distinguish graphs with graph neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [35] A. Loukas. What graph neural networks cannot learn: Depth vs width. In International Conference on Learning Representations (ICLR), 2020.
- [36] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 1362–1375, 2022.
- [37] S. Maskey, R. Levie, Y. Lee, and G. Kutyniok. Generalization analysis of message passing neural networks on large random graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [38] M. Minsky and S. Papert. Perceptrons: an introduction to computational geometry, 1969.
- [39] C. Moore. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. Bulletin of The European Association for Theoretical Computer Science, 1(121), 2017.
- [40] John Palowitch, Anton Tsitsulin, Brandon Mayer, and Bryan Perozzi. Graphworld: Fake graphs bring real insights for gnns. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 3691–3701, 2022.
- [41] O. Puny, H. Ben-Hamu, and Y. Lipman. Global attention improves graph networks generalization. In arXiv:2006.07846, 2020.
- [42] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 2009.
- [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), page 6000–6010, 2017.
- [44] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
- [45] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [46] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
- [47] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
- [48] Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In IEEE International Conference on Data Mining (ICDM), pages 1287–1292. IEEE, 2022.
- [49] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.