
Graph Attention Retrospective

Kimon Fountoulakis∗†    Amit Levi∗‡    Shenghao Yang∗†

Aseem Baranwal    Aukosh Jagannath§
Abstract

Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an “easy” regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the “hard” regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention’s robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.

∗ Equal contribution.
† David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada.
‡ Huawei Noah’s Ark Lab, Montreal, Canada.
§ Department of Statistics and Actuarial Science, Department of Applied Mathematics, University of Waterloo, Waterloo, Canada.
Emails: [email protected], [email protected], [email protected], [email protected], [email protected]

1 Introduction

Graph learning has received a lot of attention recently due to breakthrough learning models [21, 42, 13, 18, 23, 5, 16, 22, 28] that are able to exploit multi-modal data consisting of nodes, their edges, and the features of the nodes. One of the most important problems in graph learning is classification, where the goal is to classify the nodes or edges of a graph given the graph and the features of the nodes. Two of the most fundamental mechanisms for classification, and graph learning in general, are graph convolution and graph attention. Graph convolution, usually defined using its spatial version, corresponds to averaging the features of a node with the features of its neighbors [28]. (Although the model in [28] is related to spectral convolutions, it is mainly a spatial convolution since messages are propagated along graph edges. More broadly, graph convolution can refer to different variants arising from different (approximations of) graph spectral filters; we provide details in Section 2.) Graph attention [44] mechanisms augment this convolution by appropriately weighting the edges of a graph before spatially convolving the data. Graph attention is able to do this by using information from the given features of each node. Despite its wide adoption by practitioners [19, 46, 25] and its large academic impact, the number of works that rigorously study its effectiveness is quite limited.

One of the motivations for using a graph attention mechanism as opposed to a simple convolution is the expectation that the attention mechanism is able to distinguish inter-class edges from intra-class edges and consequently weights inter-class edges and intra-class edges differently before performing the convolution step. This ability essentially maintains the weights of important edges and significantly reduces the weights of unimportant edges, and thus it allows graph convolution to aggregate features from a subset of neighbor nodes that would help node classification tasks. In this work, we explore the regimes in which this heuristic picture holds in simple node classification tasks, namely classifying the nodes in a contextual stochastic block model (CSBM) [8, 17]. The CSBM is a coupling of the stochastic block model (SBM) with a Gaussian mixture model, where the features of the nodes within a class are drawn from the same component of the mixture model. For a more precise definition, see Section 2. We focus on the case of two classes where the answer to the above question is sufficiently precise to understand the performance of graph attention and build useful intuition about it. We briefly and informally summarize our contributions as follows:

  1.

    In the “easy regime”, i.e., when the distance between the means of the Gaussian mixture model is much larger than the standard deviation, we show that there exists a choice of attention architecture that distinguishes inter-class edges from intra-class edges with high probability (Theorem 3). In particular, we show that the attention coefficients for one class of edges are much higher than the other class of edges (Corollary 4). Furthermore, we show that these attention coefficients lead to a perfect node classification result (Corollary 5). However, in the same regime, we also show that the graph is not needed to perfectly classify the data (Proposition 7).

  2.

    In the “hard regime”, i.e., when the distance between the means is small compared to the standard deviation, we show that every attention architecture is unable to distinguish inter-class from intra-class edges with high probability (Theorem 9). Moreover, we show that using the original Graph Attention Network (GAT) architecture [44], with high probability, most of the attention coefficients are going to have uniform weights, similar to those of simple graph convolution [28] (Theorem 10). We also consider a setting where graph attention is provided with independent and perfect edge information, and we show that even in this case, if the distance between the means is sufficiently small, then graph attention would fail to (almost) perfectly classify the nodes with high probability (Theorem 13).

  3.

    Beyond perfect node classification, we show that graph attention is able to assign a significantly higher weight to self-loops, irrespective of the parameters of the Gaussian mixture model that generates node features or the stochastic block model that generates the graph. Consequently, we show that with high probability, graph attention convolution is at least as good as the best linear classifier for node features (Theorem 14). In a high structural noise regime where the graph is not helpful for node classification, i.e., when the intra-class edge probability $p$ is close to the inter-class edge probability $q$, this means that graph attention is strictly better than simple graph convolution because it is able to essentially ignore the graph structural noise (Corollary 15). In addition, we obtain lower bounds on graph attention’s performance for almost perfect node classification and partial node classification (Corollary 16).

  4.

    We provide an extensive set of experiments both on synthetic data and on three popular real-world datasets that validates our theoretical results.

1.1 Limitations of our theoretical setting

In this work, to study the benefits and the limitations of the graph attention mechanism, we focus on node classification tasks using the CSBM generative model with two classes and Gaussian node features. This theoretical setting has a few limitations. First, we note that many real-world networks are much more complex and may exhibit different structures than the ones obtainable from a stochastic block model. Furthermore, real-world node attributes such as categorical features may not even follow a continuous probability distribution. Admittedly, there are clear gaps between CSBM and the actual generative processes of real-world data. Nonetheless, a good understanding of graph attention’s behavior over CSBM already provides useful insights. For example, it has been shown empirically that synthetic graph datasets based on CSBM constitute good benchmarks for evaluating various GNN architectures [40]. To that end, in Section 6, we summarize some practical implications of our theoretical results which could be useful for practitioners working with GNNs.

Second, there are different levels of granularity for machine learning problems on graphs. Besides node-level tasks such as node classification, other important problems include edge-level tasks such as link prediction and graph-level tasks such as graph classification. While, empirically, variants of graph attention networks have been successfully applied to solve problems at all levels of granularity, our theoretical results mostly pertain to the effects of graph attention for node classification. This is a limitation on the scope of our study. On the other hand, we note that edge-level and graph-level tasks are typically solved by adding a pooling layer which combines node representations from previous graph (attention) convolution layers. Consequently, the quality of graph attention convolution’s output for a node not only affects node classification but also has a major influence on link prediction and graph classification. Therefore, our results may be extended to study the effects of graph attention on link prediction and graph classification under the CSBM generative model. For example, link prediction is closely related to edge classification which we extensively study in Section 3. For graph classification, one may consider the problem of classifying graphs generated from different parameter settings of the CSBM. In this case, our results on graph attention’s impact on node representations may be used to establish results on graph attention’s impact on graph representations after certain pooling layers. In particular, if graph attention convolution generates good node representations that are indicative of node class memberships, then one can show that the graph representation obtained from appropriately aggregating node representations would be indicative of the clustering structure of the graph, and hence the graph representation would be useful for graph classification tasks.

1.2 Relevant work

Recently the concept of attention for neural networks [6, 43] was transferred to graph neural networks [32, 10, 44, 30, 41]. A few papers have attempted to understand the attention mechanism in [44]. One work relevant to ours is [11]. In this paper, the authors show that a node may fail to assign large edge weight to its most important neighbors due to a global ranking of nodes generated by the attention mechanism in [44]. Another related work is [29], which presents an empirical study of the ability of graph attention to generalize on larger, complex, and noisy graphs. In addition, in [24] the authors propose a different metric to generate the attention coefficients and show empirically that it has an advantage over the original GAT architecture.

Other related work to ours, which does not focus on graph attention, comes from the field of statistical learning on random data models. Random graphs and the stochastic block model have been traditionally used in clustering and community detection [1, 4, 39]. Moreover, the works by [8, 17], which also rely on CSBM are focused on the fundamental limits of unsupervised learning. Of particular relevance is the work by [7], which studies the performance of graph convolution on CSBM as a semi-supervised learning problem. Within the context of random graphs, [27] studies the approximation power of graph neural networks on random graphs. In [37] the authors derive generalization error of graph neural networks for graph classification and regression tasks. In our paper we are interested in understanding graph attention’s capability for edge/node classification with respect to different parameter regimes of CSBM.

Finally, there are a few related theoretical works on understanding the generalization and representation power of graph neural networks [14, 15, 49, 47, 20, 34, 35]. For a recent survey in this direction see [26]. Our work takes a statistical perspective which allows us to characterize the precise performance of graph attention compared to graph convolution and no convolution for CSBM, with the goal of answering the particular questions that we posed above.

2 Preliminaries

In this section, we describe the Contextual Stochastic Block Model (CSBM) [17] which serves as our data model, and the Graph Attention mechanism [44]. We also define notations and terms that are frequently used throughout the paper.

For a vector $x\in\mathbb{R}^{n}$ with $n\in\mathbbm{N}$, the norm $\|x\|$ denotes the Euclidean norm of $x$, i.e. $\|x\|\stackrel{\rm def}{=}\big(\sum_{i\in[n]}x_{i}^{2}\big)^{1/2}$. We write $[n]\stackrel{\rm def}{=}\{1,2,\ldots,n\}$. We use $\operatorname{Ber}(p)$ to denote the Bernoulli distribution, so $x\sim\operatorname{Ber}(p)$ means the random variable $x$ takes value 1 with probability $p$ and 0 with probability $1-p$. Let $d,n\in\mathbbm{N}$, and ${\epsilon}_{1},\ldots,{\epsilon}_{n}\sim\operatorname{Ber}(1/2)$. Define two classes as $C_{k}=\{j\in[n]\mid{\epsilon}_{j}=k\}$ for $k\in\{0,1\}$. For each index $i\in[n]$, we set the feature vector $\mathbf{X}_{i}\in\mathbb{R}^{d}$ as $\mathbf{X}_{i}\sim N((2{\epsilon}_{i}-1)\boldsymbol{\mu},\sigma^{2}\mathbf{I})$, where $\boldsymbol{\mu}\in\mathbb{R}^{d}$, $\sigma\in\mathbb{R}$ and $\mathbf{I}\in\{0,1\}^{d\times d}$ is the identity matrix. (The means of the mixture of Gaussians are $\pm\boldsymbol{\mu}$. Our results can be easily generalized to general means; the current setting makes our analysis simpler without loss of generality.) For a given pair $p,q\in[0,1]$ we consider the stochastic adjacency matrix $\mathbf{A}\in\{0,1\}^{n\times n}$ defined as follows. For $i,j\in[n]$ in the same class (i.e., an intra-class edge), we set $a_{ij}\sim\operatorname{Ber}(p)$, and if $i,j$ are in different classes (i.e., an inter-class edge), we set $a_{ij}\sim\operatorname{Ber}(q)$. We denote by $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$ a sample obtained according to the above random process.
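For concreteness, the following NumPy sketch shows one way to sample from this model. It is our own illustration written directly from the definitions above; the function name sample_csbm and its interface are not from the paper.

```python
import numpy as np

def sample_csbm(n, p, q, mu, sigma, seed=0):
    """Sample (X, A) ~ CSBM(n, p, q, mu, sigma^2) as defined above."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    eps = rng.integers(0, 2, size=n)                    # eps_i ~ Ber(1/2); classes C_0, C_1
    X = (2 * eps - 1)[:, None] * mu + sigma * rng.standard_normal((n, mu.size))
    same = eps[:, None] == eps[None, :]                 # intra-class indicator
    probs = np.where(same, p, q)                        # Ber(p) within a class, Ber(q) across classes
    upper = np.triu(rng.random((n, n)) < probs, k=1)    # sample each unordered pair once
    A = (upper | upper.T).astype(int)                   # symmetric {0,1} adjacency, no self-loops
    return X, A, eps

# Example: a small homophilous sample (p > q).
X, A, eps = sample_csbm(n=200, p=0.5, q=0.1, mu=np.ones(5), sigma=0.1)
```

The class labels eps are returned only for evaluation; a learning algorithm observes only X and A.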

An advantage of CSBM is that it allows us to control the noise by controlling the parameters of the distributions of the model. In particular, CSBM allows us to control the distance of the means and the variance of the Gaussians, which are important for controlling the separability of the Gaussians. For example, fixing the variance, the closer the means are, the more difficult the separation of the node Gaussian features becomes. Another notable advantage of CSBM is that it allows us to control the structural noise and homophily level of the graph. The level of noise in the graph is controlled by increasing or reducing the gap between the intra-class edge probability $p$ and the inter-class edge probability $q$, and the level of homophily in the graph is controlled by the relative magnitude of $p$ and $q$. For example, when $p$ is much larger than $q$, a node is more likely to be connected with a node from the same class, and hence we obtain a homophilous graph; when $q$ is much larger than $p$, a node is more likely to be connected with a node from a different class, and hence we obtain a heterophilous graph. There are several recent works exploring the behavior of GNNs in heterophilous graphs [33, 48, 9, 36]. Interestingly, we note that our results for graph attention’s behavior over the CSBM data model do not depend on whether the graph is homophilous or heterophilous.

Given node representations $\boldsymbol{h}_{i}\in\mathbb{R}^{F^{\prime}}$ for $i\in[n]$, a (spatial) graph convolution for node $i$ has output

\boldsymbol{h}_{i}^{\prime}=\sum_{j\in[n]}\mathbf{A}_{ij}c_{ij}\mathbf{W}\boldsymbol{h}_{j},\quad c_{ij}=\bigg(\sum_{\ell\in[n]}\mathbf{A}_{i\ell}\bigg)^{-1},

where $\mathbf{W}\in\mathbb{R}^{F\times F^{\prime}}$ is a learnable matrix. Throughout this paper, we will refer to this operation as simple graph convolution or standard graph convolution. Our definition of graph convolution is essentially the mean aggregation step in a general GNN layer [22]. The normalization constant $c_{ij}$ in our definition is closely related to the symmetric normalization $c_{ij}=(\sum_{\ell\in[n]}\mathbf{A}_{i\ell})^{-1/2}(\sum_{\ell\in[n]}\mathbf{A}_{j\ell})^{-1/2}$ used in the original Graph Convolutional Network (GCN) introduced by [28], and it does not affect the discussions or implications we have for GCN with symmetric normalization. More broadly, there are other forms of graph convolution in the literature [12, 16, 31], which we do not compare against in this work.
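As a quick illustration (our own sketch, not code from the paper; the function name is ours), the mean-aggregation convolution above can be written in a few lines of NumPy:

```python
import numpy as np

def simple_graph_convolution(A, H, W):
    """h'_i = sum_j A_ij c_ij W h_j with c_ij = 1/deg(i), i.e. mean aggregation.

    A: (n, n) adjacency, H: (n, F') node representations, W: (F, F') weight matrix.
    """
    deg = A.sum(axis=1, keepdims=True)
    deg = np.maximum(deg, 1)            # guard against isolated nodes
    return (A / deg) @ H @ W.T          # row-normalized averaging, then the linear map W
```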

A single-head graph attention applies some weight function on the edges based on their node features (or a mapping thereof). Given two representations $\boldsymbol{h}_{i},\boldsymbol{h}_{j}\in\mathbb{R}^{F^{\prime}}$ for two nodes $i,j\in[n]$, the attention model/mechanism is defined as the mapping

\Psi(\boldsymbol{h}_{i},\boldsymbol{h}_{j})\stackrel{\rm def}{=}\alpha(\mathbf{W}\boldsymbol{h}_{i},\mathbf{W}\boldsymbol{h}_{j})

where $\alpha:\mathbb{R}^{F}\times\mathbb{R}^{F}\to\mathbb{R}$ and $\mathbf{W}\in\mathbb{R}^{F\times F^{\prime}}$ is a learnable matrix. The attention coefficient for a node $i$ and its neighbor $j$ is defined as

\gamma_{ij}\stackrel{\rm def}{=}\frac{\exp(\Psi(\boldsymbol{h}_{i},\boldsymbol{h}_{j}))}{\sum_{\ell\in N_{i}}\exp(\Psi(\boldsymbol{h}_{i},\boldsymbol{h}_{\ell}))},    (1)

where $N_{i}$ is the set of neighbors of node $i$, which also includes node $i$ itself. Let $f$ be some element-wise activation function (usually nonlinear); the graph attention convolution output for a node $i\in[n]$ is given by

\boldsymbol{h}^{\prime}_{i}=\sum_{j\in[n]}\mathbf{A}_{ij}\gamma_{ij}\mathbf{W}\boldsymbol{h}_{j},\qquad \tilde{\boldsymbol{h}}_{i}=f(\boldsymbol{h}^{\prime}_{i}).    (2)

A multi-head graph attention [44] uses $K\in\mathbbm{N}$ weight matrices $\mathbf{W}^{1},\ldots,\mathbf{W}^{K}\in\mathbb{R}^{F\times F^{\prime}}$ and averages their individual (single-head) outputs. We consider the most simplified case of a single graph attention layer (i.e., $F^{\prime}=d$ and $F=1$) where $\alpha$ is realized by an MLP using the $\mathrm{LeakyRelu}$ activation function. The $\mathrm{LeakyRelu}$ activation function is defined as $\mathrm{LeakyRelu}(x)=x$ if $x\geq 0$ and $\mathrm{LeakyRelu}(x)=\beta x$ for some constant $\beta\in[0,1)$ if $x<0$.
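The following sketch (ours, not the paper's code) implements the single-head graph attention convolution of (1)-(2) for an arbitrary scoring function $\Psi$; it treats $N_i$ as including node $i$ itself, as stated above, and applies the softmax of (1) over each neighborhood.

```python
import numpy as np

def leaky_relu(x, beta=0.2):
    """LeakyRelu(x) = x for x >= 0 and beta * x for x < 0."""
    return np.where(x >= 0, x, beta * x)

def graph_attention_convolution(A, H, W, Psi, f=lambda x: x):
    """Single-head graph attention convolution, eqs. (1)-(2).

    A: (n, n) adjacency, H: (n, F') representations, W: (F, F') weights,
    Psi(h_i, h_j): scalar attention score, f: element-wise activation.
    """
    n = A.shape[0]
    A_loop = A.astype(float).copy()
    np.fill_diagonal(A_loop, 1.0)                          # N_i includes node i itself
    scores = np.array([[Psi(H[i], H[j]) for j in range(n)] for i in range(n)])
    weights = np.exp(scores) * A_loop                      # keep only ell in N_i
    gamma = weights / weights.sum(axis=1, keepdims=True)   # attention coefficients (1)
    return f(gamma @ (H @ W.T))                            # h'_i = sum_j gamma_ij W h_j, then f
```

Setting Psi to a constant recovers the simple graph convolution above (with self-loops added), while Section 3.1 constructs a specific two-layer MLP choice of $\Psi$.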

The CSBM model induces node features $\mathbf{X}$ which are correlated through the graph $G=([n],E)$, represented by an adjacency matrix $\mathbf{A}$. A natural requirement of attention architectures is to maintain important edges in the graph and ignore unimportant edges. For example, important edges could be the set of intra-class edges and unimportant edges could be the set of inter-class edges. In this case, if graph attention maintains all intra-class edges and ignores all inter-class edges, then a node from a class will be connected only to nodes from its own class. More specifically, a node $v$ will be connected to neighbor nodes whose associated node features come from the same distribution as the node features of $v$. Given two sets $A$ and $B$, we denote $A\times B\stackrel{\rm def}{=}\{(i,j):i\in A,j\in B\}$ and $A^{2}\stackrel{\rm def}{=}A\times A$.

3 Separation of edges and its implications for graph attention’s ability for perfect node classification

In this section, we study the ability of graph attention to separate important edges from unimportant edges. In addition, we study the implications of graph attention’s behavior for node classification. In particular, we are interested in whether or not the graph attention convolution is able to perfectly classify the nodes in the graph. Depending on the parameter regimes of the CSBM, we separate the results into two parts. In Section 3.1, we study graph attention’s behavior when the distance between the means of the node features is large. We construct a specific attention architecture and demonstrate its ability to separate intra-class edges from inter-class edges. Consequently, we show that the corresponding graph attention convolution is able to perfectly classify the nodes. In Section 3.2, we study graph attention’s behavior when the distance between the means of the node features is small. We show that with high probability no attention architecture is able to separate intra-class edges from inter-class edges. Consequently, we conjecture that no graph attention is able to perfectly classify the nodes. We provide evidence of this conjecture in Section 3.2.1, where we assume the existence of a strong attention mechanism that is able to perfectly classify the edges independently from node features. We show that using this “idealistic” attention mechanism still fails to (almost) perfectly classify the nodes, provided that the distance between the means of the node features is sufficiently small.

The two parameter regimes of the CSBM that we consider in Section 3.1 and Section 3.2 are as follows. The first (“easy regime”) is where $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$, and the second (“hard regime”) is where $\|\boldsymbol{\mu}\|=\kappa\sigma$ for some $0<\kappa\leq O(\sqrt{\log n})$. We start by defining edge separability and node separability.

Definition 1 (Edge separability).

Given an attention model $\Psi$, we say that the model separates the edges if the outputs satisfy

\operatorname{sign}(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))=-\operatorname{sign}(\Psi(\mathbf{X}_{k},\mathbf{X}_{\ell}))

whenever $(i,j)$ is an intra-class edge, i.e. $(i,j)\in(C_{1}^{2}\cup C_{0}^{2})\cap E$, and $(k,\ell)$ is an inter-class edge, i.e. $(k,\ell)\in E\setminus(C_{1}^{2}\cup C_{0}^{2})$.

Definition 2 (Node separability).

Given a classification model which outputs a scalar representation $h^{\prime}_{i}$ for node $i$, we say that the model separates the nodes if $h^{\prime}_{i}>0$ whenever $i\in C_{1}$ and $h^{\prime}_{i}<0$ whenever $i\in C_{0}$.

Some of our results in this section rely on a mild assumption that lower bounds the sparsity of the graph generated by the CSBM model. This assumption says that the expected degree of a node in the graph is larger than $\log^{2}n$. It is required to obtain a concentration of node degrees property used in the proofs. While this assumption may limit a direct application of our results to very sparse graphs, for example, graphs whose node degrees are bounded by a constant, it is mainly an artifact of the proof techniques that we use. We expect that similar results still hold for sparser graphs, but to rigorously remove the assumption on graph density one may have to adopt a different proof technique.

Assumption 1.

$p,q=\Omega(\log^{2}n/n)$.

3.1 “Easy Regime”

In this regime ($\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$) we show that a two-layer MLP attention is able to separate the edges with high probability. Consequently, we show that the corresponding graph attention convolution is able to separate the nodes with high probability. At a high level, we transform the problem of classifying an edge $(i,j)\in E$ into the problem of classifying a point $[\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}]$ in $\mathbb{R}^{2}$, where $\tilde{\boldsymbol{w}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\|$ is a unit vector that maximizes the total pairwise distances among the four means given below. When we consider the set of points $[\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}]$ for all $(i,j)\in E$, we can think of each point as a two-dimensional Gaussian vector whose mean is one of the following: $[\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$, $[-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$, $[\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$, $[-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$. The set of intra-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $[\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$ or $[-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$, while the set of inter-class edges corresponds to the set of bivariate Gaussian vectors whose mean is either $[-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$ or $[\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$. Therefore, in order to correctly classify the edges, we need to correctly classify the data corresponding to the means $[\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$ and $[-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu},-\tilde{\boldsymbol{w}}^{T}\boldsymbol{\mu}]$ as $1$, and classify the data corresponding to the other two means as $-1$. This problem is known in the literature as the “XOR problem” [38]. To achieve this we consider a two-layer MLP architecture $\Psi$ which separates the first and the third quadrants of the two-dimensional space from the second and the fourth quadrants. In particular, we consider the following specification of $\Psi(\mathbf{X}_{i},\mathbf{X}_{j})$,

\Psi(\mathbf{X}_{i},\mathbf{X}_{j})\stackrel{\rm def}{=}\boldsymbol{r}^{T}\mathrm{LeakyRelu}\left(\mathbf{S}\begin{bmatrix}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\\ \tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\end{bmatrix}\right),    (3)

where

\tilde{\boldsymbol{w}}\stackrel{\rm def}{=}\frac{\boldsymbol{\mu}}{\|\boldsymbol{\mu}\|},\quad\mathbf{S}\stackrel{\rm def}{=}\begin{bmatrix}1&1\\ -1&-1\\ 1&-1\\ -1&1\end{bmatrix},\quad\boldsymbol{r}\stackrel{\rm def}{=}R\cdot\begin{bmatrix}1\\ 1\\ -1\\ -1\end{bmatrix},    (4)

where $R>0$ is an arbitrary scaling parameter. The particular function $\Psi$ has been chosen such that it is able to classify the means of the XOR problem correctly, that is,

\operatorname{sign}(\Psi(\mathop{{\bf E}\/}[\mathbf{X}_{i}],\mathop{{\bf E}\/}[\mathbf{X}_{j}]))=\left\{\begin{array}{ll}1,&\mbox{if $(i,j)$ is an intra-class edge},\\ -1,&\mbox{if $(i,j)$ is an inter-class edge}.\end{array}\right.

To see this, we can take a look at the following simplified expression of $\Psi$, which is obtained by substituting the specifications of $\mathbf{S}$ and $\boldsymbol{r}$ from (4) into (3),

\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}{ll}-2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\leq-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ 2R(1-\beta)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i})\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j},&\mbox{if}~-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|<\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ 2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\geq\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|.\end{array}\right.    (5)

Then, by substituting $\tilde{\boldsymbol{w}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\|$ into (5), one easily verifies that

\Psi(\mathop{{\bf E}\/}[\mathbf{X}_{i}],\mathop{{\bf E}\/}[\mathbf{X}_{j}])=\left\{\begin{array}{ll}2R(1-\beta)\|\boldsymbol{\mu}\|,&\mbox{if $(i,j)$ is an intra-class edge},\\ -2R(1-\beta)\|\boldsymbol{\mu}\|,&\mbox{if $(i,j)$ is an inter-class edge}.\end{array}\right.    (6)
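As a sanity check of (6), the following sketch (our own illustration; the function name is ours) implements $\Psi$ from (3) and (4) and evaluates it at the class means.

```python
import numpy as np

def make_mlp_attention(mu, R=1.0, beta=0.2):
    """Two-layer MLP attention of (3)-(4): Psi(X_i, X_j) = r^T LeakyRelu(S [w~^T X_i, w~^T X_j])."""
    w = mu / np.linalg.norm(mu)                            # w~ = mu / ||mu||
    S = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
    r = R * np.array([1., 1., -1., -1.])

    def Psi(xi, xj):
        z = S @ np.array([w @ xi, w @ xj])
        return r @ np.where(z >= 0, z, beta * z)           # LeakyRelu with slope beta
    return Psi

mu = np.array([3.0, -4.0])                                 # ||mu|| = 5
R, beta = 1.0, 0.2
Psi = make_mlp_attention(mu, R, beta)
print(Psi(mu, mu), Psi(-mu, -mu))     # intra-class means: both equal  2*R*(1-beta)*||mu|| = 8.0
print(Psi(mu, -mu), Psi(-mu, mu))     # inter-class means: both equal -2*R*(1-beta)*||mu|| = -8.0
```

This Psi can be passed directly to the graph_attention_convolution sketch from Section 2.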

On the other hand, our assumption that $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$ guarantees that the distance between the means of the XOR problem is much larger than the standard deviation of the Gaussians, and thus with high probability there is no overlap between the distributions. More specifically, one can show that with high probability,

\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}{ll}2R(1-\beta)\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $(i,j)$ is an intra-class edge},\\ -2R(1-\beta)\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $(i,j)$ is an inter-class edge}.\end{array}\right.    (7)

This property guarantees that with high probability,

\operatorname{sign}(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))=\operatorname{sign}(\Psi(\mathop{{\bf E}\/}[\mathbf{X}_{i}],\mathop{{\bf E}\/}[\mathbf{X}_{j}])),\ \mbox{for all}\ (i,j)\in E,

which implies perfect separability of the edges. We state this result below in Theorem 3 and provide detailed arguments in the appendix.

Theorem 3.

Suppose that $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$. Then with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, the two-layer MLP attention architecture $\Psi$ given in (3) and (4) separates intra-class edges from inter-class edges.

Our analysis of edge separability by the attention architecture $\Psi$ has two important implications. First, edge separability by $\Psi$ implies a nice concentration result for the attention coefficients $\gamma_{ij}$ defined in (1). In particular, one can show that the attention coefficients have the desirable property of maintaining important edges and dropping unimportant edges. Second, the nice distinguishing power of the attention coefficients in turn implies exact recoverability of the node memberships using the graph attention convolution. We state these two results below in Corollary 4 and Corollary 5, respectively. Denote

\Psi^{\prime}\stackrel{\rm def}{=}(\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\cdot\Psi,

that is, $\Psi^{\prime}=\Psi$ if $p\geq q$ and $\Psi^{\prime}=-\Psi$ if $p<q$, where $\Psi$ is the two-layer MLP attention architecture given in (3) and (4). As will become clear later, the attention architecture $\Psi^{\prime}$ allows the attention coefficients to assign more weight to either intra-class or inter-class edges, depending on whether $p\geq q$ or $p<q$, and this will help us obtain a perfect separation of the nodes.

Corollary 4.

Suppose that $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$ and that Assumption 1 holds. Then with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, the attention architecture $\Psi^{\prime}$ yields attention coefficients $\gamma_{ij}$ such that

  1.

    If $p\geq q$, then $\gamma_{ij}=\frac{2}{np}(1\pm o(1))$ if $(i,j)$ is an intra-class edge and $\gamma_{ij}=o(\frac{1}{n(p+q)})$ otherwise;

  2.

    If $p<q$, then $\gamma_{ij}=\frac{2}{nq}(1\pm o(1))$ if $(i,j)$ is an inter-class edge and $\gamma_{ij}=o(\frac{1}{n(p+q)})$ otherwise.

Corollary 5.

Suppose that $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$ and that Assumption 1 holds. Then with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, using the attention architecture $\Psi^{\prime}$ with the graph attention convolution given in (2), where $f$ is set to be the identity function, the model separates the nodes.

Corollary 4 shows the desired behavior of the attention mechanism, namely that it is able to assign significantly large weights to important edges while it drops unimportant edges. When $p\geq q$, the attention mechanism maintains intra-class edges and essentially ignores all inter-class edges; when $p<q$, it maintains inter-class edges and essentially ignores all intra-class edges. We now explain the high-level idea of the proof and leave detailed arguments to Appendix A.2. Corollary 4 follows from the concentration of $\Psi(\mathbf{X}_{i},\mathbf{X}_{j})$ around $\Psi(\mathop{{\bf E}\/}[\mathbf{X}_{i}],\mathop{{\bf E}\/}[\mathbf{X}_{j}])$ given by (7). Assume for a moment that $p\geq q$ so that $\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q}=1$. Then $\Psi^{\prime}=\Psi$, and (7) implies that the value of $\exp(\Psi^{\prime}(\mathbf{X}_{i},\mathbf{X}_{j}))$ when $(i,j)$ is an intra-class edge is exponentially larger than the value of $\exp(\Psi^{\prime}(\mathbf{X}_{i},\mathbf{X}_{j}))$ when $(i,j)$ is an inter-class edge. Therefore, by the definition of the attention coefficients in (1), the denominator of $\gamma_{ij}$ is dominated by terms $(i,k)$ where $k$ is in the same class as $i$. Moreover, using the concentration of node degrees guaranteed by Assumption 1, each node $i$ is connected to $\frac{np}{2}(1\pm o(1))$ intra-class nodes. By appropriately setting the scaling parameter $R$ in (4), the values of $\Psi^{\prime}(\mathbf{X}_{i},\mathbf{X}_{k})$ for all intra-class edges $(i,k)$ can be made within an $o(1)$ multiplicative factor of each other. Therefore we get $\gamma_{ij}=\frac{2}{np}(1\pm o(1))$ when $(i,j)$ is an intra-class edge. A similar reasoning applies to inter-class edges and yields the vanishing value of $\gamma_{ij}$ when $(i,j)$ is an inter-class edge. Finally, the argument for $p<q$ follows analogously.

The concentration result of attention coefficients in Corollary 4 implies the node classification result in Corollary 5, which holds for any $p,q$ satisfying Assumption 1. That is, even when the graph structure is noisy, e.g., when $p\approx q$, it is still possible to exactly recover the node class memberships. Essentially, by applying Corollary 4 and carrying out some straightforward algebraic simplifications, one can show that with high probability, for all $i\in[n]$,

h^{\prime}_{i}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}=\left\{\begin{array}{ll}-(\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $i\in C_{0}$},\\ (\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $i\in C_{1}$},\end{array}\right.

and hence the model separates the nodes with high probability. We provide details in the appendix.

While Corollary 5 provides a positive result for graph attention, it can be shown that a simple linear classifier which does not use the graph at all achieves perfect node separability with high probability. In particular, the Bayes optimal classifier for the node features without the graph is able to separate the nodes with high probability. This means that in this regime, using the additional graph information is unnecessary, as it does not provide additional power compared to a simple linear classifier for the node classification task.

Lemma 6 (Section 6.4 in [3]).

Let $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$. Then the optimal Bayes classifier for $\mathbf{X}$ is realized by the linear classifier

h(\mathbf{X}_{i})=\begin{cases}0&\text{if }\boldsymbol{\mu}^{T}\mathbf{X}_{i}\leq 0\\ 1&\text{if }\boldsymbol{\mu}^{T}\mathbf{X}_{i}>0.\end{cases}    (8)

Proposition 7.

Suppose $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$. Then with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, the linear classifier given in (8) separates the nodes.

The proof of Proposition 7 is elementary. To see the claim one may show that the probability that the classifier in (8) misclassifies a node $i\in[n]$ is $o(1)$. To do this, let us fix $i\in[n]$ and write $\mathbf{X}_{i}=(2{\epsilon}_{i}-1)\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i}$ where $\boldsymbol{g}_{i}\sim N(0,\mathbf{I})$. Assume for a moment that ${\epsilon}_{i}=0$. Then the probability of misclassification is

\mathop{{\bf Pr}\/}\left[\boldsymbol{\mu}^{T}\mathbf{X}_{i}>0\right]=\mathop{{\bf Pr}\/}\left[\frac{\boldsymbol{\mu}^{T}\boldsymbol{g}_{i}}{\|\boldsymbol{\mu}\|}>\frac{\|\boldsymbol{\mu}\|}{\sigma}\right]=1-\Phi\left(\frac{\|\boldsymbol{\mu}\|}{\sigma}\right),

where $\Phi(\cdot)$ is the cumulative distribution function of $N(0,1)$ and the last equality follows from the fact that $\frac{\boldsymbol{\mu}^{T}\boldsymbol{g}_{i}}{\|\boldsymbol{\mu}\|}\sim N(0,1)$. The assumption that $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$ implies $\|\boldsymbol{\mu}\|\geq\sigma\sqrt{2\log n}$ for large enough $n$. Therefore, using standard tail bounds for the normal distribution [45] we have that

1-\Phi\left(\frac{\|\boldsymbol{\mu}\|}{\sigma}\right)\leq\frac{\sigma}{\sqrt{2\pi}\|\boldsymbol{\mu}\|}\exp\left(-\frac{\|\boldsymbol{\mu}\|^{2}}{2\sigma^{2}}\right)\leq\frac{n^{-1}}{\sqrt{4\pi\log n}}.

This means that the probability that there exists $i\in C_{0}$ which is misclassified is at most $\frac{1}{2\sqrt{4\pi\log n}}=o(1)$. A similar argument applies to the case where ${\epsilon}_{i}=1$, and a union bound over the events that some $i\in[n]$ is misclassified finishes the proof of Proposition 7.
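A minimal sketch of the classifier (8), together with an empirical check in the spirit of Proposition 7 (our own illustration; it reuses the hypothetical sample_csbm helper sketched in Section 2):

```python
import numpy as np

def bayes_linear_classifier(X, mu):
    """Classifier (8): predict class 1 iff mu^T X_i > 0; the graph A is ignored."""
    return (X @ mu > 0).astype(int)

# Easy regime: ||mu|| = 3 * sigma * sqrt(log n), so (8) should classify every node correctly.
n, sigma = 1000, 1.0
mu = 3.0 * sigma * np.sqrt(np.log(n) / 2) * np.ones(2)     # ||mu|| = 3 * sigma * sqrt(log n)
X, A, eps = sample_csbm(n, p=0.5, q=0.1, mu=mu, sigma=sigma)
print(np.mean(bayes_linear_classifier(X, mu) == eps))       # typically prints 1.0
```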

3.2 “Hard Regime”

In this regime ($\|\boldsymbol{\mu}\|=\kappa\sigma$ for $\kappa\leq O(\sqrt{\log n})$), we show that every attention architecture $\Psi$ fails to separate the edges if $\kappa<\sqrt{2\log n}$, and we conjecture that no graph attention convolution is able to separate the nodes. The conjecture is based on our node separability result in Section 3.2.1, which says that even if we assume that there is an attention architecture which is able to separate the edges independently from node features, the corresponding graph attention convolution still fails to (almost) perfectly classify the nodes with high probability.

The goal of the attention mechanism is to decide whether an edge $(i,j)$ is an inter-class edge or an intra-class edge based on the node feature vectors $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$. Let $\mathbf{X}^{\prime}_{ij}$ denote the vector obtained from concatenating $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$, that is,

\mathbf{X}^{\prime}_{ij}\stackrel{\rm def}{=}\begin{pmatrix}\mathbf{X}_{i}\\ \mathbf{X}_{j}\end{pmatrix}.    (9)

We would like to analyze every classifier $h^{\prime}$ which takes as input $\mathbf{X}^{\prime}_{ij}$ and tries to separate inter-class edges from intra-class edges. An ideal classifier would have the property

y=h^{\prime}(\mathbf{X}^{\prime}_{ij})=\left\{\begin{array}{ll}0,&\mbox{if $(i,j)$ is an inter-class edge},\\ 1,&\mbox{if $(i,j)$ is an intra-class edge}.\end{array}\right.    (10)

To understand the limitations of all such classifiers in this regime, it suffices to consider the Bayes optimal classifier for this data model, whose probability of misclassifying an arbitrary edge lower bounds that of every attention architecture which takes as input $(\mathbf{X}_{i},\mathbf{X}_{j})$. Consequently, by deriving a misclassification rate for the Bayes classifier, we obtain a lower bound on the misclassification rate of every attention mechanism $\Psi$ for classifying intra-class and inter-class edges. The following Lemma 8 describes the Bayes classifier for this classification task.

Lemma 8.

Let $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$ and let $\mathbf{X}^{\prime}_{ij}$ be defined as in (9). The Bayes optimal classifier for $\mathbf{X}^{\prime}_{ij}$ is realized by the following function,

h^{*}(\boldsymbol{x})=\left\{\begin{array}{ll}0,&\text{if}\ p\cosh\left(\frac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}\right)\leq q\cosh\left(\frac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}\right),\\ 1,&\text{otherwise},\end{array}\right.    (11)

where $\boldsymbol{\mu}^{\prime}\stackrel{\rm def}{=}\begin{pmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{pmatrix}$ and $\boldsymbol{\nu}^{\prime}\stackrel{\rm def}{=}\begin{pmatrix}\boldsymbol{\mu}\\ -\boldsymbol{\mu}\end{pmatrix}$.
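For reference, a direct sketch of $h^{*}$ from (11) (our own illustration; overflow of cosh for large arguments is ignored for simplicity):

```python
import numpy as np

def bayes_edge_classifier(xi, xj, mu, sigma, p, q):
    """h* from (11): returns 1 (intra-class) iff p*cosh(x^T mu'/sigma^2) > q*cosh(x^T nu'/sigma^2)."""
    x = np.concatenate([xi, xj])          # X'_ij = (X_i, X_j)
    mu_p = np.concatenate([mu, mu])       # mu' = (mu,  mu)
    nu_p = np.concatenate([mu, -mu])      # nu' = (mu, -mu)
    return int(p * np.cosh(x @ mu_p / sigma**2) > q * np.cosh(x @ nu_p / sigma**2))
```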

Using Lemma 8, we can lower bound the rate of misclassification of edges that every attention mechanism $\Psi$ exhibits. Below we define $\Phi_{\mathrm{c}}\stackrel{\rm def}{=}1-\Phi$, where $\Phi$ is the cumulative distribution function of $N(0,1)$.

Theorem 9.

Suppose $\|\boldsymbol{\mu}\|=\kappa\sigma$ for some $\kappa>0$ and let $\Psi$ be any attention mechanism. Then,

  1.

    With probability at least $1-o(1)$, $\Psi$ fails to correctly classify at least a $2\Phi_{\mathrm{c}}(\kappa)^{2}$ fraction of inter-class edges;

  2.

    For any $K>0$, if $q>\frac{K\log^{2}n}{n\Phi_{\rm c}(\kappa)^{2}}$, then with probability at least $1-O(n^{-8K\Phi_{\rm c}(\kappa)^{2}\log n})$, $\Psi$ misclassifies at least one inter-class edge.

Part 1 of Theorem 9 implies that if $\|\boldsymbol{\mu}\|$ is linear in the standard deviation $\sigma$, that is, if $\kappa=O(1)$, then with overwhelming probability the attention mechanism fails to distinguish a constant fraction of inter-class edges from intra-class edges. Furthermore, part 2 of Theorem 9 characterizes a regime for the inter-class edge probability $q$ where the attention mechanism fails to distinguish at least one inter-class edge. It provides a lower bound on $q$ in terms of the scale at which the distance between the means grows compared to the standard deviation $\sigma$. This aligns with the intuition that as we increase the distance between the means, it gets easier for the attention mechanism to correctly distinguish inter-class and intra-class edges. However, if $q$ is also increased in the right proportion, in other words, if the noise in the graph is increased, then the attention mechanism will still fail to correctly distinguish at least one inter-class edge. For instance, for $\kappa=\sqrt{2\log\log n}$ and $K=\log^{2}n$, we get that if $q>\Omega(\frac{\log^{6+o(1)}n}{n})$, then with probability at least $1-o(1)$, $\Psi$ misclassifies at least one inter-class edge.

The proof of Theorem 9 relies on analyzing the behavior of the Bayes optimal classifier in (11). We compute an upper bound on the probability with which the optimal classifier correctly classifies an arbitrary inter-class edge. Then the proof of part 1 of Theorem 9 follows from a concentration argument for the fraction of inter-class edges that are misclassified by the optimal classifier. For part 2, we use a similar concentration argument to choose a suitable threshold for $q$ that forces the optimal classifier to fail on at least one inter-class edge. We provide formal arguments in the appendix.

As a motivating example of how the attention mechanism fails and how exactly the attention coefficients behave in this regime, we focus on one of the most popular attention architectures [44], where $\alpha$ is a single-layer neural network parameterized by $(\boldsymbol{w},\boldsymbol{a},b)\in\mathbb{R}^{d}\times\mathbb{R}^{2}\times\mathbb{R}$ with the $\mathrm{LeakyRelu}$ activation function. Namely, the attention coefficients are defined by

\gamma_{ij}\stackrel{\rm def}{=}\frac{\exp\left(\mathrm{LeakyRelu}\left({\boldsymbol{a}}^{T}\begin{bmatrix}\boldsymbol{w}^{T}\mathbf{X}_{i}\\ \boldsymbol{w}^{T}\mathbf{X}_{j}\end{bmatrix}+b\right)\right)}{\sum_{\ell\in N_{i}}\exp\left(\mathrm{LeakyRelu}\left({\boldsymbol{a}}^{T}\begin{bmatrix}\boldsymbol{w}^{T}\mathbf{X}_{i}\\ \boldsymbol{w}^{T}\mathbf{X}_{\ell}\end{bmatrix}+b\right)\right)}.    (12)
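A sketch of this specific single-layer scoring function (ours, not the paper's code); it can be plugged into the graph_attention_convolution sketch of Section 2 as Psi:

```python
import numpy as np

def make_single_layer_attention(w, a, b, beta=0.2):
    """Single-layer attention of (12): score(X_i, X_j) = LeakyRelu(a^T [w^T X_i, w^T X_j] + b)."""
    def Psi(xi, xj):
        s = a[0] * (w @ xi) + a[1] * (w @ xj) + b
        return s if s >= 0 else beta * s               # LeakyRelu
    return Psi
```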

We show that, as a consequence of the inability of the attention mechanism to distinguish intra-class and inter-class edges, with overwhelming probability most of the attention coefficients $\gamma_{ij}$ given by (12) are going to be $\Theta(1/|N_{i}|)$. In particular, Theorem 10 says that for the vast majority of nodes in the graph, the attention coefficients on most edges are uniform irrespective of whether the edge is inter-class or intra-class. As a result, the attention mechanism is unable to assign higher weights to important edges and lower weights to unimportant edges.

Theorem 10.

Assume that $\|\boldsymbol{\mu}\|\leq K\sigma$ and $\sigma\leq K^{\prime}$ for some absolute constants $K$ and $K^{\prime}$. Moreover, assume that the parameters $(\boldsymbol{w},\boldsymbol{a},b)\in\mathbb{R}^{d}\times\mathbb{R}^{2}\times\mathbb{R}$ are bounded. Then, with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, there exists a subset $\mathcal{A}\subseteq[n]$ with cardinality at least $n(1-o(1))$ such that for all $i\in\mathcal{A}$ the following hold:

  1.

    There is a subset $J_{i,0}\subseteq N_{i}\cap C_{0}$ with cardinality at least $\frac{9}{10}|N_{i}\cap C_{0}|$, such that $\gamma_{ij}=\Theta(1/|N_{i}|)$ for all $j\in J_{i,0}$.

  2.

    There is a subset $J_{i,1}\subseteq N_{i}\cap C_{1}$ with cardinality at least $\frac{9}{10}|N_{i}\cap C_{1}|$, such that $\gamma_{ij}=\Theta(1/|N_{i}|)$ for all $j\in J_{i,1}$.

Theorem 10 is proved by carefully computing the numerator and the denominator in (12). In this regime, $\|\boldsymbol{\mu}\|$ is not much larger than $\sigma$, that is, the signal does not dominate the noise, so the numerator in (12) is not indicative of the class memberships of nodes $i,j$ but rather acts like Gaussian noise. On the other hand, denote the denominator in (12) by $\delta_{i}$ and observe that it is the same for all $\gamma_{i\ell}$ where $\ell\in N_{i}$. Using concentration arguments about $\{\boldsymbol{w}^{T}\mathbf{X}_{\ell}\}_{\ell\in[n]}$ yields $\gamma_{ij}=\Theta(1/\delta_{i})$, and $\delta_{i}=\Theta(|N_{i}|)$ finishes the proof. We provide details in the appendix.

Compared to the easy regime, it is difficult to obtain a separation result for the nodes without additional assumptions. In the easy regime, the distance between the means was much larger than the standard deviation, which made the “signal” (the expectation of the convolved data) dominate the “noise” (i.e., the variance of the convolved data). In the hard regime the “noise” dominates the “signal”. Thus, we conjecture the following.

Conjecture 11.

There is an absolute constant $M>0$ such that, whenever $\|\boldsymbol{\mu}\|\leq M\cdot\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\cdot\frac{p+q}{|p-q|}$, every graph attention model fails to perfectly classify the nodes with high probability.

The above conjecture means that in the hard regime, the performance of the graph attention model depends on $q$, as opposed to the easy regime, where Corollary 5 shows that it does not. This property is verified by our synthetic experiments in Section 5. The quantity $\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}$ in the threshold comes from our conjecture that the expected maximum “noise” of the graph attention convolved data over the nodes is at least $c\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}$ for some constant $c>0$. The quantity $\frac{p+q}{|p-q|}$ in the threshold comes from our conjecture that the distance between the means (i.e., the “signal”) of the graph attention convolved data is reduced to at most $|p-q|/(p+q)$ of the original distance. Proving Conjecture 11 would require a delicate treatment of the correlations between the attention coefficients $\gamma_{ij}$ and the node features $\mathbf{X}_{i}$ for $i\in[n]$.

3.2.1 Are good attention coefficients helpful in the “hard regime”?

In this subsection we are interested in understanding the implications of edge separability for node separability in the hard regime, when $\Psi$ is restricted to a specific class of functions. In particular, we show that Conjecture 11 is true under the additional assumption that $\Psi$ does not depend on the node features. In addition, we show that even if we were allowed to use an “extremely good” attention function $\tilde{\Psi}$ which separates the edges with an arbitrarily large margin, with high probability the graph attention convolution (2) will still misclassify at least one node as long as $\|\boldsymbol{\mu}\|/\sigma$ is sufficiently small.

We consider the class of functions $\tilde{\Psi}$ which can be expressed in the following form:

\tilde{\Psi}(i,j)=\left\{\begin{array}{ll}\operatorname{sign}(p-q)t,&\mbox{if $(i,j)$ is an intra-class edge},\\ -\operatorname{sign}(p-q)t,&\mbox{if $(i,j)$ is an inter-class edge},\end{array}\right.    (13)

for some $t\geq 0$. The particular class of functions in (13) is motivated by the property of the ideal edge classifier in (10) and the behavior of $\Psi$ in (6) when it is applied to the means of the Gaussians. There are a few possible ways to obtain a function $\tilde{\Psi}$ which satisfies (13). For example, in the presence of good edge features which reflect the class memberships of the edges, we can make $\tilde{\Psi}$ take the edge features as input. Moreover, if $|\sqrt{p}-\sqrt{q}|>\sqrt{2\log n/n}$, one such $\tilde{\Psi}$ may be easily realized from the eigenvectors of the graph adjacency matrix. By the exact spectral recovery result in Lemma 12, we know that there exists a classifier $\hat{\tau}$ which separates the nodes. Therefore, we can set $\tilde{\Psi}(i,j)=\operatorname{sign}(p-q)t$ if $\hat{\tau}(i)=\hat{\tau}(j)$ and $\tilde{\Psi}(i,j)=-\operatorname{sign}(p-q)t$ otherwise.

Lemma 12 (Exact recovery in [1]).

Suppose that $p,q=\Omega(\log^{2}n/n)$ and $|\sqrt{p}-\sqrt{q}|>\sqrt{2\log n/n}$. Then there exists a classifier $\hat{\tau}$ which takes as input the graph $\mathbf{A}$ and perfectly classifies the nodes with probability at least $1-o(1)$.

Theorem 13.

Suppose that $p,q$ satisfy Assumption 1 and that $p,q$ are bounded away from 1. For every $\epsilon>0$, there are absolute constants $M,M^{\prime}=O(\sqrt{\epsilon})$ such that, with probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, using the graph attention convolution in (2) and the attention architecture $\tilde{\Psi}$ in (13), the model misclassifies at least $\Omega(n^{1-\epsilon})$ nodes for any $\boldsymbol{w}$ such that $\|\boldsymbol{w}\|=1$, if

  1.

    $t=O(1)$ and $\|\boldsymbol{\mu}\|\leq M\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\frac{p+q}{|p-q|}$;

  2.

    $t=\omega(1)$ and $\|\boldsymbol{\mu}\|\leq M^{\prime}\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}$.

Theorem 13 warrants some discussion. We start with the role of $t$ in the attention function (13). One may think of $t$ as the multiplicative margin of separation between intra-class and inter-class edges. When $t=O(1)$, the margin of separation is at most a constant. This includes the special case when $\tilde{\Psi}(i,j)=0$ for all $(i,j)\in E$, i.e., the margin of separation is 0. In this case, the graph attention convolution in (2) reduces to the standard graph convolution with uniform averaging among the neighbors. Therefore, part 1 of Theorem 13 also applies to the standard graph convolution. On the other hand, when $t=\omega(1)$, the margin of separation is not only bounded away from 0 but also grows with $n$.

Next, we discuss the additional assumption that $p,q$ are bounded away from 1. This assumption is used to obtain a concentration result required for the proof of Theorem 13. It is also intuitive in the following sense. If both $p$ and $q$ are arbitrarily close to 1, then after the convolution the convolved node feature vectors collapse to approximately a single point, and thus this becomes a trivial case where no classifier is able to separate the nodes; on the other hand, if $p$ is arbitrarily close to 1 and $q$ is very small, then after the convolution the convolved node feature vectors collapse to approximately one of two points according to which class the node comes from, and in this case the nodes can be easily separated by a linear classifier.

We now focus on the threshold for $\|\boldsymbol{\mu}\|$ under which the model is going to misclassify at least one node with high probability. In part 1 of Theorem 13, $t=O(1)$, i.e., the attention mechanism $\tilde{\Psi}$ is either unable to separate the edges or unable to separate them with a large enough margin. In this case, one can show that all attention coefficients are $\Theta(\frac{1}{n(p+q)})$. Consequently, the quantity $|p-q|$ appears in the denominator of the threshold for $\|\boldsymbol{\mu}\|$ in part 1 of Theorem 13. Because of that, if $p$ and $q$ are arbitrarily close, then the model is not able to separate the nodes irrespective of how large $\|\boldsymbol{\mu}\|$ is. For example, treating $1-\max(p,q)$ as a constant since $p$ and $q$ are bounded away from 1 by assumption, we have that

|p-q|=o\left(\sqrt{\frac{p+q}{n}}\right)\ \mbox{implies}\ M\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\frac{p+q}{|p-q|}=\omega(\sigma\sqrt{\log n}).

This means that if $p$ and $q$ are close enough, every attention function $\tilde{\Psi}$ of the form (13) with $t=O(1)$ cannot help classify all nodes correctly even if $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$. On the contrary, recall that in the easy regime where $\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n})$, the attention mechanism given in (3) and (4) helps separate the nodes with high probability. This illustrates the limitation of every attention mechanism of the form (13) which has an insignificant margin of separation. According to Theorem 10, the vast majority of attention coefficients are uniform, and thus in Conjecture 11 we expect that graph attention in general shares similar limitations in the hard regime.

In part 2 of Theorem 13, t=ω(1)t=\omega(1), i.e., the attention mechanism Ψ~\tilde{\Psi} separates the edges with a large margin. In this case, one can show that the attention coefficients on important edges (e.g. intra-class edges) are exponentially larger than those on unimportant edges (e.g. inter-class edges). Consequently, the factor (p+q)/|pq|(p+q)/|p-q| no longer appears in the threshold for 𝝁\|\boldsymbol{\mu}\| in part 2 of Theorem 13. However, the threshold also implies that, even with a perfect attention mechanism that separates the edges with a large margin, as long as 𝝁/σ\|\boldsymbol{\mu}\|/\sigma is small enough the model is going to misclassify at least one node with high probability.

Finally, notice that in Theorem 13 the parameter ϵ\epsilon captures a natural tradeoff between the threshold for 𝝁\|\boldsymbol{\mu}\| and the lower bound on the number of misclassified nodes. Namely, the smaller ϵ\epsilon is, the smaller the threshold for 𝝁\|\boldsymbol{\mu}\| becomes; that is, the less signal there is in the node features, the more mistakes the model is going to make. This is precisely demonstrated by the scaling of M,M=O(ϵ)M,M^{\prime}=O(\sqrt{\epsilon}) and of the misclassification bound Ω(n1ϵ)\Omega(n^{1-\epsilon}) with respect to ϵ\epsilon. We leave the proof of Theorem 13 to the appendix.
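To make the two regimes of Theorem 13 concrete, the following short Python sketch (ours, for illustration only) evaluates the two thresholds on the norm of the mean for a given choice of n, p, q and sigma; the constants M and M' are unspecified in the theorem, so we simply set them to 1.

import numpy as np

def threshold_part1(n, p, q, sigma, M=1.0):
    # Threshold on ||mu|| from part 1 of Theorem 13 (t = O(1)); M is a stand-in constant.
    base = sigma * np.sqrt(np.log(n) / (n * (p + q)) * (1 - max(p, q)))
    return M * base * (p + q) / abs(p - q)

def threshold_part2(n, p, q, sigma, M=1.0):
    # Threshold on ||mu|| from part 2 of Theorem 13 (t = omega(1)).
    return M * sigma * np.sqrt(np.log(n) / (n * (p + q)) * (1 - max(p, q)))

n, sigma = 10_000, 1.0
for p, q in [(0.5, 0.49), (0.5, 0.1)]:
    print((p, q), threshold_part1(n, p, q, sigma), threshold_part2(n, p, q, sigma))

As the output shows, when p and q are close the first threshold exceeds the second exactly by the factor (p+q)/|p-q| discussed above.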

4 Robustness to structural noise and its implications beyond perfect node classification

In this section, we provide a positive result on the capacity of graph attention convolution for node classification beyond the perfect classification regime. In particular, we show that independent of the parameters of CSBM, i.e., independent of pp, qq, 𝝁\boldsymbol{\mu} and σ\sigma, the two-layer MLP attention architecture Ψ\Psi from Section 3.1 is able to achieve the classification performance obtainable by the Bayes optimal classifier for node features. This is proved by showing that there is a parameter setting for Ψ\Psi where the attention coefficient on self-loops can be made exponentially large. Consequently, the corresponding graph attention convolution behaves like a linear function of node features. We provide two corollaries of this result. The first corollary provides a ranking of graph attention convolution, simple graph convolution, and linear function in terms of their ability to classify nodes in CSBM. More specifically, by noticing that the simple graph convolution is also realized by a specific parameter setting of the attention architecture Ψ\Psi, our result implies that the performance of graph attention convolution for node classification in CSBM is lower bounded by the best possible performance between a linear classifier and simple graph convolution. In particular, graph attention is strictly more powerful than simple graph convolution when the graph is noisy (e.g. when pqp\approx q), and it is strictly more powerful than a linear classifier when the graph is less noisy (e.g. when pp and qq are significantly different). The second corollary provides perfect classification, almost perfect classification, and partial classification results for graph attention convolution. It follows immediately from the reduction of graph attention convolution to a linear function under the specification of Ψ\Psi that we will discuss. In what follows we start with high-level ideas, then we present formal statements of the results, and we discuss the implications in detail.

Recall that the two-layer MLP attention architecture Ψ\Psi from (3) and (4) is equivalently written in (5) as

Ψ(𝐗i,𝐗j)={2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|,2R(1β)sign(𝒘~T𝐗i)𝒘~T𝐗j,if|𝒘~T𝐗i|<𝒘~T𝐗j<|𝒘~T𝐗i|,2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|.\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}[]{ll}-2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\leq-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i})\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j},&\mbox{if}~{}-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|<\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\geq\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|.\end{array}\right.

We make the following observations. Assuming that the scaling parameter R>0R>0,

  • If 𝒘~T𝐗i>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0, then the function Ψ\Psi assigns more weight to an edge (i,j)(i,j) such that 𝒘~T𝐗j>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0 than an edge (i,j)(i,j^{\prime}) such that 𝒘~T𝐗j<0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j^{\prime}}<0;

  • If 𝒘~T𝐗i<0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}<0, then the function Ψ\Psi assigns more weight to an edge (i,j)(i,j) such that 𝒘~T𝐗j<0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<0 than an edge (i,j)(i,j^{\prime}) such that 𝒘~T𝐗j>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j^{\prime}}>0;

  • If 𝒘~T𝐗i=0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}=0, then the function Ψ\Psi assigns uniform weight to every edge (i,j)(i,j).

This means that the behavior of Ψ\Psi depends on which side of the hyperplane {𝒙:𝒘~T𝒙=0}\{\boldsymbol{x}:\tilde{\boldsymbol{w}}^{T}\boldsymbol{x}=0\} the feature vector 𝐗i\mathbf{X}_{i} lies on. In other words, for fixed 𝐗j\mathbf{X}_{j}, whether the attention function Ψ\Psi will up-weight or down-weight an edge (i,j)(i,j) depends entirely on the classification of 𝐗i\mathbf{X}_{i} by the linear classifier 𝒘~\tilde{\boldsymbol{w}}. Moreover, note that the attention function value satisfies

2R(1β)max{|𝒘~T𝐗i|,|𝒘~T𝐗j|}Ψ(𝐗i,𝐗j)2R(1β)min{|𝒘~T𝐗i|,|𝒘~T𝐗j|}.2R(1-\beta)\cdot\max\{-|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}|,-|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}|\}\leq\Psi(\mathbf{X}_{i},\mathbf{X}_{j})\leq 2R(1-\beta)\cdot\min\{|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}|,|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}|\}.

Therefore, out of all unit norm vectors 𝒘\boldsymbol{w}, our choice 𝒘~=𝝁/𝝁\tilde{\boldsymbol{w}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\| maximizes the range of values that Ψ\Psi can output. Recall from Lemma 6 that 𝒘~\tilde{\boldsymbol{w}} also happens to characterize the Bayes optimal classifier for the node features. Finally, the attention function Ψ\Psi achieves minimum/maximum at self-attention, i.e.

Ψ(𝐗i,𝐗i)\displaystyle\Psi(\mathbf{X}_{i},\mathbf{X}_{i}) =2R(1β)|𝒘~T𝐗i|=maxj[n]Ψ(𝐗i,𝐗j),\displaystyle=2R(1-\beta)|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}|=\max_{j\in[n]}\Psi(\mathbf{X}_{i},\mathbf{X}_{j}),
Ψ(𝐗i,𝐗i)\displaystyle\Psi(\mathbf{X}_{i},-\mathbf{X}_{i}) =2R(1β)|𝒘~T𝐗i|=minj[n]Ψ(𝐗i,𝐗j).\displaystyle=-2R(1-\beta)|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}|=\min_{j\in[n]}\Psi(\mathbf{X}_{i},\mathbf{X}_{j}).

A consequence of these observations is that, by setting the scaling parameter RR sufficiently large, one can make exp(Ψ(𝐗i,𝐗j))\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j})) exponentially larger than exp(Ψ(𝐗i,𝐗j))\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j^{\prime}})) for any j,jj,j^{\prime} such that sign(𝒘~T𝐗j)=sign(𝒘~T𝐗i)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j})=\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}) and sign(𝒘~T𝐗j)sign(𝒘~T𝐗i)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j^{\prime}})\neq\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}). According to the definition of attention coefficients in (1), this means that the attention coefficients γij\gamma_{ij} where sign(𝒘~T𝐗j)=sign(𝒘~T𝐗i)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j})=\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}) are going to be exponentially larger than the attention coefficients γij\gamma_{ij^{\prime}} where sign(𝒘~T𝐗j)sign(𝒘~T𝐗i)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j^{\prime}})\neq\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}). Therefore, one could expect that in this case, if the linear classifier 𝒘~\tilde{\boldsymbol{w}} correctly classifies 𝐗i\mathbf{X}_{i} for some i[n]i\in[n], e.g., for iC1i\in C_{1} this means that 𝒘~T𝐗i>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0, then graph attention convolution output hi=jNiγij𝒘~T𝐗jh_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j} also satisfies hi>0h_{i}^{\prime}>0, due to sufficiently larger attention coefficients γij\gamma_{ij} for which 𝒘~T𝐗j>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0. We state the result below in Theorem 14 and leave detailed arguments in the appendix.
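To illustrate this mechanism, the following Python sketch (ours) implements Ψ exactly as written in (5) and then forms attention coefficients by a softmax over the neighborhood, which is the normalization we assume (1) performs; the function names, the default value of beta, and the neighborhood representation are illustrative choices on our part.

import numpy as np

def psi(xi, xj, w_tilde, R, beta=0.5):
    # Two-layer MLP attention Psi, written case by case as in (5);
    # beta is the parameter appearing in (5), set to an arbitrary value in (0, 1) here.
    si, sj = w_tilde @ xi, w_tilde @ xj
    if sj <= -abs(si):
        return -2 * R * (1 - beta) * si
    if sj >= abs(si):
        return 2 * R * (1 - beta) * si
    return 2 * R * (1 - beta) * np.sign(si) * sj

def attention_coefficients(i, neighbors, X, w_tilde, R, beta=0.5):
    # Softmax of Psi over the neighborhood of node i; we assume this is the
    # normalization performed in (1).
    scores = np.array([psi(X[i], X[j], w_tilde, R, beta) for j in neighbors])
    scores = scores - scores.max()      # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

Increasing R makes the coefficients of neighbors j with sign(w̃ᵀX_j) = sign(w̃ᵀX_i) exponentially dominate the remaining ones, which is exactly the behavior exploited in Theorem 14 below.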

Theorem 14.

With probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), using the two-layer MLP attention architecture Ψ\Psi given in (3) and (4) with R=Ω(nlog2n/σ)R=\Omega(n\log^{2}n/\sigma), the graph attention convolution output satisfies

hi=jNiγij𝒘~T𝐗j>0if and only if𝒘~T𝐗i>0,i[n],\displaystyle h_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0\;\mbox{if and only if}\;\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0,\;\forall i\in[n],
hi=jNiγij𝒘~T𝐗j<0if and only if𝒘~T𝐗i<0,i[n].\displaystyle h_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<0\;\mbox{if and only if}\;\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}<0,\;\forall i\in[n].

Theorem 14 means that there is a parameter setting for the attention architecture Ψ\Psi such that the performance of graph attention convolution matches the performance of the Bayes optimal classifier for the node features. This shows the ability of graph attention to “ignore” the graph structure, which can be beneficial when the graph is noisy. For example, if p=qp=q, then simple graph convolution completely mixes up the node features, making it impossible to achieve any meaningful node classification. On the other hand, as long as there is some signal in the original node features, i.e. 𝝁>0\|\boldsymbol{\mu}\|>0, graph attention is able to pick it up and classify the nodes at least as well as the best classifier for the node features alone. It is also worth noting that by setting R=0R=0, the attention function Ψ\Psi has a constant value, and hence graph attention convolution reduces to the standard graph convolution, which has been shown to be useful in the regime where the original node features are not very strong but the graph has a nice structure [7]. For example, when there is a significant gap between pp and qq, |pq|/(p+q)=Ω(1)|p-q|/(p+q)=\Omega(1), setting R=0R=0 could significantly improve the node separability threshold over the best linear classifier [7]. This shows the robustness of graph attention against noise in one of the two sources of information, namely node features and edge connectivities. Unlike the Bayes optimal classifier for node features, which is sensitive to feature noise, or the simple graph convolution, which is sensitive to structural noise, one can expect graph attention to work as long as at least one of the two sources of information has a good signal. Therefore, we obtain a ranking of node classification models among graph attention convolution, simple graph convolution, and a linear classifier. We state this below in Corollary 15.

Corollary 15.

The node classification performance obtainable by graph attention convolution is lower bounded by the best possible node classification performance between a simple graph convolution and a linear classifier.

By a straightforward characterization of the performance of the linear classifier 𝒘~=𝝁/𝝁\tilde{\boldsymbol{w}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\|, we immediately obtain the following classification results in Corollary 16. Recall that we denoted Φ\Phi as the cumulative distribution function of the standard normal distribution. Write

𝝁=κσfor someκ>0.\|\boldsymbol{\mu}\|=\kappa\sigma\;\ \mbox{for some}\;\ \kappa>0.
Corollary 16.

With probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), using the two-layer MLP attention architecture Ψ\Psi given in (3) and (4) with R=Ω(nlog2n/σ)R=\Omega(n\log^{2}n/\sigma), one has that

  • (Perfect classification) If κ2logn\kappa\geq\sqrt{2\log n} then all nodes are correctly classified;

  • (Almost perfect classification) If κ=ω(1)\kappa=\omega(1) then at least 1o(1)1-o(1) fraction of all nodes are correctly classified;

  • (Partial classification) If κ=O(1)\kappa=O(1) then at least Φ(κ)o(1)\Phi(\kappa)-o(1) fraction of all nodes are correctly classified.

Interestingly, we note that the perfect classification result in Corollary 16 is nearly equivalent (up to a small order in the threshold κ\kappa) to the perfect classification result in Corollary 5 from Section 3.1. They are obtained from different behaviors of the attention coefficients. This shows that there could be more than one type of “good” attention coefficients that are able to deliver good node classification performance.
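As a sanity check of the partial classification rate Φ(κ) in Corollary 16, the following simulation (ours, with illustrative parameter choices) classifies the raw Gaussian node features with the linear classifier w̃ = μ/‖μ‖ and compares the empirical accuracy with Φ(κ).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d, sigma, kappa = 10_000, 50, 1.0, 1.0
mu = np.zeros(d)
mu[0] = kappa * sigma                       # so that ||mu|| = kappa * sigma
labels = rng.integers(0, 2, size=n)
signs = 2 * labels - 1                      # -1 for C_0, +1 for C_1
X = signs[:, None] * mu + sigma * rng.standard_normal((n, d))
w_tilde = mu / np.linalg.norm(mu)
pred = (X @ w_tilde > 0).astype(int)
print("empirical accuracy:", (pred == labels).mean())
print("Phi(kappa):        ", norm.cdf(kappa))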

5 Experiments

In this section, we demonstrate empirically our results on synthetic and real data. The parameters of the models that we experiment with are set by using an ansatz based on our theorems. The particular details are given in Section 5.1. For real datasets, we use the default splits which come from PyTorch Geometric [19] and OGB [25]. In all our experiments we use MLP-GAT, where the attention mechanism Ψ\Psi is set to be the two-layer network in (3) and (4) with R=1R=1. For synthetic experiments using CSBM with known pp and qq, we use the variant that takes p,qp,q into account, Ψ=(𝟏pq𝟏p<q)Ψ\Psi^{\prime}=(\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\Psi. In Figure 2(b) and Figure 3(b) we additionally consider the original GAT architecture of [44] to demonstrate Theorem 10.

5.1 Ansatz for GAT, MLP-GAT and GCN

For the original GAT architecture we fix 𝒘=𝝁/𝝁\boldsymbol{w}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\| and define the first head as 𝒂1=12(1,1)\boldsymbol{a}_{1}=\frac{1}{\sqrt{2}}(1,1) and b1=12𝒘T𝝁b_{1}=-\frac{1}{\sqrt{2}}\boldsymbol{w}^{T}\boldsymbol{\mu}; the second head is defined as 𝒂2=𝒂1\boldsymbol{a}_{2}=-\boldsymbol{a}_{1} and b2=b1b_{2}=-b_{1}. We now discuss the choice of this ansatz. The parameter 𝒘\boldsymbol{w} is picked based on the optimal Bayes classifier without a graph, and the attention is set such that the first head maintains intra-class edges in C1C_{1} and the second head maintains intra-class edges in C0C_{0}. Note that because the attention mechanism of the original GAT [44] consists of just one layer (i.e. a nonlinear activation applied to a linear transformation, see (12)), it is not possible for the original GAT to keep only the γij\gamma_{ij} which correspond to intra-class edges. More specifically, one may use the same techniques as in the proof of Theorem 3 and Corollaries 4 and 5 to prove node separability results for the original GAT. In this particular case, the result will depend on qq, in contrast to the result we get for MLP-GAT, where no dependence on qq was needed. For MLP-GAT we use the ansatz Ψ=(𝟏pq𝟏p<q)Ψ\Psi^{\prime}=(\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\Psi where Ψ\Psi is given in (3) and (4) with R=1R=1. This choice of two-layer network allows us to bypass the “XOR problem” [38] and separate inter-class from intra-class edges as shown in Theorem 3. Note that no single-layer architecture is able to separate the edges due to the “XOR problem”. For GCN we used the ansatz from [7], which is also 𝒘=𝝁/𝝁\boldsymbol{w}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\|.
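The following sketch (ours) records these ansatz choices in code; the score form used for the original GAT is only an assumption on our part, namely a LeakyReLU applied to a linear function of (wᵀX_i, wᵀX_j), as suggested by the single-layer description of (12).

import numpy as np

def gat_ansatz(mu):
    # Ansatz for the two heads of the original GAT described in Section 5.1.
    w = mu / np.linalg.norm(mu)
    a1 = np.array([1.0, 1.0]) / np.sqrt(2.0)
    b1 = -(w @ mu) / np.sqrt(2.0)
    a2, b2 = -a1, -b1                       # the second head mirrors the first
    return w, (a1, b1), (a2, b2)

def gat_score(xi, xj, w, a, b, slope=0.2):
    # Single-layer attention score; the exact form of (12) is assumed here to be
    # a LeakyReLU applied to a linear function of (w^T X_i, w^T X_j).
    s = a @ np.array([w @ xi, w @ xj]) + b
    return s if s > 0 else slope * s

def mlp_gat_sign(p, q):
    # Sign flip used for MLP-GAT on CSBM with known p and q: Psi' = +/- Psi.
    return 1.0 if p >= q else -1.0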

5.2 Synthetic data

We use the CSBM to generate the data. In a recent work [40], a simple variant of the CSBM was also chosen as the default generative model for benchmarking GNN performance for node classification tasks. We present two sets of experiments. In the first set, we fix the distance between the means and vary qq, and in the second set, we fix qq and vary the distance. We set n=1000n=1000, d=n/log2(n)d=n/\log^{2}(n), p=0.5p=0.5 and σ=0.1\sigma=0.1. Results are averaged over 1010 trials.
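For reproducibility, a minimal sketch (ours) of a CSBM sampler consistent with this setup is given below; it draws classes that are balanced in expectation, Gaussian features with means ±μ, and an undirected graph with intra-class edge probability p and inter-class edge probability q. The function and variable names are our own.

import numpy as np

def sample_csbm(n, p, q, mu, sigma, rng):
    # Sample (X, A) ~ CSBM(n, p, q, mu, sigma^2): features are a Gaussian mixture
    # with means +/- mu, edges appear with probability p within a class and q across.
    labels = rng.integers(0, 2, size=n)                  # epsilon_i in {0, 1}
    signs = 2 * labels - 1
    X = signs[:, None] * mu + sigma * rng.standard_normal((n, len(mu)))
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)     # undirected, no self-loops
    A = (upper | upper.T).astype(int)
    return X, A, labels

rng = np.random.default_rng(0)
n = 1000
d = int(n / np.log(n) ** 2)
sigma = 0.1
mu = np.full(d, sigma / np.sqrt(d))                      # hard-regime mean from Section 5.2.1
X, A, labels = sample_csbm(n, p=0.5, q=0.1, mu=mu, sigma=sigma, rng=rng)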

5.2.1 Fixing the distance between the means and varying qq

We consider the two regimes separately, where for the “easy regime” we fix the mean 𝝁\boldsymbol{\mu} to be a vector where each coordinate is equal to 10σlogn2/2d10\sigma\sqrt{\log{n^{2}}}/2\sqrt{d}. This guarantees that the distance between the means is 10σlogn210\sigma\sqrt{\log{n^{2}}}. In the “hard regime” we fix the mean 𝝁\boldsymbol{\mu} to a vector where each coordinate is equal to σ/d\sigma/\sqrt{d}, and this guarantees that the distance is σ\sigma. We fix p=0.5p=0.5 and vary qq from log2(n)/n\log^{2}(n)/n to 1log2(n)/n1-\log^{2}(n)/n.

In Figure 1 we illustrate Theorem 3 and Corollaries 4 and 5 for the easy regime, and in Figure 2 we illustrate Theorem 9 and Theorem 10 for the hard regime. In particular, in Figure 1(a) we show Theorem 3: MLP-GAT is able to classify intra-class and inter-class edges perfectly. In Figure 1(b) we show that in the easy regime, as claimed in Corollary 4 for MLP-GAT, when pqp\geq q the γ\gamma that correspond to intra-class edges concentrate around 2/np2/np, while the γ\gamma for the inter-class edges concentrate to tiny values; when p<qp<q we see the opposite, that is, the γ\gamma that correspond to intra-class edges concentrate to tiny values, while the γ\gamma for the inter-class edges concentrate around 2/nq2/nq. In Figure 1(c) we observe that the performance of MLP-GAT for node classification is independent of qq in the easy regime, as claimed in Corollary 5. However, in this plot we also observe that not using the graph achieves perfect node classification as well, a result which is proved in Proposition 7. In the same plot, we also show the performance of simple graph convolution, whose performance depends on qq (see [7]). In Figure 2(a) we show that MLP-GAT misclassifies a constant fraction of the intra-class and inter-class edges, as proved in Theorem 9. In Figure 2(b) we show Theorem 10: the γ\gamma’s in the hard regime concentrate around uniform (GCN) coefficients for both MLP-GAT and GAT. In Figure 2(c) we illustrate that node classification accuracy is a function of qq for MLP-GAT, as expected from Conjecture 11. On the other hand, note that the performance of MLP-GAT is lower bounded by the performance of not using the graph, as proved in Theorem 14.

Figure 1: Demonstration of Theorem 3 and Corollaries 4 and 5 for the easy regime. Panels: (a) Edge classification, (b) Attention coefficients, (c) Node classification. The shaded areas in the plots show standard deviation. Unlike GCN, the performance of MLP-GAT does not degrade as we vary qq.

Figure 2: Demonstration of Theorem 9 and Theorem 10 for the hard regime. Panels: (a) Edge classification, (b) Attention coefficients, (c) Node classification. The shaded areas in the plots show standard deviation. Unlike GCN, the performance of MLP-GAT is lower bounded by the performance of not using the graph.

5.2.2 Fixing qq and varying the distance between the means

We consider the case where q=0.1q=0.1. In Figure 3 we show how the attention coefficients of MLP-GAT and GAT, as well as the node and edge classification accuracy, depend on the distance between the means. We also add a vertical line at σ\sigma to approximately separate the hard (left of σ\sigma) and easy (right of σ\sigma) regimes. Figure 3(c) illustrates Theorems 3 and 9 in the easy and hard regimes, respectively. In particular, we observe that in the hard regime MLP-GAT fails to distinguish intra-class from inter-class edges, while in the easy regime it is able to do that perfectly for a large enough distance between the means.

In Figure 3(a) we observe that in the hard regime the γ\gamma concentrate around the uniform (GCN) coefficients, while in the easy regime MLP-GAT maintains the γ\gamma for intra-class edges and sets the γ\gamma for inter-class edges to tiny values. In Figure 3(b) we observe that in the hard regime the γ\gamma of GAT concentrate around the uniform coefficients (proved in Theorem 10), while in the easy regime, although the γ\gamma concentrate, GAT is not able to distinguish intra-class from inter-class edges. This makes sense since the separation of edges cannot be done by the simple linear classifiers used by GAT; see the discussion below Theorem 10. Finally, in Figure 3(d) we show node classification results for MLP-GAT. In the easy regime, we observe perfect classification as proved in Corollary 5. However, as the distance between the means decreases, MLP-GAT starts to misclassify nodes.

Figure 3: Attention coefficients of MLP-GAT and GAT, and node and edge classification, as a function of the distance between the means. Panels: (a) Attention coefficients of MLP-GAT, (b) Attention coefficients of GAT, (c) Edge classification accuracy, (d) Node classification accuracy. Shaded areas show standard deviation. When there is a sufficient distance between the means, the attention coefficients of MLP-GAT demonstrate a nice separation while those of GAT do not.

5.3 Real data

For the experiment on real data, we illustrate the attention coefficients and the node and edge classification of MLP-GAT as a function of the distance between the means. We use the popular real-world graph datasets Cora, PubMed, and CiteSeer collected by PyTorch Geometric [19] and ogbn-arxiv from the Open Graph Benchmark [25]. The datasets come with multiple classes; however, for each of our experiments we do a one-vs-all classification for a single class. This is a semi-supervised problem: only a fraction of the nodes have labels, and the rest of the nodes are used for measuring prediction accuracy. To control the distance between the means of the problem, we use the true labels to determine the class of each node and then compute the empirical mean for each class. We subtract the empirical means from their corresponding classes and add means 𝝁\boldsymbol{\mu} and 𝝁-\boldsymbol{\mu} to each class, respectively. This modification can be thought of as translating the mean of the distribution of the data for each class.
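A minimal sketch (ours) of this translation step is given below; the inputs `features` and `labels` are assumed to be a NumPy feature matrix and binary labels for the chosen one-vs-all split.

import numpy as np

def recenter_classes(features, labels, mu):
    # Shift each class so that its empirical mean becomes -mu (class 0) or +mu
    # (class 1), as described in the text.
    out = np.asarray(features, dtype=float).copy()
    for c, sign in [(0, -1.0), (1, 1.0)]:
        mask = labels == c
        out[mask] = out[mask] - out[mask].mean(axis=0) + sign * mu
    return out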

The results of this experiment are shown in Figure 4. For Cora, PubMed, and CiteSeer we show results for class 0, since these are small datasets, each containing at most 7 classes of similar sizes. In our experiments on other classes, we observed similar results. For ogbn-arxiv we show results for the largest class (i.e. class 16), since this is a larger dataset with 40 imbalanced classes, and picking a large class makes the one-vs-all classification task more balanced. In our experiments on other classes of similar sizes, we obtained similar results. We note that on the real data we observe behavior of MLP-GAT in the easy and hard regimes similar to that on the synthetic data. In particular, for all datasets, as the distance between the means increases, MLP-GAT is able to accurately classify intra-class and inter-class edges, see Figures 4(a), 4(d) and 4(g). Moreover, as the distance between the means increases, the average intra-class γ\gamma becomes much larger than the average inter-class γ\gamma, see Figures 4(b), 4(e), 4(h), and 4(k), and the model is able to classify the nodes accurately, see Figures 4(c), 4(f), 4(i), and 4(l). On the contrary, in the same figures we observe that as the distance between the means decreases, MLP-GAT is not able to separate intra-class from inter-class edges, the averaged γ\gamma are very close to the uniform coefficients, and the model cannot classify the nodes accurately.

Figure 4: Attention coefficients, node and edge classification for MLP-GAT as a function of the distance between the means for real data. Panels: (a) Edge classification, Cora; (b) Attention coefficients, Cora; (c) Node classification, Cora; (d) Edge classification, PubMed; (e) Attention coefficients, PubMed; (f) Node classification, PubMed; (g) Edge classification, CiteSeer; (h) Attention coefficients, CiteSeer; (i) Node classification, CiteSeer; (j) Edge classification, ogbn-arxiv; (k) Attention coefficients, ogbn-arxiv; (l) Node classification, ogbn-arxiv.

Note that Figure 4 does not show the standard deviation of the attention coefficients γ\gamma. We show the standard deviation of γ\gamma in Figure 5. We observe that the standard deviation is higher than what we observed on the synthetic data; in particular, it can be more than half of the averaged γ\gamma. This is to be expected, since for the real data the degrees of the nodes do not concentrate as well. In Figure 5 we show that the standard deviation of the uniform coefficients 1/|Ni|1/|N_{i}| is also high. For Cora, PubMed, and CiteSeer, the standard deviation for the intra-class γ\gamma is similar to that of 1/|Ni|1/|N_{i}|, while the deviation for the inter-class γ\gamma is large for a small distance between the means but gets much smaller as the distance increases. For ogbn-arxiv, the standard deviation of 1/|Ni|1/|N_{i}| is particularly high. This implies that the degree distribution of ogbn-arxiv has a heavy tail, which could result from the graph structure being noisier than that of the other datasets, and it could also explain GCN’s relatively much worse performance when the distance between the means is large.

Figure 5: Standard deviation for attention coefficients of MLP-GAT. Panels: (a) Cora, (b) PubMed, (c) CiteSeer, (d) ogbn-arxiv.

6 Summary of implications for practitioners

While this work focuses on a theoretical understanding of graph attention’s capability for edge and node classification using the CSBM generative model, our findings yield a series of potential suggestions for GNN practitioners. In this section we provide some practical implications of our results.

6.1 Why graph attention? Benefits of graph attention’s robustness to structural noise

When the graph is very noisy, e.g. when a node has a similar number of neighbors from the same class and from different classes, simple graph convolution will mix up node features and thus make nodes from different classes indistinguishable. In this case, simple graph convolution can do more harm than good. However, in practice, it is difficult to determine how noisy the graph is, or if the graph is even useful at all. This could pose a challenge in choosing an architecture. Our results in Theorem 14 and Corollary 15 imply that graph attention has the ability to dramatically reduce the impact of a noisy graph, in a way such that the output is at least as good as the output from the best linear classifier (on the input) which does not rely on the graph. This shows that graph attention convolution is more robust against structural noise in the graph, and hence on noisy graphs, it is strictly better than simple graph convolution.

6.2 Which attention architecture? Benefits of multi-layer attention architecture

In this work, we are able to obtain positive results for graph attention by using the two-layer MLP attention architecture in (3). This is different from the original GAT which uses a single-layer attention architecture [44]. In our analyses and empirical experiments, we found that the original single-layer attention does not have the important properties required for obtaining positive results (e.g. Theorem 3, Corollary 5, Theorem 14, Corollary 16) for graph attention convolution. Coincidentally, this aligns with the findings of [11], where the authors study limitations of the original GAT architecture from a different perspective than ours, and they propose a new architecture termed GATv2. The two-layer MLP attention architecture that we consider in (3) encompasses GATv2 as a special case. Therefore, the two-layer MLP attention architecture can be a good candidate to consider when practitioners search for a suitable graph attention architecture for their specific downstream tasks. On the other hand, our results in Section 3.2 imply that even the two-layer MLP attention architecture (and hence GATv2) has limitations when the node features are noisy. To fix that, a potential solution is to incorporate additional information such as edge features, which we discuss next.

6.3 Will additional information help? Benefits of incorporating good edge features

Even though we do not consider edge features in our analyses, our results in Theorem 13 imply that good attention functions that are able to classify the edges independently from the node features can be very helpful, as they help reduce the threshold under which graph attention convolution would fail to separate the nodes. One potential way to obtain good attention functions that behave like the one given in (13) is by incorporating good edge features. Furthermore, given our result in Theorem 9, which says that graph attention based on noisy node features cannot perfectly classify the edges, the importance of incorporating informative edge features that are more indicative of edge class memberships (if they are accessible in practice) into the attention mechanism is more pronounced.

7 Conclusion and future work

In this work, we study the impact of graph attention on edges and its implications for node classification. We show that graph attention improves robustness to noise in the graph structure. We also show that graph attention may not be very useful in a “hard” regime where the node features are noisy. Our work shows that single-layer graph attention convolution has limited power at distinguishing intra-class from inter-class edges. Given the empirical successes of graph attention and its many variants, a promising direction for future work is to study the power of multi-layer graph attention convolutions for distinguishing intra-class and inter-class edges. Moreover, our negative results in Section 3.2 for edge/node classification pertain to perfect classification and almost perfect classification. In practice, misclassification of a small constant fraction of nodes/edges is often inevitable, but the resulting classification is nonetheless useful. Therefore, an interesting future line of work is to characterize the threshold under which graph attention is going to misclassify a certain proportion of nodes. Finally, variants of graph attention networks have been successfully used in tasks other than node classification, such as link prediction and graph classification. These tasks are typically solved by architectures that add a final aggregation layer which combines the node representations generated by graph attention convolution. It is an interesting future direction to develop a good understanding of the benefits and limitations of the graph attention mechanism for those tasks.

Acknowledgement

K. Fountoulakis would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2019-04067, DGECR-2019-00147].

A. Jagannath acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2020-04597, DGECR-2020-00199].

Appendix A Proofs

We define the following high-probability events, which will be used in some of the proofs. Each of these events holds with probability at least 1o(1)1-o(1), which follows from straightforward applications of the Chernoff bound and a union bound, e.g., see [7].

Definition 17.

Define the following events over the randomness of 𝐀\mathbf{A} and {ϵi}i[n]\{{\epsilon}_{i}\}_{i\in[n]} and {𝐗i}i[n]\{\mathbf{X}_{i}\}_{i\in[n]},

  • 𝓔1\boldsymbol{\mathcal{E}}_{1} is the event that |C0|=n2±O(nlogn)|C_{0}|=\frac{n}{2}\pm O(\sqrt{n\log n}) and |C1|=n2±O(nlogn)|C_{1}|=\frac{n}{2}\pm O(\sqrt{n\log n}).

  • 𝓔2\boldsymbol{\mathcal{E}}_{2} is the event that for each i[n]i\in[n], 𝐃ii=n(p+q)2(1±10logn)\mathbf{D}_{ii}=\frac{n(p+q)}{2}\left(1\pm\frac{10}{\sqrt{\log n}}\right).

  • 𝓔3\boldsymbol{\mathcal{E}}_{3} is the event that for each i[n]i\in[n], |C0Ni|=𝐃ii(1ϵi)p+ϵiqp+q(1±10logn)|C_{0}\cap N_{i}|=\mathbf{D}_{ii}\cdot\frac{(1-{\epsilon}_{i})p+{\epsilon}_{i}q}{p+q}\left(1\pm\frac{10}{\sqrt{\log n}}\right) and |C1Ni|=𝐃ii(1ϵi)q+ϵipp+q(1±10logn)|C_{1}\cap N_{i}|=\mathbf{D}_{ii}\cdot\frac{(1-{\epsilon}_{i})q+{\epsilon}_{i}p}{p+q}\left(1\pm\frac{10}{\sqrt{\log n}}\right).

  • 𝓔4\boldsymbol{\mathcal{E}}_{4} is the event that for each i[n]i\in[n], |𝒘~T𝐗i𝐄[𝒘~T𝐗i]|10σlogn\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}-\mathop{{\bf E}\/}\left[\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right]\right|\leq 10\sigma\sqrt{\log n}.

  • 𝓔\boldsymbol{\mathcal{E}}^{*} is the intersection of the above 4 events.

Lemma 18 ([7]).

With probability at least 1o(1)1-o(1) event 𝓔\boldsymbol{\mathcal{E}}^{*} holds.

Some of our proofs also utilize the following simple observation on the mutual independence among {𝒘T𝒈i}i[n]\{\boldsymbol{w}^{T}\boldsymbol{g}_{i}\}_{i\in[n]} when {𝒈i}i[n]\{\boldsymbol{g}_{i}\}_{i\in[n]} are i.i.d. Gaussian random vectors.

Observation 19.

Fix 𝐰0\boldsymbol{w}\neq 0 in d\mathbb{R}^{d} and let 𝐠1,,𝐠n\boldsymbol{g}_{1},\ldots,\boldsymbol{g}_{n} be i.i.d. drawn from N(0,𝐈)N(0,\mathbf{I}). Then 𝐰T𝐠1,𝐰T𝐠2,,𝐰T𝐠n\boldsymbol{w}^{T}\boldsymbol{g}_{1},\boldsymbol{w}^{T}\boldsymbol{g}_{2},\ldots,\boldsymbol{w}^{T}\boldsymbol{g}_{n} are independent.

Proof:  Note that since 𝒘T𝒈iN(0,𝒘2)\boldsymbol{w}^{T}\boldsymbol{g}_{i}\sim N(0,\|\boldsymbol{w}\|^{2}), it suffices to prove that the covariance 𝐄[𝒘T𝒈i𝒘T𝒈j]=0\mathop{{\bf E}\/}[\boldsymbol{w}^{T}\boldsymbol{g}_{i}\cdot\boldsymbol{w}^{T}\boldsymbol{g}_{j}]=0 for all iji\neq j. By definition, for iji\neq j,

𝐄[𝒘T𝒈i𝒘T𝒈j]\displaystyle\mathop{{\bf E}\/}[\boldsymbol{w}^{T}\boldsymbol{g}_{i}\cdot\boldsymbol{w}^{T}\boldsymbol{g}_{j}] =𝐄[k[d][d]𝒘k𝒘𝒈ik𝒈j]=k[d][d]𝒘k𝒘𝐄[𝒈ik𝒈j]=0,\displaystyle=\mathop{{\bf E}\/}\left[\sum_{k\in[d]}\sum_{\ell\in[d]}\boldsymbol{w}_{k}\boldsymbol{w}_{\ell}\boldsymbol{g}_{ik}\boldsymbol{g}_{j\ell}\right]=\sum_{k\in[d]}\sum_{\ell\in[d]}\boldsymbol{w}_{k}\boldsymbol{w}_{\ell}\mathop{{\bf E}\/}[\boldsymbol{g}_{ik}\boldsymbol{g}_{j\ell}]=0,

where the last equality follows from independence between 𝒈i\boldsymbol{g}_{i} and 𝒈j\boldsymbol{g}_{j}.     

A.1 Proof of Theorem 3

We restate Theorem 3 for convenience.

Theorem.

Suppose that 𝛍=ω(σlogn)\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n}). Then with probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), the two-layer MLP attention architecture Ψ\Psi given in (3) and (4) separates intra-class edges from inter-class edges.

Recall from (5) that

Ψ(𝐗i,𝐗j)={2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|,2R(1β)sign(𝒘~T𝐗i)𝒘~T𝐗j,if|𝒘~T𝐗i|<𝒘~T𝐗j<|𝒘~T𝐗i|,2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|.\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}[]{ll}-2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\leq-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i})\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j},&\mbox{if}~{}-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|<\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\geq\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|.\end{array}\right.

We will condition on the event that |𝒘~T𝐗i𝐄[𝒘~T𝐗i]|10σlogn\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}-\mathop{{\bf E}\/}\left[\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right]\right|\leq 10\sigma\sqrt{\log n} for all i[n]i\in[n], which holds with probability at least 1O(1/n99)1-O(1/n^{99}) following a union bound and the Gaussian tail probability. Under this event, because 𝝁=ω(σlogn)\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n}), for all i,jC1i,j\in C_{1} we have

sign(𝒘~T𝐗i)=sign(𝒘~T𝐗j)=1,\displaystyle\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i})=\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j})=1,
min{𝒘~T𝐗i,𝒘~T𝐗j}𝝁10σlogn,\displaystyle\min\{\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\}\geq\|\boldsymbol{\mu}\|-10\sigma\sqrt{\log n},
max{𝒘~T𝐗i,𝒘~T𝐗j}𝝁+10σlogn,\displaystyle\max\{\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\}\leq\|\boldsymbol{\mu}\|+10\sigma\sqrt{\log n},

and hence

Ψ(𝐗i,𝐗j)\displaystyle\Psi(\mathbf{X}_{i},\mathbf{X}_{j}) 2R(1β)min{𝒘~T𝐗i,𝒘~T𝐗j}2R(1β)(𝝁O(σlogn)),\displaystyle\geq 2R(1-\beta)\cdot\min\{\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\}\geq 2R(1-\beta)\cdot(\|\boldsymbol{\mu}\|-O(\sigma\sqrt{\log n})),
Ψ(𝐗i,𝐗j)\displaystyle\Psi(\mathbf{X}_{i},\mathbf{X}_{j}) 2R(1β)max{𝒘~T𝐗i,𝒘~T𝐗j}2R(1β)(𝝁+O(σlogn)),\displaystyle\leq 2R(1-\beta)\cdot\max\{\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\}\leq 2R(1-\beta)\cdot(\|\boldsymbol{\mu}\|+O(\sigma\sqrt{\log n})),

which implies that Ψ(𝐗i,𝐗j)=𝝁±O(σlogn)\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\|\boldsymbol{\mu}\|\pm O(\sigma\sqrt{\log n}) for i,jC1i,j\in C_{1}. Similarly, we get that

Ψ(𝐗i,𝐗j)={𝝁±O(σlogn)=𝝁(1±o(1)),if (i,j) is an intra-class edge,𝝁±O(σlogn)=𝝁(1±o(1)),if (i,j) is an inter-class edge.\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}[]{ll}\hskip 8.53581pt\|\boldsymbol{\mu}\|\pm O(\sigma\sqrt{\log n})=\hskip 8.53581pt\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $(i,j)$ is an intra-class edge},\\ -\|\boldsymbol{\mu}\|\pm O(\sigma\sqrt{\log n})=-\|\boldsymbol{\mu}\|(1\pm o(1)),&\mbox{if $(i,j)$ is an inter-class edge}.\end{array}\right. (14)

Therefore, with probability at least 1o(1)1-o(1), we have that

sign(Ψ(𝐗i,𝐗j))={1,if (i,j) is an intra-class edge,1,if (i,j) is an inter-class edge,\displaystyle\operatorname{sign}(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))=\left\{\begin{array}[]{ll}\hskip 8.53581pt1,&\mbox{if $(i,j)$ is an intra-class edge},\\ -1,&\mbox{if $(i,j)$ is an inter-class edge},\end{array}\right.

which means perfect separability of edges, and the proof is complete.
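A quick numerical check of this argument (ours, with arbitrary parameter choices): sampling node features with the norm of the mean far above σ√log n and evaluating the case-by-case form of Ψ from (5) on random pairs recovers the claimed signs.

import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 2000, 20, 1.0
mu = np.zeros(d)
mu[0] = 20 * sigma * np.sqrt(np.log(n))     # ||mu|| well above sigma * sqrt(log n)
w = mu / np.linalg.norm(mu)
R, beta = 1.0, 0.5

def psi(xi, xj):
    # Case-by-case form of Psi from (5).
    si, sj = w @ xi, w @ xj
    if sj <= -abs(si):
        return -2 * R * (1 - beta) * si
    if sj >= abs(si):
        return 2 * R * (1 - beta) * si
    return 2 * R * (1 - beta) * np.sign(si) * sj

labels = rng.integers(0, 2, size=n)
X = (2 * labels - 1)[:, None] * mu + sigma * rng.standard_normal((n, d))
# Random pairs stand in for edges; in CSBM the edges are sampled independently
# of the features, so the sign pattern of Psi is the same on actual edges.
pairs = rng.integers(0, n, size=(5000, 2))
agree = [np.sign(psi(X[i], X[j])) == (1.0 if labels[i] == labels[j] else -1.0)
         for i, j in pairs if i != j]
print("fraction of pairs with the sign predicted by Theorem 3:", np.mean(agree))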

A.2 Proof of Corollary 4

We restate Corollary 4 for convenience.

Corollary.

Suppose that 𝛍=ω(σlogn)\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n}) and that Assumption 1 holds. Then with probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), the attention architecture Ψ\Psi^{\prime} yields attention coefficients γij\gamma_{ij} such that

  1.

    If pqp\geq q, then γij=2np(1±o(1))\gamma_{ij}=\frac{2}{np}(1\pm o(1)) if (i,j)(i,j) is an intra-class edge and γij=o(1n(p+q))\gamma_{ij}=o(\frac{1}{n(p+q)}) otherwise;

  2.

    If p<qp<q, then γij=2nq(1±o(1))\gamma_{ij}=\frac{2}{nq}(1\pm o(1)) if (i,j)(i,j) is an inter-class edge and γij=o(1n(p+q))\gamma_{ij}=o(\frac{1}{n(p+q)}) otherwise.

The proof is straightforward by considering the cases pqp\geq q and p<qp<q separately. When pqp\geq q, we have Ψ=Ψ\Psi^{\prime}=\Psi. Using the specification of Ψ\Psi in (3) and (4), the definition of attention coefficients in (1), the high probability event in Lemma 18, the expression of Ψ(𝐗i,𝐗j)\Psi(\mathbf{X}_{i},\mathbf{X}_{j}) in (14), and picking RR such that 1/R=ω(σlogn)1/R=\omega(\sigma\sqrt{\log n}) and 1/R=o(𝝁)1/R=o(\|\boldsymbol{\mu}\|), we obtain the claimed results. The result when p<qp<q is obtained in the same way.

A.3 Proof of Corollary 5

We restate Corollary 5 for convenience.

Corollary.

Suppose that 𝛍=ω(σlogn)\|\boldsymbol{\mu}\|=\omega(\sigma\sqrt{\log n}) and that Assumption 1 holds. Then with probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), using the attention architecture Ψ\Psi^{\prime} with the graph attention convolution given in (2), where ff is set to be the identity function, the model separates the nodes.

We prove the case pqp\geq q; the case p<qp<q follows analogously. Consider the attention architecture Ψ=(𝟏pq𝟏p<q)Ψ=Ψ\Psi^{\prime}=(\mathbf{1}_{p\geq q}-\mathbf{1}_{p<q})\cdot\Psi=\Psi, where Ψ\Psi is given in (3) and (4). Pick RR such that 1/R=ω(σlogn)1/R=\omega(\sigma\sqrt{\log n}) and 1/R=o(𝝁)1/R=o(\|\boldsymbol{\mu}\|). Assume that iC1i\in C_{1}, and denote the graph attention convolution output as

hi=defjNiγij𝒘~T𝐗j.h_{i}^{\prime}\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}.

We will condition on the event 𝓔\boldsymbol{\mathcal{E}}^{*}, which holds with probability at least 1o(1)1-o(1). By using Corollary 4 we have

jNiγij𝒘~T𝐗j\displaystyle\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j} =jC0Niγij𝒘~T𝐗j+jC1Niγij𝒘~T𝐗j\displaystyle=\sum_{j\in C_{0}\cap N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}+\sum_{j\in C_{1}\cap N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}
|C1Ni|(2np(1±o(1))(𝝁+10σlogn))\displaystyle\leq|C_{1}\cap N_{i}|\left(\frac{2}{np}(1\pm o(1))\left(\|\boldsymbol{\mu}\|+10\sigma\sqrt{\log n}\right)\right)
+|C0Ni|(o(1n(p+q))(𝝁+10σlogn))\displaystyle\qquad+|C_{0}\cap N_{i}|\left(o\left(\frac{1}{n(p+q)}\right)\left(-\|\boldsymbol{\mu}\|+10\sigma\sqrt{\log n}\right)\right)
=(1±o(1))(𝝁+10σlogn)nq(1±o(1))ω(n(p+q))(𝝁10σlogn)\displaystyle=(1\pm o(1))\cdot\left(\|\boldsymbol{\mu}\|+10\sigma\sqrt{\log n}\right)-\frac{nq(1\pm o(1))}{\omega(n(p+q))}\cdot\left(\|\boldsymbol{\mu}\|-10\sigma\sqrt{\log n}\right)
=𝝁(1±o(1)).\displaystyle=\|\boldsymbol{\mu}\|(1\pm o(1)).

Similarly, we have that

jNiγij𝒘~T𝐗j\displaystyle\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j} (1±o(1))(𝝁10σlogn)nq(1±o(1))ω(n(p+q))(𝝁+10σlogn)\displaystyle\geq(1\pm o(1))\cdot\left(\|\boldsymbol{\mu}\|-10\sigma\sqrt{\log n}\right)-\frac{nq(1\pm o(1))}{\omega(n(p+q))}\cdot\left(\|\boldsymbol{\mu}\|+10\sigma\sqrt{\log n}\right)
=𝝁(1±o(1)).\displaystyle=\|\boldsymbol{\mu}\|(1\pm o(1)).

This means that hi=𝝁(1±o(1))h_{i}^{\prime}=\|\boldsymbol{\mu}\|(1\pm o(1)) for iC1i\in C_{1}. Applying the same reasoning we get that hi=𝝁(1±o(1))h_{i}^{\prime}=-\|\boldsymbol{\mu}\|(1\pm o(1)) for iC0i\in C_{0}. Therefore, with probability at least 1o(1)1-o(1), the graph attention convolution separates the nodes.

A.4 Proof of Lemma 8

We restate Lemma 8 for convenience.

Lemma.

Let (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}) and let 𝐗ij\mathbf{X}^{\prime}_{ij} be defined as in (9). The Bayes optimal classifier for 𝐗ij\mathbf{X}^{\prime}_{ij} is realized by the following function,

h(𝒙)={0,ifpcosh(𝒙T𝝁σ2)qcosh(𝒙T𝝂σ2),1,otherwise,h^{*}(\boldsymbol{x})=\left\{\begin{array}[]{ll}0,&\text{if}\ p\cosh\left({\frac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)\leq q\cosh\left({\frac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right),\\ 1,&\text{otherwise},\end{array}\right.

where 𝛍=def(𝛍𝛍)\boldsymbol{\mu}^{\prime}\stackrel{{\scriptstyle\rm def}}{{=}}\begin{pmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{pmatrix} and 𝛎=def(𝛍𝛍)\boldsymbol{\nu}^{\prime}\stackrel{{\scriptstyle\rm def}}{{=}}\begin{pmatrix}\boldsymbol{\mu}\\ -\boldsymbol{\mu}\end{pmatrix}.

Proof:  Note that 𝐗ij\mathbf{X}^{\prime}_{ij} is a mixture of 2d2d-dimensional Gaussian distributions,

𝐗ij{N(𝝁,σ2𝐈)iC0,jC0N(𝝁,σ2𝐈)iC1,jC1N(𝝂,σ2𝐈)iC0,jC1N(𝝂,σ2𝐈)iC1,jC0.\mathbf{X}^{\prime}_{ij}\sim\begin{cases}N(-\boldsymbol{\mu}^{\prime},\sigma^{2}\mathbf{I})&i\in C_{0},j\in C_{0}\\ N(\boldsymbol{\mu}^{\prime},\sigma^{2}\mathbf{I})&i\in C_{1},j\in C_{1}\\ N(-\boldsymbol{\nu}^{\prime},\sigma^{2}\mathbf{I})&i\in C_{0},j\in C_{1}\\ N(\boldsymbol{\nu}^{\prime},\sigma^{2}\mathbf{I})&i\in C_{1},j\in C_{0}\end{cases}.

The optimal classifier is then given by

h(𝒙)=argmaxc{0,1}𝐏𝐫[y=c𝒙].h^{*}(\boldsymbol{x})=\operatorname*{arg\,max}_{c\in\{0,1\}}\mathop{{\bf Pr}\/}[y=c\mid\boldsymbol{x}].

Note that 𝐏𝐫[y=0]=qp+q\mathop{{\bf Pr}\/}[y=0]=\frac{q}{p+q} and 𝐏𝐫[y=1]=pp+q\operatorname{{\bf Pr}}[y=1]=\frac{p}{p+q}. Thus, by Bayes rule, we obtain that

𝐏𝐫[y=c𝒙]\displaystyle\mathop{{\bf Pr}\/}[y=c\mid\boldsymbol{x}] =𝐏𝐫[y=c]f𝒙|y(𝒙y=c)𝐏𝐫[y=0]f𝒙|y=0(𝒙y=0)+𝐏𝐫[y=1]f𝒙|y=1(𝒙y=1)\displaystyle=\frac{\mathop{{\bf Pr}\/}[y=c]\cdot f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=c)}{\mathop{{\bf Pr}\/}[y=0]f_{\boldsymbol{x}|y=0}(\boldsymbol{x}\mid y=0)+\mathop{{\bf Pr}\/}[y=1]f_{\boldsymbol{x}|y=1}(\boldsymbol{x}\mid y=1)}
=11+𝐏𝐫[y=1c]f𝒙|y(𝒙y=1c)𝐏𝐫[y=c]f𝒙|y(𝒙y=c).\displaystyle=\frac{1}{1+\frac{\mathop{{\bf Pr}\/}[y=1-c]\cdot f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=1-c)}{\mathop{{\bf Pr}\/}[y=c]\cdot f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=c)}}.

Suppose that 𝒙=𝐗ij\boldsymbol{x}=\mathbf{X}^{\prime}_{ij} such that iji\nsim j. Then h(𝒙)=0h^{*}(\boldsymbol{x})=0 if and only if 𝐏𝐫[y=0𝒙]12\mathop{{\bf Pr}\/}[y=0\mid\boldsymbol{x}]\geq\frac{1}{2}. Hence, for c=0c=0 we require that

𝐏𝐫[y=1c]f𝒙|y(𝒙y=1c)𝐏𝐫[y=c]f𝒙|y(𝒙y=c)=pqf𝒙|y(𝒙y=1)f𝒙|y(𝒙y=0)=pqcosh(1σ2𝒙T𝝁)cosh(1σ2𝒙T𝝂)1,\frac{\mathop{{\bf Pr}\/}[y=1-c]\cdot f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=1-c)}{\mathop{{\bf Pr}\/}[y=c]\cdot f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=c)}=\frac{p}{q}\frac{f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=1)}{f_{\boldsymbol{x}|y}(\boldsymbol{x}\mid y=0)}=\frac{p}{q}\frac{\cosh\left({\frac{1}{\sigma^{2}}\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}\right)}{\cosh\left({\frac{1}{\sigma^{2}}\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}\right)}\leq 1,

Similarly, we obtain the reverse condition for h(𝒙)=1h^{*}(\boldsymbol{x})=1.     
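For completeness, a direct implementation sketch of h* (ours) is given below; it compares the two weighted cosh terms on the log scale for numerical stability, and the function name and input convention are illustrative.

import numpy as np

def bayes_edge_classifier(x, mu, p, q, sigma):
    # h*(x) from Lemma 8 for x = (X_i, X_j): compare p*cosh(x^T mu'/sigma^2)
    # with q*cosh(x^T nu'/sigma^2) on the log scale, where mu' = (mu, mu) and
    # nu' = (mu, -mu); label 1 corresponds to an intra-class pair, 0 to an inter-class pair.
    d = len(mu)
    xi, xj = x[:d], x[d:]
    a = (xi + xj) @ mu / sigma ** 2         # x^T mu' / sigma^2
    b = (xi - xj) @ mu / sigma ** 2         # x^T nu' / sigma^2
    def logcosh(t):
        # log cosh(t) = |t| + log(1 + exp(-2|t|)) - log 2, stable for large |t|
        return abs(t) + np.log1p(np.exp(-2.0 * abs(t))) - np.log(2.0)
    return 0 if np.log(p) + logcosh(a) <= np.log(q) + logcosh(b) else 1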

A.5 Proof of Theorem 9

We restate Theorem 9 for convenience.

Theorem.

Suppose 𝛍=κσ\|\boldsymbol{\mu}\|=\kappa\sigma for some κ>0\kappa>0 and let Ψ\Psi be any attention mechanism. Then,

  1.

    With probability at least 1o(1)1-o(1), Ψ\Psi fails to correctly classify at least 2Φc(κ)22\cdot\Phi_{\mathrm{c}}(\kappa)^{2} fraction of inter-class edges;

  2.

    For any K>1K>1, if q>Klog2nnΦc(κ)2q>\frac{K\log^{2}n}{n\Phi_{\rm c}(\kappa)^{2}}, then with probability at least 1O(nK4Φc(κ)2logn)1-O(n^{-\frac{K}{4}\Phi_{\rm c}(\kappa)^{2}\log n}), Ψ\Psi misclassifies at least one inter-class edge.

We will write iji\sim j if node ii and node jj are in the same class and iji\nsim j otherwise. From Lemma 8, we observe that for successful classification by the optimal classifier, we need

pcosh(𝒙T𝝁σ2)qcosh(𝒙T𝝂σ2)\displaystyle p\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)\leq q\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right) forij,\displaystyle\quad\text{for}\;i\nsim j,
pcosh(𝒙T𝝁σ2)>qcosh(𝒙T𝝂σ2)\displaystyle p\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)>q\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right) forij.\displaystyle\quad\text{for}\;i\sim j.

We will split the analysis into two cases. First, note that when pqp\geq q we have for iji\nsim j that

pcosh(𝒙T𝝁σ2)qcosh(𝒙T𝝂σ2)cosh(𝒙T𝝁σ2)cosh(𝒙T𝝂σ2)|𝒙T𝝁||𝒙T𝝂|.\displaystyle p\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)\leq q\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right)\implies\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)\leq\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right)\implies|\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}|\leq|\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}|.

In the first implication, we used that pqp\geq q, while the second implication follows from the fact that cosh(a)cosh(b)|a||b|\cosh(a)\leq\cosh(b)\implies|a|\leq|b| for all a,ba,b\in\mathbb{R}. Similarly, for p<qp<q we have for iji\sim j that

pcosh(𝒙T𝝁σ2)>qcosh(𝒙T𝝂σ2)cosh(𝒙T𝝁σ2)>cosh(𝒙T𝝂σ2)|𝒙T𝝁|>|𝒙T𝝂|.\displaystyle p\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)>q\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right)\implies\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}}{\sigma^{2}}}\right)>\cosh\left({\tfrac{\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}}{\sigma^{2}}}\right)\implies|\boldsymbol{x}^{T}\boldsymbol{\mu}^{\prime}|>|\boldsymbol{x}^{T}\boldsymbol{\nu}^{\prime}|.

Therefore, in each of the above cases, we can upper bound the probability that 𝐗ij\mathbf{X}^{\prime}_{ij} is correctly classified by the probability of the event |𝐗ijT𝝁||𝐗ijT𝝂||\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\mu}^{\prime}|\leq|\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\nu}^{\prime}| (for iji\nsim j) or of the event |𝐗ijT𝝁|>|𝐗ijT𝝂||\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\mu}^{\prime}|>|\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\nu}^{\prime}| (for iji\sim j). We focus on the former as the latter case is symmetric. Writing 𝐗i=𝝁+σ𝒈i\mathbf{X}_{i}=\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i} and 𝐗j=𝝁+σ𝒈j\mathbf{X}_{j}=-\boldsymbol{\mu}+\sigma\boldsymbol{g}_{j}, we have that for iC1i\in C_{1} and jC0j\in C_{0},

𝐏𝐫[h(𝐗ij)=0]\displaystyle\mathop{{\bf Pr}\/}[h^{*}(\mathbf{X}^{\prime}_{ij})=0] 𝐏𝐫[|𝐗ijT𝝁||𝐗ijT𝝂|]\displaystyle\leq\mathop{{\bf Pr}\/}\left[|\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\mu}^{\prime}|\leq|\mathbf{X}^{{}^{\prime}T}_{ij}\boldsymbol{\nu}^{\prime}|\right]
=𝐏𝐫[|𝐗iT𝝁+𝐗jT𝝁||𝐗iT𝝁𝐗jT𝝁|]\displaystyle=\mathop{{\bf Pr}\/}\left[|\mathbf{X}_{i}^{T}\boldsymbol{\mu}+\mathbf{X}_{j}^{T}\boldsymbol{\mu}|\leq|\mathbf{X}_{i}^{T}\boldsymbol{\mu}-\mathbf{X}_{j}^{T}\boldsymbol{\mu}|\right]
=𝐏𝐫[σ|𝒈iT𝝁+𝒈jT𝝁||±2𝝁2+σ𝒈iT𝝁σ𝒈jT𝝁|]\displaystyle=\mathop{{\bf Pr}\/}\left[\sigma|\boldsymbol{g}_{i}^{T}\boldsymbol{\mu}+\boldsymbol{g}_{j}^{T}\boldsymbol{\mu}|\leq|\pm 2\|\boldsymbol{\mu}\|^{2}+\sigma\boldsymbol{g}_{i}^{T}\boldsymbol{\mu}-\sigma\boldsymbol{g}_{j}^{T}\boldsymbol{\mu}|\right]
𝐏𝐫[|𝒈iT𝝁^+𝒈jT𝝁^||𝒈iT𝝁^𝒈jT𝝁^|2𝝁σ]\displaystyle\leq\mathop{{\bf Pr}\/}\left[|\boldsymbol{g}_{i}^{T}\hat{\boldsymbol{\mu}}+\boldsymbol{g}_{j}^{T}\hat{\boldsymbol{\mu}}|-|\boldsymbol{g}_{i}^{T}\hat{\boldsymbol{\mu}}-\boldsymbol{g}_{j}^{T}\hat{\boldsymbol{\mu}}|\leq\frac{2\|\boldsymbol{\mu}\|}{\sigma}\right]
=𝐏𝐫[|𝒈iT𝝁^+𝒈jT𝝁^||𝒈iT𝝁^𝒈jT𝝁^|2κ],\displaystyle=\mathop{{\bf Pr}\/}\left[|\boldsymbol{g}_{i}^{T}\hat{\boldsymbol{\mu}}+\boldsymbol{g}_{j}^{T}\hat{\boldsymbol{\mu}}|-|\boldsymbol{g}_{i}^{T}\hat{\boldsymbol{\mu}}-\boldsymbol{g}_{j}^{T}\hat{\boldsymbol{\mu}}|\leq 2\kappa\right],

where we denote 𝝁^=𝝁/𝝁\hat{\boldsymbol{\mu}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\|. In the second-to-last step above, we used the triangle inequality to pull 2𝝁22\|\boldsymbol{\mu}\|^{2} outside the absolute value, while in the last equation we used 𝝁=κσ\|\boldsymbol{\mu}\|=\kappa\sigma.

We now denote zi=𝒈iT𝝁^z_{i}=\boldsymbol{g}_{i}^{T}\hat{\boldsymbol{\mu}} for all i[n]i\in[n]. Then the above probability is 𝐏𝐫[|zi+zj||zizj|2κ]\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa], where zi,zjN(0,1)z_{i},z_{j}\sim N(0,1) are independent random variables. Note that we have

𝐏𝐫[h(𝐗ij)=0]\displaystyle\mathop{{\bf Pr}\/}[h^{*}(\mathbf{X}^{\prime}_{ij})=0] 𝐏𝐫[|zi+zj||zizj|2κ]\displaystyle\leq\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa]
=𝐏𝐫[|zi+zj||zizj|2κ,|zi|κ]\displaystyle=\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa,|z_{i}|\leq\kappa]
+𝐏𝐫[|zi+zj||zizj|2κ,|zi|>κ]\displaystyle\qquad+\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa,|z_{i}|>\kappa]
=𝐏𝐫[|zi|κ]+Φ(κ)𝐏𝐫[|zi|>κ].\displaystyle=\mathop{{\bf Pr}\/}[|z_{i}|\leq\kappa]+\Phi(\kappa)\mathop{{\bf Pr}\/}[|z_{i}|>\kappa]. (15)

To see how we obtain the last equation, observe that if |zi|κ|z_{i}|\leq\kappa then we have

|zi+zj||zizj|\displaystyle|z_{i}+z_{j}|-|z_{i}-z_{j}| =|zi+zj||zjzi|\displaystyle=|z_{i}+z_{j}|-|z_{j}-z_{i}|
|zi|+|zj||zjzi|\displaystyle\leq|z_{i}|+|z_{j}|-|z_{j}-z_{i}| by triangle inequality
|zi|+|zj|||zj||zi||\displaystyle\leq|z_{i}|+|z_{j}|-\big{|}|z_{j}|-|z_{i}|\big{|} by reverse triangle inequality
|zi|+|zj|(|zj||zi|)=2|zi|\displaystyle\leq|z_{i}|+|z_{j}|-(|z_{j}|-|z_{i}|)=2|z_{i}|
2κ,\displaystyle\leq 2\kappa,

hence, 𝐏𝐫[|zi+zj||zizj|2κ,|zi|κ]=𝐏𝐫[|zi|κ]\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa,|z_{i}|\leq\kappa]=\mathop{{\bf Pr}\/}[|z_{i}|\leq\kappa]. On the other hand, for |zi|>κ|z_{i}|>\kappa, we condition on the events zi>κz_{i}>\kappa and zi<κz_{i}<-\kappa separately and consider the four cases determined by the signs of zi+zjz_{i}+z_{j} and zizjz_{i}-z_{j}. We denote by EE the event that |zi+zj||zizj|2κ|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa, and analyze the cases in detail. First consider the case zi<κz_{i}<-\kappa:

𝐏𝐫[E,zi+zj0,zizj0zi<κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}\geq 0,z_{i}-z_{j}\geq 0\mid z_{i}<-\kappa] =𝐏𝐫[zjzi,zjzizi<κ]=0,\displaystyle=\mathop{{\bf Pr}\/}[z_{j}\leq z_{i},z_{j}\geq-z_{i}\mid z_{i}<-\kappa]=0,
𝐏𝐫[E,zi+zj0,zizj<0zi<κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}\geq 0,z_{i}-z_{j}<0\mid z_{i}<-\kappa] =𝐏𝐫[zj>|zi|,ziκzi<κ]=Φ(zi),\displaystyle=\mathop{{\bf Pr}\/}[z_{j}>|z_{i}|,z_{i}\leq\kappa\mid z_{i}<-\kappa]=\Phi(z_{i}),
𝐏𝐫[E,zi+zj<0,zizj0zi<κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}<0,z_{i}-z_{j}\geq 0\mid z_{i}<-\kappa] =𝐏𝐫[zj<|zi|,ziκzi<κ]=0,\displaystyle=\mathop{{\bf Pr}\/}[z_{j}<-|z_{i}|,z_{i}\geq-\kappa\mid z_{i}<-\kappa]=0,
𝐏𝐫[E,zi+zj<0,zizj<0zi<κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}<0,z_{i}-z_{j}<0\mid z_{i}<-\kappa] =𝐏𝐫[zi<zj<zi,zj>κzi<κ]\displaystyle=\mathop{{\bf Pr}\/}[z_{i}<z_{j}<-z_{i},z_{j}>-\kappa\mid z_{i}<-\kappa]
=Φ(κ)Φ(zi).\displaystyle=\Phi(\kappa)-\Phi(z_{i}).

The sum of the four probabilities in the above is 𝐏𝐫[Ezi<κ]=Φ(κ)\mathop{{\bf Pr}\/}[E\mid z_{i}<-\kappa]=\Phi(\kappa). Similarly, we analyze the other case, zi>κz_{i}>\kappa:

𝐏𝐫[E,zi+zj0,zizj0zi>κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}\geq 0,z_{i}-z_{j}\geq 0\mid z_{i}>\kappa] =𝐏𝐫[zizjzi,zjκzi>κ]\displaystyle=\mathop{{\bf Pr}\/}[-z_{i}\leq z_{j}\leq z_{i},z_{j}\leq\kappa\mid z_{i}>\kappa]
=Φ(κ)Φc(zi),\displaystyle=\Phi(\kappa)-\Phi_{\rm c}(z_{i}),
𝐏𝐫[E,zi+zj0,zizj<0zi>κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}\geq 0,z_{i}-z_{j}<0\mid z_{i}>\kappa] =𝐏𝐫[zj>|zi|,ziκzi>κ]=0,\displaystyle=\mathop{{\bf Pr}\/}[z_{j}>|z_{i}|,z_{i}\leq\kappa\mid z_{i}>\kappa]=0,
𝐏𝐫[E,zi+zj<0,zizj0zi>κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}<0,z_{i}-z_{j}\geq 0\mid z_{i}>\kappa] =𝐏𝐫[zj<|zi|,ziκzi>κ]=Φc(zi),\displaystyle=\mathop{{\bf Pr}\/}[z_{j}<-|z_{i}|,z_{i}\geq-\kappa\mid z_{i}>\kappa]=\Phi_{\rm c}(z_{i}),
𝐏𝐫[E,zi+zj<0,zizj<0zi>κ]\displaystyle\mathop{{\bf Pr}\/}[E,z_{i}+z_{j}<0,z_{i}-z_{j}<0\mid z_{i}>\kappa] =𝐏𝐫[zj<zi,zj>zizi>κ]=0.\displaystyle=\mathop{{\bf Pr}\/}[z_{j}<-z_{i},z_{j}>z_{i}\mid z_{i}>\kappa]=0.

The sum of the four probabilities above is 𝐏𝐫[Ezi>κ]=Φ(κ)\mathop{{\bf Pr}\/}[E\mid z_{i}>\kappa]=\Phi(\kappa). Therefore, we obtain that

𝐏𝐫[|zi+zj||zizj|2κ|zi|>κ]=Φ(κ),\mathop{{\bf Pr}\/}[|z_{i}+z_{j}|-|z_{i}-z_{j}|\leq 2\kappa\mid|z_{i}|>\kappa]=\Phi(\kappa),

which justifies (15).

Next, note that 𝐏𝐫[|zi|κ]=Φ(κ)Φc(κ)\mathop{{\bf Pr}\/}[|z_{i}|\leq\kappa]=\Phi(\kappa)-\Phi_{\rm c}(\kappa) and 𝐏𝐫[|zi|>κ]=2Φc(κ)\mathop{{\bf Pr}\/}[|z_{i}|>\kappa]=2\Phi_{\rm c}(\kappa), so we have from (15) that

𝐏𝐫[h(𝐗ij)=0]\displaystyle\mathop{{\bf Pr}\/}[h^{*}(\mathbf{X}^{\prime}_{ij})=0] Φ(κ)Φc(κ)+2Φc(κ)Φ(κ)\displaystyle\leq\Phi(\kappa)-\Phi_{\rm c}(\kappa)+2\Phi_{\rm c}(\kappa)\Phi(\kappa)
=12Φc(κ)+2Φc(κ)Φ(κ)=12Φc(κ)2.\displaystyle=1-2\Phi_{\rm c}(\kappa)+2\Phi_{\rm c}(\kappa)\Phi(\kappa)=1-2\Phi_{\rm c}(\kappa)^{2}.

Thus, 𝐗ij\mathbf{X}^{\prime}_{ij} is misclassified with probability at least 2Φc(κ)22\Phi_{\rm c}(\kappa)^{2}.
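The identity behind this bound is easy to confirm numerically; the following Monte Carlo sketch (ours) checks that Pr[|z_i+z_j| − |z_i−z_j| ≤ 2κ] matches 1 − 2Φc(κ)² for independent standard normals.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
kappa, m = 0.7, 2_000_000
zi = rng.standard_normal(m)
zj = rng.standard_normal(m)
empirical = np.mean(np.abs(zi + zj) - np.abs(zi - zj) <= 2 * kappa)
exact = 1 - 2 * norm.sf(kappa) ** 2         # Phi_c(kappa) = norm.sf(kappa)
print("Monte Carlo:", empirical, " closed form:", exact)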

We will now construct sets of pairs with mutually independent elements, such that the union of those sets covers all inter-class edges. This will enable us to use a concentration argument that computes the fraction of the inter-class edges that are misclassified. Since the graph operations are permutation invariant, let us assume for simplicity that C0={1,,n2}C_{0}=\{1,\ldots,\frac{n}{2}\} and C1={n2+1,,n}C_{1}=\{\frac{n}{2}+1,\ldots,n\} for an even number of nodes nn. Also, define the function

m(i,l)={i+l,i+ln2,i+ln2,i+l>n2..m(i,l)=\begin{cases}i+l,&i+l\leq\frac{n}{2},\\ i+l-\frac{n}{2},&i+l>\frac{n}{2}.\end{cases}.

We now construct the following sequence of sets for all l{0,,n21}l\in\{0,\ldots,\frac{n}{2}-1\}:

Sl={(Xm(i,l),Xi+n2)for all iC0 such that (m(i,l),i+n/2)E}.S_{l}=\{(X_{m(i,l)},X_{i+\frac{n}{2}})\;\text{for all }i\in C_{0}\text{ such that }(m(i,l),i+n/2)\in E\}.
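For clarity, the index bookkeeping behind this construction can be sketched as follows (ours; `edges` is assumed to be an iterable of undirected node pairs, and the indexing is 1-based as in the proof).

def m(i, l, n):
    # The cyclic shift m(i, l) inside C_0 = {1, ..., n/2}.
    return i + l if i + l <= n // 2 else i + l - n // 2

def covering_sets(n, edges):
    # S_l collects the inter-class pairs (m(i, l), i + n/2) that are present as edges;
    # the union over l = 0, ..., n/2 - 1 covers every C_0 x C_1 pair, and within a
    # fixed S_l no node appears twice, which gives the mutual independence used below.
    edge_set = {frozenset(e) for e in edges}
    sets = []
    for l in range(n // 2):
        S_l = [(m(i, l, n), i + n // 2) for i in range(1, n // 2 + 1)
               if frozenset((m(i, l, n), i + n // 2)) in edge_set]
        sets.append(S_l)
    return sets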

Fix l{0,,n21}l\in\{0,\ldots,\frac{n}{2}-1\} and observe that the pairs in the set SlS_{l} are mutually independent. Define a Bernoulli random variable, βi\beta_{i}, to be the indicator that (Xm(i,l),Xi+n2)(X_{m(i,l)},X_{i+\frac{n}{2}}) is misclassified. We have that 𝐄[βi]2Φc(κ)2\mathop{{\bf E}\/}[\beta_{i}]\geq 2\Phi_{\rm c}(\kappa)^{2}. Note that the fraction of pairs in the set SlS_{l} that are misclassified is 1|Sl|i:(Xm(i,l),Xi+n/2)Slβi\frac{1}{|S_{l}|}\sum_{i:(X_{m(i,l)},X_{i+n/2})\in S_{l}}\beta_{i}, which is a sum of independent Bernoulli random variables. Hence, by the additive Chernoff bound, we obtain

𝐏𝐫[iC0Nm(i,l)βi2|Sl|Φc(κ)2|Sl|t]1exp(2|Sl|t2).\mathop{{\bf Pr}\/}\left[\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}\geq 2|S_{l}|\Phi_{\rm c}(\kappa)^{2}-|S_{l}|t\right]\geq 1-\exp(-2|S_{l}|t^{2}).

Since p,q=Ω(log2nn)p,q=\Omega(\frac{\log^{2}n}{n}), we have by the Chernoff bound and a union bound that with probability at least 11/poly(n)1-1/{\rm poly}(n), |Sl|=nq(1±o(1))|S_{l}|=nq(1\pm o(1)) for all ll. We now choose t=Clogn|Sl|=o(1)t=\sqrt{\frac{C\log n}{|S_{l}|}}=o(1) to obtain that on the event where |Sl|=nq(1±o(1))|S_{l}|=nq(1\pm o(1)), we have the following for any large C>1C>1:

𝐏𝐫[1|Sl|iC0Nm(i,l)βi2Φc(κ)2o(1)]1nC.\mathop{{\bf Pr}\/}\left[\frac{1}{|S_{l}|}\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}\geq 2\Phi_{\rm c}(\kappa)^{2}-o(1)\right]\geq 1-n^{-C}.

Following a union bound over all l{0,,n21}l\in\{0,\ldots,\frac{n}{2}-1\}, we conclude that for any c>0c>0,

𝐏𝐫[1|Sl|iC0Nm(i,l)βi2Φc(κ)2o(1),l{0,,n21}]1O(nc).\mathop{{\bf Pr}\/}\left[\frac{1}{|S_{l}|}\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}\geq 2\Phi_{\rm c}(\kappa)^{2}-o(1),\;\;\forall l\in\left\{0,\ldots,\frac{n}{2}-1\right\}\right]\geq 1-O(n^{-c}).

Thus, out of all the pairs 𝐗ij\mathbf{X}^{\prime}_{ij} with jij\nsim i, with probability at least 1o(1)1-o(1), we have that at least a fraction 2Φc(κ)22\Phi_{\rm c}(\kappa)^{2} of the pairs are misclassified by the attention mechanism. This concludes part 1 of the theorem.
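
For concreteness, here is a small Python check of the combinatorial structure used above (the value of n is an illustrative even number): for each fixed l, the map i ↦ (m(i,l), i+n/2) matches every node of C_0 with a distinct node of C_1, and as l ranges over {0,…,n/2-1} these matchings enumerate every C_0–C_1 pair exactly once, so the sets S_l indeed cover all potential inter-class edges.

n = 10                                           # illustrative even n
half = n // 2

def m(i, l):                                     # the shift map defined above, 1-indexed as in the text
    return i + l if i + l <= half else i + l - half

all_pairs = set()
for l in range(half):
    pairs = [(m(i, l), i + half) for i in range(1, half + 1)]
    assert sorted(a for a, _ in pairs) == list(range(1, half + 1))   # a perfect matching of C_0 to C_1
    all_pairs.update(pairs)
assert all_pairs == {(a, b) for a in range(1, half + 1) for b in range(half + 1, n + 1)}
print("every C_0-C_1 pair appears in exactly one matching")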

For part 2, note that by the additive Chernoff bound we have for any t(0,1)t\in(0,1),

𝐏𝐫[iC0Nm(i,l)βi2|Sl|Φc(κ)2|Sl|t]1exp(2|Sl|t2).\mathop{{\bf Pr}\/}\left[\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}\geq 2|S_{l}|\Phi_{\rm c}(\kappa)^{2}-|S_{l}|t\right]\geq 1-\exp(-2|S_{l}|t^{2}).

Since |Sl|=nq2(1±o(1))|S_{l}|=\frac{nq}{2}(1\pm o(1)) with probability at least 1-1/{\rm poly}(n), we choose t=2KΦc(κ)2log2nnqt=2\sqrt{\frac{K\Phi_{\rm c}(\kappa)^{2}\log^{2}n}{nq}} to obtain

𝐏𝐫[iC0Nm(i,l)βinqΦc(κ)2(1±o(1))KnqΦc(κ)2log2n]1O(n8KΦc(κ)2logn).\mathop{{\bf Pr}\/}\left[\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}\geq nq\Phi_{\rm c}(\kappa)^{2}(1\pm o(1))-\sqrt{Knq\Phi_{\rm c}(\kappa)^{2}\log^{2}n}\right]\geq 1-O(n^{-8K\Phi_{\rm c}(\kappa)^{2}\log n}).

Now note that if q>Klog2nnΦc(κ)2q>\frac{K\log^{2}n}{n\Phi_{\rm c}(\kappa)^{2}} then we have nqΦc(κ)2>Klog2nnq\Phi_{\rm c}(\kappa)^{2}>K\log^{2}n, which implies that

nqΦc(κ)2KnqΦc(κ)2log2n>0.nq\Phi_{\rm c}(\kappa)^{2}-\sqrt{Knq\Phi_{\rm c}(\kappa)^{2}\log^{2}n}>0.

Hence, in this regime of qq,

𝐏𝐫[iC0Nm(i,l)βi>0]1O(n8KΦc(κ)2logn),\mathop{{\bf Pr}\/}\left[\sum_{i\in C_{0}\cap N_{m(i,l)}}\beta_{i}>0\right]\geq 1-O(n^{-8K\Phi_{\rm c}(\kappa)^{2}\log n}),

and the proof is complete.
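
As a quick numerical illustration of the threshold in part 2 (with illustrative values of n, K and κ): once q exceeds K log²n/(nΦ_c(κ)²), the centering term nqΦ_c(κ)² dominates the deviation term, so the Chernoff lower bound above is strictly positive.

import numpy as np
from scipy.stats import norm

n, K, kappa = 10_000, 2.0, 1.0                        # illustrative values
Phic = norm.sf(kappa)                                 # Phi_c(kappa)
q = 1.1 * K * np.log(n) ** 2 / (n * Phic ** 2)        # q just above the stated threshold
deviation = np.sqrt(K * n * q * Phic ** 2 * np.log(n) ** 2)
print(n * q * Phic ** 2 > K * np.log(n) ** 2)         # True: n q Phi_c(kappa)^2 > K log^2 n
print(n * q * Phic ** 2 - deviation > 0)              # True: the lower bound is positive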

A.6 Proof of Theorem 10

We restate Theorem 10 for convenience.

Theorem.

Assume that 𝛍Kσ\|\boldsymbol{\mu}\|\leq K\sigma and σK\sigma\leq K^{\prime} for some absolute constants KK and KK^{\prime}. Moreover, assume that the parameters (𝐰,𝐚,b)d×2×(\boldsymbol{w},\boldsymbol{a},b)\in\mathbb{R}^{d}\times\mathbb{R}^{2}\times\mathbb{R} are bounded. Then, with probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), there exists a subset 𝒜[n]\mathcal{A}\subseteq[n] with cardinality at least n(1o(1))n(1-o(1)) such that for all i𝒜i\in\mathcal{A} the following hold:

  1. 1.

    There is a subset Ji,0NiC0J_{i,0}\subseteq N_{i}\cap C_{0} with cardinality at least 910|NiC0|\frac{9}{10}|N_{i}\cap C_{0}|, such that γij=Θ(1/|Ni|)\gamma_{ij}=\Theta(1/|N_{i}|) for all jJi,0j\in J_{i,0}.

  2. 2.

    There is a subset Ji,1NiC1J_{i,1}\subseteq N_{i}\cap C_{1} with cardinality at least 910|NiC1|\frac{9}{10}|N_{i}\cap C_{1}|, such that γij=Θ(1/|Ni|)\gamma_{ij}=\Theta(1/|N_{i}|) for all jJi,1j\in J_{i,1}.

For i[n]i\in[n] let us write 𝐗i=(2ϵi1)𝝁+σ𝒈i\mathbf{X}_{i}=(2{\epsilon}_{i}-1)\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i} where 𝒈iN(0,𝐈)\boldsymbol{g}_{i}\sim N(0,\mathbf{I}), ϵi=0{\epsilon}_{i}=0 if iC0i\in C_{0} and ϵi=1{\epsilon}_{i}=1 if iC1i\in C_{1}. Moreover, since the parameters (𝒘,𝒂,b)d×2×(\boldsymbol{w},\boldsymbol{a},b)\in\mathbb{R}^{d}\times\mathbb{R}^{2}\times\mathbb{R} are bounded, we can write 𝒘=R𝒘^\boldsymbol{w}=R\hat{\boldsymbol{w}} and 𝒂=R𝒂^\boldsymbol{a}=R^{\prime}\hat{\boldsymbol{a}} such that 𝒘^=1\|\hat{\boldsymbol{w}}\|=1 and 𝒂^=1\|\hat{\boldsymbol{a}}\|=1 and R,RR,R^{\prime} are some constants. We define the following sets which will become useful later in our computation of γij\gamma_{ij}’s. Define

𝒜=def{i[n]||𝒂^1𝒘^T𝒈i|10log(n(p+q)),and|𝒂^2𝒘^T𝒈j|10log(n(p+q)),jNi}.\mathcal{A}\stackrel{{\scriptstyle\rm def}}{{=}}\left\{i\in[n]~{}\bigg{|}~{}\begin{array}[]{l}|\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\leq 10\sqrt{\log(n(p+q))},\ \mbox{and}\\ |\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\leq 10\sqrt{\log(n(p+q))},\ \forall j\in N_{i}\end{array}\right\}.

For i[n]i\in[n] define

Ji,0\displaystyle J_{i,0} =def{jNiC0||𝒂^2𝒘^T𝒈j|10},\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\left\{j\in N_{i}\cap C_{0}~{}|~{}|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\leq\sqrt{10}\right\},
Ji,1\displaystyle J_{i,1} =def{jNiC1||𝒂^2𝒘^T𝒈j|10},\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\left\{j\in N_{i}\cap C_{1}~{}|~{}|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\leq\sqrt{10}\right\},
Bi,0t\displaystyle B_{i,0}^{t} =def{jNiC0|2t1𝒂^2𝒘^T𝒈j2t},t=1,2,,T,\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\left\{j\in N_{i}\cap C_{0}~{}|~{}2^{t-1}\leq\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}\leq 2^{t}\right\},\ t=1,2,\ldots,T,
Bi,1t\displaystyle B_{i,1}^{t} =def{jNiC1|2t1𝒂^2𝒘^T𝒈j2t},t=1,2,,T,\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\left\{j\in N_{i}\cap C_{1}~{}|~{}2^{t-1}\leq\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}\leq 2^{t}\right\},\ t=1,2,\ldots,T,

where T=deflog2(10log(n(p+q)))T\stackrel{{\scriptstyle\rm def}}{{=}}\left\lceil\log_{2}\left(10\sqrt{\log(n(p+q))}\right)\right\rceil.
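
The next Python sketch illustrates the bookkeeping behind these definitions (n, p, q and the number of sampled projections are illustrative, and absolute values are used as stand-ins for the signed projections): it computes T, marks the projections of magnitude at most √10 that would fall into the sets J_{i,c}, and checks that any projection between 1 and the truncation level 10√(log(n(p+q))) lands in one of the T dyadic buckets B_{i,c}^t.

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 10_000, 0.05, 0.02                        # illustrative parameters
cap = 10 * np.sqrt(np.log(n * (p + q)))             # truncation level appearing in the set A
T = int(np.ceil(np.log2(cap)))                      # number of dyadic buckets
vals = np.abs(rng.standard_normal(1_000))           # stand-ins for a2_hat * w_hat^T g_j
vals = vals[vals <= cap]                            # for i in A, neighbour projections stay below cap
in_J = vals <= np.sqrt(10)                          # these would belong to J_{i,c}
large = vals[vals > 1]                              # these are covered by the buckets B_{i,c}^t
bucket = np.maximum(1, np.ceil(np.log2(large))).astype(int)   # t with 2^{t-1} <= value <= 2^t
if large.size:
    assert bucket.max() <= T
print(T, in_J.mean(), np.bincount(bucket, minlength=T + 1)[1:])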

We start with a few claims about the sizes of these sets.

Claim 20.

With probability at least 1o(1)1-o(1), we have that |𝒜|n(1o(1))|\mathcal{A}|\geq n(1-o(1)).

Proof:  Because |\hat{\boldsymbol{a}}_{1}|\leq 1 and |\hat{\boldsymbol{a}}_{2}|\leq 1 (since \|\hat{\boldsymbol{a}}\|=1), we know that 𝒜\mathcal{A} is a superset of 𝒜\mathcal{A}^{\prime}, where

𝒜=def{i[n]||𝒘^T𝒈i|10log(n(p+q)),and|𝒘^T𝒈j|10log(n(p+q)),jNi}.\mathcal{A}^{\prime}\stackrel{{\scriptstyle\rm def}}{{=}}\left\{i\in[n]~{}\bigg{|}~{}\begin{array}[]{l}|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\leq 10\sqrt{\log(n(p+q))},\ \mbox{and}\\ |\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\leq 10\sqrt{\log(n(p+q))},\ \forall j\in N_{i}\end{array}\right\}.

We give a lower bound for |𝒜||\mathcal{A}^{\prime}| and hence prove the result. First of all, note that if p+qΩ(1/log2n)p+q\geq\Omega(1/\log^{2}n), then log(n(p+q))=logn(1o(1))\log(n(p+q))=\log n(1-o(1)) and we easily get that with probability at least 1o(1)1-o(1), |𝒘^T𝒈i|10log(n(p+q))|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\leq 10\sqrt{\log(n(p+q))} for all i[n]i\in[n], and thus |𝒜|=|𝒜|=n|\mathcal{A}|=|\mathcal{A}^{\prime}|=n. Therefore let us assume without loss of generality that p+qO(1/log2n)p+q\leq O(1/\log^{2}n). Consider the following sum of indicator random variables

S=defi[n]𝟏{|𝒘^T𝒈i|10log(n(p+q))}.S\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{i\in[n]}{\bf 1}_{\left\{|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\geq 10\sqrt{\log(n(p+q))}\right\}}.

By the multiplicative Chernoff bound, for any δ>0\delta>0 we have

𝐏𝐫[Snb(1+δ)]exp(δ22+δnb)\operatorname{{\bf Pr}}\left[S\geq nb(1+\delta)\right]\leq\exp\left(-\frac{\delta^{2}}{2+\delta}nb\right)

where b=def𝐏𝐫(|𝒘^T𝒈i|10log(n(p+q)))b\stackrel{{\scriptstyle\rm def}}{{=}}\operatorname{{\bf Pr}}(|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\geq 10\sqrt{\log(n(p+q))}). Moreover, by the standard upper bound on the Gaussian tail probability (Proposition 2.1.2, [45]) we know that b<e50log(n(p+q))b<e^{-50\log(n(p+q))}. Let us set

δ=def1bn(p+q)logn.\delta\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{bn(p+q)\log n}.

Then by the upper bound on bb and the assumption that p,q=Ω(log2n/n)p,q=\Omega(\log^{2}n/n) we know that

δ(n(p+q))49lognΩ(log97n)=ω(1).\delta\geq\frac{(n(p+q))^{49}}{\log n}\geq\Omega(\log^{97}n)=\omega(1).

It follows that

δ22+δnbΩ(δnb)=Ω(1(p+q)logn)Ω(logn).\frac{\delta^{2}}{2+\delta}nb\geq\Omega(\delta nb)=\Omega\left(\frac{1}{(p+q)\log n}\right)\geq\Omega(\log n).

Therefore, with probability at least 1o(1)1-o(1) we have that

Snb(1+δ)n(n(p+q))50+nn(p+q)logn=O(nn(p+q)logn).S\leq nb(1+\delta)\leq\frac{n}{(n(p+q))^{50}}+\frac{n}{n(p+q)\log n}=O\left(\frac{n}{n(p+q)\log n}\right).

Applying the concentration result for node degrees, this means that with probability at least 1o(1)1-o(1),

|{i[n]||𝒘^T𝒈i|10log(n(p+q))orjNisuch that|𝒘^T𝒈j|10log(n(p+q))}|\displaystyle\left|\left\{i\in[n]~{}\big{|}~{}|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|\geq 10\sqrt{\log(n(p+q))}\ \mbox{or}\ \exists j\in N_{i}\ \mbox{such that}\ |\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\geq 10\sqrt{\log(n(p+q))}\right\}\right|
Sn2(p+q)(1±o(1))=O(nn(p+q)logn)n2(p+q)(1±o(1))=O(nlogn).\displaystyle\leq\ S\cdot\frac{n}{2}(p+q)(1\pm o(1))=O\left(\frac{n}{n(p+q)\log n}\right)\cdot\frac{n}{2}(p+q)(1\pm o(1))=O\left(\frac{n}{\log n}\right).

Therefore we have

|𝒜|nO(n/logn)=n(1o(1)).|\mathcal{A}^{\prime}|\geq n-O(n/\log n)=n(1-o(1)).
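
As a numerical sanity check on the tail estimate used above (n, p, q below are illustrative values satisfying Assumption 1), one can compare b with the bound e^{-50\log(n(p+q))}=(n(p+q))^{-50}:

import numpy as np
from scipy.stats import norm

n = 10_000
p = q = np.log(n) ** 2 / n                          # illustrative parameters satisfying Assumption 1
threshold = 10 * np.sqrt(np.log(n * (p + q)))
b = 2 * norm.sf(threshold)                          # Pr[|w_hat^T g_i| >= threshold] for a unit w_hat
print(b, (n * (p + q)) ** -50.0, b < (n * (p + q)) ** -50.0)   # the bound holds comfortably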

 

Claim 21.

With probability at least 1o(1)1-o(1), we have that for all i[n]i\in[n],

|Ji,0|910|NiC0|and|Ji,1|910|NiC1|.|J_{i,0}|\geq\frac{9}{10}|N_{i}\cap C_{0}|~{}~{}\mbox{and}~{}~{}|J_{i,1}|\geq\frac{9}{10}|N_{i}\cap C_{1}|.

Proof:  We prove the result for Ji,0J_{i,0}; the result for Ji,1J_{i,1} follows analogously. First, fix i[n]i\in[n]. For each jNiC0j\in N_{i}\cap C_{0} we have that

\operatorname{{\bf Pr}}[|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\geq\sqrt{10}]\leq\operatorname{{\bf Pr}}[|\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\geq\sqrt{10}]\leq e^{-5}.

Denote Ji,0c=def(NiC0)Ji,0J_{i,0}^{c}\stackrel{{\scriptstyle\rm def}}{{=}}(N_{i}\cap C_{0})\setminus J_{i,0}. We have that

\mathop{{\bf E}\/}[|J_{i,0}^{c}|]=\mathop{{\bf E}\/}\left[\sum_{j\in N_{i}\cap C_{0}}{\bf 1}_{\left\{|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|\geq\sqrt{10}\right\}}\right]\leq e^{-5}|N_{i}\cap C_{0}|.

Applying Chernoff’s inequality (Theorem 2.3.4 in [45]), we have

𝐏𝐫[|Ji,0c|110|NiC0|]\displaystyle\operatorname{{\bf Pr}}\left[|J_{i,0}^{c}|\geq\frac{1}{10}|N_{i}\cap C_{0}|\right] e𝐄[|Ji,0c|](e𝐄[|Ji,0c|]|NiC0|/10)|NiC0|/10\displaystyle\leq e^{-\mathop{{\bf E}\/}[|J_{i,0}^{c}|]}\left(\frac{e\mathop{{\bf E}\/}[|J_{i,0}^{c}|]}{|N_{i}\cap C_{0}|/10}\right)^{|N_{i}\cap C_{0}|/10}
\displaystyle\leq\left(\frac{ee^{-5}|N_{i}\cap C_{0}|}{|N_{i}\cap C_{0}|/10}\right)^{|N_{i}\cap C_{0}|/10}
=exp((12log1010110)|NiC0|)\displaystyle=\exp\left(-\left(\frac{1}{2}-\frac{\log 10}{10}-\frac{1}{10}\right)|N_{i}\cap C_{0}|\right)
exp(425|NiC0|).\displaystyle\leq\exp\left(-\frac{4}{25}|N_{i}\cap C_{0}|\right).

Applying the union bound, we get

𝐏𝐫[|Ji,0|910|C0Ni|,i[n]]\displaystyle\operatorname{{\bf Pr}}\left[|J_{i,0}|\geq\frac{9}{10}|C_{0}\cap N_{i}|,\forall i\in[n]\right] 1i[n]exp(425|NiC0|)\displaystyle\geq 1-\sum_{i\in[n]}\exp\left(-\frac{4}{25}|N_{i}\cap C_{0}|\right)
𝐏𝐫(𝓔3)(1i[n]exp(425nmin(p,q)(1o(1))2))\displaystyle\geq\operatorname{{\bf Pr}}(\boldsymbol{\mathcal{E}}_{3})\cdot\left(1-\sum_{i\in[n]}\exp\left(-\frac{4}{25}\frac{n\min(p,q)(1-o(1))}{2}\right)\right)
=(1o(1))(1nexp(2nmin(p,q)(1o(1))25))\displaystyle=(1-o(1))\cdot\left(1-n\exp\left(-\frac{2n\min(p,q)(1-o(1))}{25}\right)\right)
=1o(1).\displaystyle=1-o(1).

The second inequality follows because |NiC0|n2min(p,q)(1o(1))|N_{i}\cap C_{0}|\geq\frac{n}{2}\min(p,q)(1-o(1)) under the event 𝓔3\boldsymbol{\mathcal{E}}_{3} (cf. Definition 17) for all i[n]i\in[n]. The last equality is due to our assumption that p,q=Ω(log2nn)p,q=\Omega(\frac{\log^{2}n}{n}).     
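
The constants in this proof can be sanity-checked numerically; the following Python snippet (with an illustrative neighbourhood size m standing in for |N_i∩C_0|) verifies the Gaussian tail bound e^{-5} on Pr[|ŵᵀg_j| ≥ √10] and compares the exact binomial tail with the Chernoff-type bound used above and with the final estimate exp(-4m/25).

import numpy as np
from scipy.stats import norm, binom

tail = 2 * norm.sf(np.sqrt(10))                     # Pr[|N(0,1)| >= sqrt(10)], about 1.6e-3
print(tail, np.exp(-5), tail <= np.exp(-5))         # the e^{-5} bound holds
m = 200                                             # illustrative stand-in for |N_i ∩ C_0|
mean = m * tail                                     # upper bound on E[|J_{i,0}^c|]
a = m / 10
chernoff = np.exp(-mean) * (np.e * mean / a) ** a   # the Chernoff-type bound from the proof
print(binom.sf(a - 1, m, tail), chernoff, np.exp(-4 * m / 25))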

Claim 22.

With probability at least 1o(1)1-o(1), we have that for all i[n]i\in[n] and for all t[T]t\in[T],

|Bi,0t|𝐄[|Bi,0t|]+T|NiC0|45and|Bi,1t|𝐄[|Bi,1t|]+T|NiC1|45.|B_{i,0}^{t}|\leq\mathop{{\bf E}\/}[|B_{i,0}^{t}|]+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}}~{}~{}\mbox{and}~{}~{}|B_{i,1}^{t}|\leq\mathop{{\bf E}\/}[|B_{i,1}^{t}|]+\sqrt{T}|N_{i}\cap C_{1}|^{\frac{4}{5}}.

Proof:  We prove the result for Bi,0tB_{i,0}^{t}, and the result for Bi,1tB_{i,1}^{t} follows analogously. First fix i[n]i\in[n] and t[T]t\in[T]. By the additive Chernoff inequality, we have

𝐏𝐫(|Bi,0t|𝐄[|Bi,0t|]+|NiC0|T|NiC0|15)e2T|NiC0|3/5.\operatorname{{\bf Pr}}\left(|B_{i,0}^{t}|\geq\mathop{{\bf E}\/}[|B_{i,0}^{t}|]+|N_{i}\cap C_{0}|\cdot\sqrt{T}|N_{i}\cap C_{0}|^{-\frac{1}{5}}\right)\leq e^{-2T|N_{i}\cap C_{0}|^{3/5}}.

Taking a union bound over all i[n]i\in[n] and t[T]t\in[T] we get

𝐏𝐫[i[n]t[T]{|Bi,0t|𝐄[|Bi,0t|]+T|NiC0|45}]\displaystyle\operatorname{{\bf Pr}}\left[\bigcup_{i\in[n]}\bigcup_{t\in[T]}\left\{|B_{i,0}^{t}|\geq\mathop{{\bf E}\/}[|B_{i,0}^{t}|]+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}}\right\}\right]
\displaystyle\leq~{} nTexp(2T(n2min(p,q)(1o(1)))3/5)+o(1)=o(1),\displaystyle nT\exp\left(-2T\left(\frac{n}{2}\min(p,q)(1-o(1))\right)^{3/5}\right)+o(1)~{}=~{}o(1),

where the last equality follows from Assumption 1 that p,q=Ω(log2nn)p,q=\Omega(\frac{\log^{2}n}{n}), and hence

nTexp(2T(n2min(p,q)(1o(1)))3/5)\displaystyle nT\exp\left(-2T\left(\frac{n}{2}\min(p,q)(1-o(1))\right)^{3/5}\right) =nTexp(ω(2Tlogn))=O(nc)\displaystyle=nT\exp\left(-\omega\left(\sqrt{2}T\log n\right)\right)=O\left(n^{-c}\right)

for some absolute constant c>0c>0. Moreover, we have used degree concentration, which introduced the additional additive o(1)o(1) term in the probability upper bound. Therefore we have

𝐏𝐫[|Bi,0t|𝐄[|Bi,0t|]+T|NiC0|45,i[n]t[T]]1o(1).\operatorname{{\bf Pr}}\left[|B_{i,0}^{t}|\leq\mathop{{\bf E}\/}[|B_{i,0}^{t}|]+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}},\forall i\in[n]~{}\forall t\in[T]\right]\geq 1-o(1).

 

We start by defining an event 𝓔#\boldsymbol{\mathcal{E}}^{\#} which is the intersection of the following events over the randomness of 𝐀\mathbf{A} and {ϵi}i\{{\epsilon}_{i}\}_{i} and 𝐗i=(2ϵi1)𝝁+σ𝒈i\mathbf{X}_{i}=(2{\epsilon}_{i}-1)\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i},

  • 𝓔1\boldsymbol{\mathcal{E}}_{1}^{\prime} is the event that for each i[n]i\in[n], |C0Ni|=n2((1ϵi)p+ϵiq)(1±o(1))|C_{0}\cap N_{i}|=\frac{n}{2}((1-{\epsilon}_{i})p+{\epsilon}_{i}q)(1\pm o(1)) and |C1Ni|=n2((1ϵi)q+ϵip)(1±o(1))|C_{1}\cap N_{i}|=\frac{n}{2}((1-{\epsilon}_{i})q+{\epsilon}_{i}p)(1\pm o(1)).

  • 𝓔2\boldsymbol{\mathcal{E}}_{2}^{\prime} is the event that |\mathcal{A}|\geq n(1-o(1)).

  • 𝓔3\boldsymbol{\mathcal{E}}_{3}^{\prime} is the event that |Ji,0|910|NiC0||J_{i,0}|\geq\frac{9}{10}|N_{i}\cap C_{0}| and |Ji,1|910|NiC1||J_{i,1}|\geq\frac{9}{10}|N_{i}\cap C_{1}| for all i[n]i\in[n].

  • 𝓔4\boldsymbol{\mathcal{E}}_{4}^{\prime} is the event that |Bi,0t|𝐄[|Bi,0t|]+T|NiC0|45|B_{i,0}^{t}|\leq\mathop{{\bf E}\/}[|B_{i,0}^{t}|]+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}} and |Bi,1t|𝐄[|Bi,1t|]+T|NiC1|45|B_{i,1}^{t}|\leq\mathop{{\bf E}\/}[|B_{i,1}^{t}|]+\sqrt{T}|N_{i}\cap C_{1}|^{\frac{4}{5}} for all i[n]i\in[n] and for all t[T]t\in[T].

By Claims 20, 21, and 22, we get that with probability at least 1o(1)1-o(1), the event 𝓔#=defi=14𝓔i\boldsymbol{\mathcal{E}}^{\#}\stackrel{{\scriptstyle\rm def}}{{=}}\bigcap_{i=1}^{4}\boldsymbol{\mathcal{E}}_{i}^{\prime} holds. We will show that under event 𝓔#\boldsymbol{\mathcal{E}}^{\#}, for all i𝒜i\in\mathcal{A}, for all jJi,cj\in J_{i,c} where c{0,1}c\in\{0,1\}, we have γij=Θ(1/|Ni|)\gamma_{ij}=\Theta(1/|N_{i}|). This will prove Theorem 10.

Fix i𝒜i\in\mathcal{A} and some jJi,0j\in J_{i,0}. Let us consider

γij\displaystyle\gamma_{ij} =exp(LeakyRelu(𝒂1𝒘T𝐗i+𝒂2𝒘T𝐗j+b))kNiexp(LeakyRelu(𝒂1𝒘T𝐗i+𝒂2𝒘T𝐗k+b))\displaystyle=\frac{\exp\left(\mathrm{LeakyRelu}(\boldsymbol{a}_{1}\boldsymbol{w}^{T}\mathbf{X}_{i}+\boldsymbol{a}_{2}\boldsymbol{w}^{T}\mathbf{X}_{j}+b)\right)}{\sum_{k\in N_{i}}\exp\left(\mathrm{LeakyRelu}(\boldsymbol{a}_{1}\boldsymbol{w}^{T}\mathbf{X}_{i}+\boldsymbol{a}_{2}\boldsymbol{w}^{T}\mathbf{X}_{k}+b)\right)}
=exp(σRRLeakyRelu(κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b))kNiexp(σRRLeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b))\displaystyle=\frac{\exp\left(\sigma RR^{\prime}\ \mathrm{LeakyRelu}(\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime})\right)}{\sum_{k\in N_{i}}\exp\left(\sigma RR^{\prime}\ \mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})\right)}
=1kNiexp(ΔikΔij)\displaystyle=\frac{1}{\sum_{k\in N_{i}}\exp(\Delta_{ik}-\Delta_{ij})}

where for lNil\in N_{i}, we denote

κil\displaystyle\kappa_{il} =def(2ϵi1)𝒘^T𝝁/σ+(2ϵl1)𝒘^T𝝁/σ,\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}(2{\epsilon}_{i}-1)\hat{\boldsymbol{w}}^{T}\boldsymbol{\mu}/\sigma+(2{\epsilon}_{l}-1)\hat{\boldsymbol{w}}^{T}\boldsymbol{\mu}/\sigma,
Δil\displaystyle\Delta_{il} =defσRRLeakyRelu(κil+𝒂^1𝒘T𝒈i+𝒂^2𝒘T𝒈l+b),\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\sigma RR^{\prime}\ \mathrm{LeakyRelu}(\kappa_{il}+\hat{\boldsymbol{a}}_{1}\boldsymbol{w}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\boldsymbol{w}^{T}\boldsymbol{g}_{l}+b^{\prime}),

and b=σRRbb=\sigma RR^{\prime}b^{\prime}. We will show that

kNiexp(ΔikΔij)=Θ(|Ni|)\sum_{k\in N_{i}}\exp(\Delta_{ik}-\Delta_{ij})=\Theta(|N_{i}|)

and hence conclude that γij=Θ(1/|Ni|)\gamma_{ij}=\Theta(1/|N_{i}|). First of all, note that since 𝝁Kσ\|\boldsymbol{\mu}\|\leq K\sigma for some absolute constant KK, we know that

|\kappa_{il}|\leq 2K=O(1).

Let us assume that 𝒂^1𝒘^T𝒈i0\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\geq 0 and consider the following two cases regarding the magnitude of 𝒂^1𝒘^T𝒈i\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}.

Case 1. If κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b<0\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime}<0, then

ΔikΔij\displaystyle\Delta_{ik}-\Delta_{ij} =σRR(LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)\displaystyle=\sigma RR^{\prime}\Big{(}\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})
LeakyRelu(κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b))\displaystyle\qquad-\mathrm{LeakyRelu}(\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime})\Big{)}
=σRR(LeakyRelu(𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k±O(1))\displaystyle=\sigma RR^{\prime}\Big{(}\mathrm{LeakyRelu}(\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1))
β(κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b))\displaystyle\qquad-\beta(\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime})\Big{)}
=σRR(LeakyRelu(𝒂^2𝒘^T𝒈k±O(1))±O(1))\displaystyle=\sigma RR^{\prime}\left(\mathrm{LeakyRelu}(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1))\pm O(1)\right)
=σRR(Θ(𝒂^2𝒘^T𝒈k)±O(1)),\displaystyle=\sigma RR^{\prime}\left(\Theta(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right),

where β\beta is the slope of LeakyRelu(x)\mathrm{LeakyRelu}(x) for x<0x<0. Here, the second equality follows from |\kappa_{ik}+b^{\prime}|\leq 2K+|b^{\prime}|=O(1) and κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b<0\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime}<0. The third equality follows from

  • We have jJi,0j\in J_{i,0} and hence |𝒂^2𝒘^T𝒈j|=O(1)|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|=O(1);

  • We have κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b<0\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime}<0, so 𝒂^1𝒘^T𝒈i<|κij|+|𝒂^2𝒘^T𝒈j|+|b|=O(1)\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}<|\kappa_{ij}|+|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|+|b^{\prime}|=O(1), moreover, because 𝒂^1𝒘^T𝒈i0\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\geq 0, we get that |𝒂^1𝒘^T𝒈i|=O(1)|\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|=O(1);

  • We have |κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b||𝒂^1𝒘^T𝒈i|+|𝒂^2𝒘^T𝒈j|+|κij+b|=O(1)+O(1)+O(1)=O(1)|\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime}|\leq|\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}|+|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}|+|\kappa_{ij}+b^{\prime}|=O(1)+O(1)+O(1)=O(1).

Case 2. If κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b0\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime}\geq 0, then

ΔikΔij\displaystyle\Delta_{ik}-\Delta_{ij} =σRR(LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)\displaystyle=\sigma RR^{\prime}\Big{(}\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})
LeakyRelu(κij+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈j+b))\displaystyle\qquad-\mathrm{LeakyRelu}(\kappa_{ij}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}+b^{\prime})\Big{)}
=σRR(LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)\displaystyle=\sigma RR^{\prime}\Big{(}\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})
κij𝒂^1𝒘^T𝒈i𝒂^2𝒘^T𝒈jb)\displaystyle\qquad-\kappa_{ij}-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}-\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{j}-b^{\prime}\Big{)}
=σRR(LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)𝒂^1𝒘^T𝒈i±O(1))\displaystyle=\sigma RR^{\prime}\left(\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)\right)
{=σRR(Θ(𝒂^2𝒘^T𝒈k)±O(1)),ifkJi,0Ji,1σRR(O(𝒂^2𝒘^T𝒈k)±O(1)),otherwise.\displaystyle\left\{\begin{array}[]{ll}=\sigma RR^{\prime}\left(\Theta(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right),&\mbox{if}\ k\in J_{i,0}\cup J_{i,1}\\ \leq\sigma RR^{\prime}\left(O(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right),&\mbox{otherwise}.\end{array}\right.

To see the last (in)equality in the above, consider the following cases:

  1. 1.

    If kJi,0Ji,1k\in J_{i,0}\cup J_{i,1}, then there are two cases depending on the sign of κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}.

    • If κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b0\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}\geq 0, then we have that

      LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)𝒂^1𝒘^T𝒈i±O(1)\displaystyle\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b𝒂^1𝒘^T𝒈i±O(1)\displaystyle\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} 𝒂^2𝒘^T𝒈k+κik+b±O(1)\displaystyle\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+\kappa_{ik}+b^{\prime}\pm O(1)
      =\displaystyle=~{} 𝒂^2𝒘^T𝒈k±O(1).\displaystyle\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1).
    • If κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b<0\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}<0, then because 𝒂^1𝒘^T𝒈i0\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\geq 0 and |κik+𝒂^2𝒘^T𝒈k+b||κik|+|𝒂^2𝒘^T𝒈k|+|b|=O(1)|\kappa_{ik}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}|\leq|\kappa_{ik}|+|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}|+|b^{\prime}|=O(1), we know that 𝒂^1𝒘^T𝒈i<|κik|+|𝒂^2𝒘^T𝒈k|+|b|=O(1)\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}<|\kappa_{ik}|+|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}|+|b^{\prime}|=O(1) and |κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b|=O(1)|\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}|=O(1). Therefore it follows that

      LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)𝒂^1𝒘^T𝒈i±O(1)\displaystyle\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} LeakyRelu(±O(1))O(1)±O(1)\displaystyle\mathrm{LeakyRelu}(\pm O(1))-O(1)\pm O(1)
      =\displaystyle=~{} ±O(1)\displaystyle\pm O(1)
      =\displaystyle=~{} 𝒂^2𝒘^T𝒈k±O(1)\displaystyle\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1)

      where the last equality is due to the fact that kJi,0Ji,1k\in J_{i,0}\cup J_{i,1} so |𝒂^2𝒘^T𝒈k|=O(1)|\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}|=O(1).

  2. 2.

    If kJi,0Ji,1k\not\in J_{i,0}\cup J_{i,1}, then there are two cases depending on the sign of κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}.

    • If κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b0\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}\geq 0, then we have that

      LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)𝒂^1𝒘^T𝒈i±O(1)\displaystyle\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b𝒂^1𝒘^T𝒈i±O(1)\displaystyle\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} 𝒂^2𝒘^T𝒈k+κik+b±O(1)\displaystyle\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+\kappa_{ik}+b^{\prime}\pm O(1)
      =\displaystyle=~{} 𝒂^2𝒘^T𝒈k±O(1).\displaystyle\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1).
    • If κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b<0\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime}<0, then we have that,

      LeakyRelu(κik+𝒂^1𝒘^T𝒈i+𝒂^2𝒘^T𝒈k+b)𝒂^1𝒘^T𝒈i±O(1)\displaystyle\mathrm{LeakyRelu}(\kappa_{ik}+\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+b^{\prime})-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} βκik+β𝒂^1𝒘^T𝒈i+β𝒂^2𝒘^T𝒈k+βb𝒂^1𝒘^T𝒈i±O(1)\displaystyle\beta\kappa_{ik}+\beta\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}+\beta\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}+\beta b^{\prime}-\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      =\displaystyle=~{} β𝒂^2𝒘^T𝒈k(1β)𝒂^1𝒘^T𝒈i±O(1)\displaystyle\beta\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}-(1-\beta)\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\pm O(1)
      \displaystyle\leq~{} β𝒂^2𝒘^T𝒈k±O(1),\displaystyle\beta\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\pm O(1),

      where β\beta is the slope of LeakyRelu()\mathrm{LeakyRelu}(\cdot).

Combining the two cases regarding the magnitude of 𝒂^1𝒘^T𝒈i\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i} and our assumption that \sigma,R,R^{\prime}=O(1), we have shown so far that, for any ii such that 𝒂^1𝒘^T𝒈i0\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}\geq 0 and for all jJi,0j\in J_{i,0}, we have

ΔikΔij={Θ(𝒂^2𝒘^T𝒈k)±O(1),ifkJi,0Ji,1O(𝒂^2𝒘^T𝒈k)±O(1),otherwise.\Delta_{ik}-\Delta_{ij}=\left\{\begin{array}[]{ll}\Theta(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1),&\mbox{if}\ k\in J_{i,0}\cup J_{i,1}\\ O(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1),&\mbox{otherwise}.\end{array}\right. (16)

By following a similar argument, one can show that Equation 16 holds for any ii such that 𝒂^1𝒘^T𝒈i<0\hat{\boldsymbol{a}}_{1}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{i}<0.
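
The following short numerical sketch illustrates the envelope in Equation 16 (all parameter values are illustrative choices within the boundedness assumptions, and the scalar projections are sampled directly as stand-ins): by the 1-Lipschitzness of LeakyRelu, the differences Δ_ik - Δ_ij always stay within a linear band C_1|â_2ŵᵀg_k| + C_2 with absolute constants C_1, C_2.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bounded parameters (sigma, R, Rp, bprime, beta, K are illustrative choices).
sigma, R, Rp, bprime, beta, K = 1.0, 1.0, 1.0, 0.3, 0.2, 1.0

def leaky(x):
    return np.where(x >= 0.0, x, beta * x)

kappa_intra, kappa_inter = -2 * K, 0.0           # kappa_il for i,l in the same class / different classes
ti = abs(rng.standard_normal())                  # a1_hat * w_hat^T g_i, taken >= 0 as in the text
tj = rng.uniform(-np.sqrt(10), np.sqrt(10))      # j in J_{i,0}: |a2_hat * w_hat^T g_j| <= sqrt(10)
tk = rng.standard_normal(100_000)                # a2_hat * w_hat^T g_k for generic neighbours k

def Delta(kap, t):
    return sigma * R * Rp * leaky(kap + ti + t + bprime)

C1, C2 = sigma * R * Rp, sigma * R * Rp * (2 * K + np.sqrt(10))
for kap in (kappa_intra, kappa_inter):
    diff = Delta(kap, tk) - Delta(kappa_intra, tj)      # Delta_ik - Delta_ij
    assert np.all(np.abs(diff) <= C1 * np.abs(tk) + C2 + 1e-12)
print("differences stay within the band C1*|t_k| + C2")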

Let us now compute

kNiexp(ΔikΔij)=kNiC0exp(ΔikΔij)+kNiC1exp(ΔikΔij)\sum_{k\in N_{i}}\exp(\Delta_{ik}-\Delta_{ij})=\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij})+\sum_{k\in N_{i}\cap C_{1}}\exp(\Delta_{ik}-\Delta_{ij})

for some jJi,0j\in J_{i,0}. Let us focus on kNiC0exp(ΔikΔij)\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij}) first. We will show that Ω(|NiC0|)kNiC0exp(ΔikΔij)O(|Ni|)\Omega(|N_{i}\cap C_{0}|)\leq\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij})\leq O(|N_{i}|).

First of all, we have that

kNiC0exp(ΔikΔij)kJi,0exp(ΔikΔij)=kJi,0exp(Θ(𝒂^2𝒘^T𝒈k)±O(1))kJi,0ec1=|Ji,0|ec1=Ω(|NiC0|),\begin{split}\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij})&\geq\sum_{k\in J_{i,0}}\exp(\Delta_{ik}-\Delta_{ij})=\sum_{k\in J_{i,0}}\exp\left(\Theta(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right)\\ &\geq\sum_{k\in J_{i,0}}e^{c_{1}}=|J_{i,0}|e^{c_{1}}=\Omega(|N_{i}\cap C_{0}|),\end{split} (17)

where c1c_{1} is an absolute constant (possibly negative). On the other hand, consider the following partition of NiC0N_{i}\cap C_{0}:

P1\displaystyle P_{1} =def{kNiC0|𝒂^2𝒘^T𝒈k1},\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\{k\in N_{i}\cap C_{0}~{}|~{}\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\leq 1\},
P2\displaystyle P_{2} =def{kNiC0|𝒂^2𝒘^T𝒈k1}.\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\{k\in N_{i}\cap C_{0}~{}|~{}\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\geq 1\}.

It is easy to see that

kP1exp(ΔikΔij)kP1exp(O(𝒂^2𝒘^T𝒈k)±O(1))kP1ec2=|P1|ec2=O(|NiC0|),\sum_{k\in P_{1}}\exp(\Delta_{ik}-\Delta_{ij})\leq\sum_{k\in P_{1}}\exp\left(O(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right)\leq\sum_{k\in P_{1}}e^{c_{2}}=|P_{1}|e^{c_{2}}=O(|N_{i}\cap C_{0}|), (18)

where c2c_{2} is an absolute constant. Moreover, because i𝒜i\in\mathcal{A} we have that P2t[T]Bi,0tP_{2}\subseteq\bigcup_{t\in[T]}B_{i,0}^{t}. It follows that

kP2exp(ΔikΔij)=t[T]kBi,0texp(ΔikΔij)t[T]kBi,0texp(O(𝒂^2𝒘^T𝒈k)±O(1))t[T]|Bi,0t|ec32t,\begin{split}\sum_{k\in P_{2}}\exp(\Delta_{ik}-\Delta_{ij})&=\sum_{t\in[T]}\sum_{k\in B_{i,0}^{t}}\exp(\Delta_{ik}-\Delta_{ij})\\ &\leq\sum_{t\in[T]}\sum_{k\in B_{i,0}^{t}}\exp\left(O(\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k})\pm O(1)\right)\\ &\leq\sum_{t\in[T]}|B_{i,0}^{t}|e^{c_{3}2^{t}},\end{split} (19)

where c3c_{3} is an absolute constant. We can upper bound the above quantity as follows. Under the event 𝓔#\boldsymbol{\mathcal{E}}^{\#} (in particular, 𝓔4\boldsymbol{\mathcal{E}}_{4}^{\prime}), we have that

|Bi,0t|mt+T|NiC0|45,for allt[T],|B_{i,0}^{t}|\leq m_{t}+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}},\ \mbox{for all}\ t\in[T],

where

mt=def𝐄[|Bi,0t|]\displaystyle m_{t}\stackrel{{\scriptstyle\rm def}}{{=}}\mathop{{\bf E}\/}[|B_{i,0}^{t}|] =kNiC0𝐏𝐫(2t1𝒂^2𝒘^T𝒈k2t)kNiC0𝐏𝐫[𝒂^2𝒘^T𝒈k2t1]\displaystyle=\sum_{k\in N_{i}\cap C_{0}}\operatorname{{\bf Pr}}(2^{t-1}\leq\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\leq 2^{t})\leq\sum_{k\in N_{i}\cap C_{0}}\operatorname{{\bf Pr}}[\hat{\boldsymbol{a}}_{2}\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\geq 2^{t-1}]
kNiC0𝐏𝐫[𝒘^T𝒈k2t1]|NiC0|e22t3.\displaystyle\leq\sum_{k\in N_{i}\cap C_{0}}\operatorname{{\bf Pr}}[\hat{\boldsymbol{w}}^{T}\boldsymbol{g}_{k}\geq 2^{t-1}]\leq|N_{i}\cap C_{0}|e^{-2^{2t-3}}.

It follows that

t[T]|Bi,0t|ec32tt[T](|NiC0|e22t3+T|NiC0|45)ec32t|NiC0|t=1e22t3ec32t+t[T]T|NiC0|45ec32Tc4|NiC0|+o(|Ni|)O(|Ni|),\begin{split}\sum_{t\in[T]}|B_{i,0}^{t}|e^{c_{3}2^{t}}&\leq\sum_{t\in[T]}\left(|N_{i}\cap C_{0}|e^{-2^{2t-3}}+\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}}\right)e^{c_{3}2^{t}}\\ &\leq|N_{i}\cap C_{0}|\sum_{t=1}^{\infty}e^{-2^{2t-3}}e^{c_{3}2^{t}}+\sum_{t\in[T]}\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}}e^{c_{3}2^{T}}\\ &\leq c_{4}|N_{i}\cap C_{0}|+o(|N_{i}|)\\ &\leq O(|N_{i}|),\end{split} (20)

where c4c_{4} is an absolute constant. The third inequality in the above follows from

  • The series t=1e22t3ec32t\sum_{t=1}^{\infty}e^{-2^{2t-3}}e^{c_{3}2^{t}} converges absolutely for any constant c3c_{3};

  • The sum t[T]T|NiC0|45ec32T=T32|NiC0|45ec32T=o(|Ni|)\sum_{t\in[T]}\sqrt{T}|N_{i}\cap C_{0}|^{\frac{4}{5}}e^{c_{3}2^{T}}=T^{\frac{3}{2}}|N_{i}\cap C_{0}|^{\frac{4}{5}}e^{c_{3}2^{T}}=o(|N_{i}|) because

    log(T32ec32T)\displaystyle\log\left(T^{\frac{3}{2}}e^{c_{3}2^{T}}\right) =32loglog2(10log(n(p+q)))+c32log2(10log(n(p+q)))\displaystyle=\frac{3}{2}\log\left\lceil\log_{2}\left(10\sqrt{\log(n(p+q))}\right)\right\rceil+c_{3}2^{\left\lceil\log_{2}\left(10\sqrt{\log(n(p+q))}\right)\right\rceil}
    32loglog2(10log(n(p+q)))+20c3log(n(p+q))\displaystyle\leq\frac{3}{2}\log\left\lceil\log_{2}\left(10\sqrt{\log(n(p+q))}\right)\right\rceil+20c_{3}\sqrt{\log(n(p+q))}
    O(1clog(n(p+q))),\displaystyle\leq O\left(\frac{1}{c}\log(n(p+q))\right),

    for any c>0c>0. In particular, by picking c>5c>5 we see that T32ec32TO((n(p+q))1c)o(|Ni|15)T^{\frac{3}{2}}e^{c_{3}2^{T}}\leq O((n(p+q))^{\frac{1}{c}})\leq o(|N_{i}|^{\frac{1}{5}}), and hence we get T32ec32T|NiC0|45|Ni|45o(|Ni|15)=o(|Ni|)T^{\frac{3}{2}}e^{c_{3}2^{T}}|N_{i}\cap C_{0}|^{\frac{4}{5}}\leq|N_{i}|^{\frac{4}{5}}\cdot o(|N_{i}|^{\frac{1}{5}})=o(|N_{i}|).

Combining Equations 19 and 20 we get

kP2exp(ΔikΔij)O(|Ni|),\sum_{k\in P_{2}}\exp(\Delta_{ik}-\Delta_{ij})\leq O(|N_{i}|), (21)

and combining Equations 18 and 21 we get

\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij})=\sum_{k\in P_{1}}\exp(\Delta_{ik}-\Delta_{ij})+\sum_{k\in P_{2}}\exp(\Delta_{ik}-\Delta_{ij})\leq O(|N_{i}|). (22)

Now, by Equations 17 and 22 we get

Ω(|NiC0|)kNiC0exp(ΔikΔij)O(|Ni|).\Omega(|N_{i}\cap C_{0}|)\leq\sum_{k\in N_{i}\cap C_{0}}\exp(\Delta_{ik}-\Delta_{ij})\leq O(|N_{i}|). (23)

It turns out that repeating the same argument for kNiC1exp(ΔikΔij)\sum_{k\in N_{i}\cap C_{1}}\exp(\Delta_{ik}-\Delta_{ij}) yields

Ω(|NiC1|)kNiC1exp(ΔikΔij)O(|Ni|).\Omega(|N_{i}\cap C_{1}|)\leq\sum_{k\in N_{i}\cap C_{1}}\exp(\Delta_{ik}-\Delta_{ij})\leq O(|N_{i}|). (24)

Finally, Equations 23 and 24 give us

kNiexp(ΔikΔij)=Θ(|Ni|),\sum_{k\in N_{i}}\exp(\Delta_{ik}-\Delta_{ij})=\Theta(|N_{i}|),

which readily implies

γij=1kNiexp(ΔikΔij)=Θ(1/|Ni|)\gamma_{ij}=\frac{1}{\sum_{k\in N_{i}}\exp(\Delta_{ik}-\Delta_{ij})}=\Theta(1/|N_{i}|)

as required. We have shown that for all i𝒜i\in\mathcal{A} and for all jJi,0j\in J_{i,0}, γij=Θ(1/|Ni|)\gamma_{ij}=\Theta(1/|N_{i}|). Repeating the same argument, the same result holds for all i𝒜i\in\mathcal{A} and for all jJi,1j\in J_{i,1}. Hence, by Claims 20 and 21 on the cardinalities of 𝒜\mathcal{A}, Ji,0J_{i,0} and Ji,1J_{i,1}, we have proved Theorem 10.
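
To illustrate the conclusion of Theorem 10, the following Python sketch builds a small CSBM-style instance and mirrors the attention coefficients displayed above with bounded, illustrative parameter values; it then reports the distribution of γ_ij·|N_i| over edges, which stays within constant factors of 1 for the bulk of the neighbours, as the theorem predicts.

import numpy as np

rng = np.random.default_rng(1)
n, p, q, sigma, d = 2_000, 0.05, 0.03, 1.0, 5        # illustrative CSBM parameters
eps = (np.arange(n) >= n // 2).astype(float)         # first half C0, second half C1
mu = np.zeros(d); mu[0] = 1.0 * sigma                # ||mu|| <= K*sigma with K = 1
X = (2 * eps - 1)[:, None] * mu + sigma * rng.standard_normal((n, d))
P = np.where(eps[:, None] == eps[None, :], p, q)
A = rng.random((n, n)) < P
A = np.triu(A, 1); A = A | A.T
np.fill_diagonal(A, True)                            # include self-loops so that i is in N_i
w = rng.standard_normal(d); w /= np.linalg.norm(w)   # bounded w (R = 1)
a1, a2, b, beta = 0.7, -0.7, 0.1, 0.2                # bounded attention parameters
s = X @ w
ratios = []
for i in range(n):
    nbrs = np.flatnonzero(A[i])
    z = a1 * s[i] + a2 * s[nbrs] + b
    e = np.where(z >= 0, z, beta * z)                # LeakyRelu scores
    g = np.exp(e - e.max()); g /= g.sum()            # attention coefficients gamma_ij
    ratios.append(g * len(nbrs))
ratios = np.concatenate(ratios)
print(np.quantile(ratios, [0.05, 0.5, 0.95]))        # concentrated around 1, within constant factors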

A.7 Proof of Proposition 13

We restate Proposition 13 for convenience.

Proposition 23.

Suppose that p,qp,q satisfy Assumption 1 and that p,qp,q are bounded away from 1. For every ϵ>0\epsilon>0, there are absolute constants M,M=O(ϵ)M,M^{\prime}=O(\sqrt{\epsilon}) such that, with probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), using the graph attention convolution in (2) and the attention architecture Ψ~\tilde{\Psi} in (13), the model misclassifies at least Ω(n1ϵ)\Omega(n^{1-\epsilon}) nodes for any 𝐰\boldsymbol{w} such that 𝐰=1\|\boldsymbol{w}\|=1, if

  1. 1.

    t=O(1)t=O(1) and 𝝁Mσlognn(p+q)(1max(p,q))p+q|pq|\|\boldsymbol{\mu}\|\leq M\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\frac{p+q}{|p-q|};

  2. 2.

    t=ω(1)t=\omega(1) and 𝝁Mσlognn(p+q)(1max(p,q))\|\boldsymbol{\mu}\|\leq M^{\prime}\sigma\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}.

We start with part 1 of the proposition. Let us assume that pqp\geq q. The result when p<qp<q follows analogously. We will condition on the events 𝓔1,𝓔2,𝓔3\boldsymbol{\mathcal{E}}_{1},\boldsymbol{\mathcal{E}}_{2},\boldsymbol{\mathcal{E}}_{3} defined in Definition 17. These events are concerned with the concentration of the class sizes |C0||C_{0}| and |C1||C_{1}| and the concentration of the number of intra-class and inter-class edges, i.e., |C0Ni||C_{0}\cap N_{i}| and |C1Ni||C_{1}\cap N_{i}| for all ii. By Lemma 18, the probability that these events hold simultaneously is at least 1o(1)1-o(1). Fix any 𝒘d\boldsymbol{w}\in\mathbb{R}^{d} such that 𝒘=1\|\boldsymbol{w}\|=1. Without loss of generality, assume that 𝒘T𝝁>0\boldsymbol{w}^{T}\boldsymbol{\mu}>0. Because t=O(1)t=O(1), by the definition of Ψ~\tilde{\Psi} in (13) and the attention coefficients in (1) we have that

γij={c1n(p+q)(1±o(1)),if (i,j) is an intra-class edge,c2n(p+q)(1±o(1)),if (i,j) is an inter-class edge,\gamma_{ij}=\left\{\begin{array}[]{ll}\frac{c_{1}}{n(p+q)}(1\pm o(1)),&\mbox{if $(i,j)$ is an intra-class edge},\\ \frac{c_{2}}{n(p+q)}(1\pm o(1)),&\mbox{if $(i,j)$ is an inter-class edge},\end{array}\right. (25)

for some positive constants c11c_{1}\geq 1 and c21c_{2}\leq 1. Let us write 𝐗i=(2ϵi1)𝝁+σ𝒈i\mathbf{X}_{i}=(2{\epsilon}_{i}-1)\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i} where 𝒈iN(0,𝐈)\boldsymbol{g}_{i}\sim N(0,\mathbf{I}), ϵi=0{\epsilon}_{i}=0 if iC0i\in C_{0} and ϵi=1{\epsilon}_{i}=1 if iC1i\in C_{1}. Let us consider the classification of nodes in C0C_{0} based on the model output jNiγij𝒘TXj\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}X_{j} for node ii. Note that, depending on whether iC0i\in C_{0} or iC1i\in C_{1}, the expectation of the model output is symmetric around 0, and therefore we consider the decision boundary at 0. Let 0<ϵ<10<\epsilon<1 and fix any partition of C0C_{0} into a set of disjoint subsets,

C0=h=1C0(h)C_{0}=\bigcup_{h=1}^{\ell}C_{0}^{(h)}

such that =|C0|1ϵ\ell=|C_{0}|^{1-\epsilon} and |C0(h)|=|C0|ϵ|C_{0}^{(h)}|=|C_{0}|^{\epsilon} for every h=1,2,,h=1,2,\ldots,\ell. In what follows we will consider the classification of nodes in each C0(h)C_{0}^{(h)} separately. We will show that, with probability at least 1o(1)1-o(1), for each one of more than half of the subsets C0(h)C_{0}^{(h)} where h=1,2,,h=1,2,\ldots,\ell, the model is going to misclassify at least one node, and consequently, the model is going to misclassify at least /2=|C0|1ϵ/2=Ω(n1ϵ)\ell/2=|C_{0}|^{1-\epsilon}/2=\Omega(n^{1-\epsilon}) nodes, giving the required result on misclassification rate.

Fix any h{1,2,,}h^{\prime}\in\{1,2,\ldots,\ell\}. Using (25) we get that, for large enough nn, the event that the model correctly classifies all nodes in C_{0}^{(h^{\prime})} satisfies

{maxiC0(h)jNiγij𝒘T𝐗j<0}={maxiC0(h)(jNiC1γijjNiC0γij)𝒘T𝝁+σjNiγij𝒘T𝒈j<0}{c3(qpp+q)𝒘T𝝁+σmaxiC0(h)jNiγij𝒘T𝒈j<0}\begin{split}&\Bigg{\{}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\mathbf{X}_{j}<0\Bigg{\}}\\ =~{}&\Bigg{\{}\max_{i\in C_{0}^{(h^{\prime})}}\bigg{(}\sum_{j\in N_{i}\cap C_{1}}\gamma_{ij}-\sum_{j\in N_{i}\cap C_{0}}\gamma_{ij}\bigg{)}\boldsymbol{w}^{T}\boldsymbol{\mu}+\sigma\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<0\Bigg{\}}\\ \subseteq~{}&\Bigg{\{}c_{3}\left(\frac{q-p}{p+q}\right)\boldsymbol{w}^{T}\boldsymbol{\mu}+\sigma\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<0\Bigg{\}}\end{split}

for some absolute constant c3>0c_{3}>0, and hence the probability that the model correctly classifies all nodes in C0(h)C_{0}^{(h^{\prime})} satisfies, for large enough nn,

\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\mathbf{X}_{j}<0\Bigg{)}\leq\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<c_{3}\left(\frac{p-q}{p+q}\right)\frac{|\boldsymbol{w}^{T}\boldsymbol{\mu}|}{\sigma}\Bigg{)}
\leq\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<\tilde{M}\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\Bigg{)}

where the last inequality follows from our assumption on 𝝁\|\boldsymbol{\mu}\| and we denote M~=defc3M>0\tilde{M}\stackrel{{\scriptstyle\rm def}}{{=}}c_{3}M>0. Now we will use Sudakov’s minoration inequality [45] to obtain a lower bound on the expected maximum, and then apply Borell’s inequality to upper bound the above probability. In order to apply Sudakov’s result we will need to define a metric over the node index set [n][n]. Let \boldsymbol{z}_{i}\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}. For i,jC0i,j\in C_{0}, iji\neq j, consider the canonical metric d(i,j)d_{\circ}(i,j) given by

d(i,j)2\displaystyle d_{\circ}(i,j)^{2} =def𝐄[(𝒛i𝒛j)2]\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}}\mathop{{\bf E}\/}[(\boldsymbol{z}_{i}-\boldsymbol{z}_{j})^{2}]
=kNiγik2+kNjγjk22kNiNjγikγjk\displaystyle=\sum_{k\in N_{i}}\gamma_{ik}^{2}+\sum_{k\in N_{j}}\gamma_{jk}^{2}-2\sum_{k\in N_{i}\cap N_{j}}\gamma_{ik}\gamma_{jk}
c4kJij1n2(p+q)2\displaystyle\geq c_{4}\sum_{k\in J_{ij}}\frac{1}{n^{2}(p+q)^{2}}
=c4|Jij|n2(p+q)2,\displaystyle=\frac{c_{4}|J_{ij}|}{n^{2}(p+q)^{2}},

where Jij=def(NiNj)\(NiNj)J_{ij}\stackrel{{\scriptstyle\rm def}}{{=}}(N_{i}\cup N_{j})\backslash(N_{i}\cap N_{j}) is the symmetric difference of the neighbors of ii and jj, c4>0c_{4}>0 is an absolute constant, and the inequality is due to (25). We lower bound |Jij||J_{ij}| as follows. For i,jC0i,j\in C_{0}, iji\neq j, and a node k[n]k\in[n], the probability that kk is a neighbor of exactly one of ii and jj is 2p(1p)2p(1-p) if kC0k\in C_{0} and 2q(1q)2q(1-q) if kC1k\in C_{1}. Therefore we have 𝐄[|Jij|]=n(p(1p)+q(1q))\mathop{{\bf E}\/}[|J_{ij}|]=n(p(1-p)+q(1-q)). It follows from the multiplicative Chernoff bound that for any 0<δ<10<\delta<1,

𝐏𝐫[|Jij|<𝐄[|Jij|](1δ)]exp(δ2𝐄[|Jij|]/3).\operatorname{{\bf Pr}}[|J_{ij}|<\mathop{{\bf E}\/}[|J_{ij}|](1-\delta)]\leq\exp(-\delta^{2}\mathop{{\bf E}\/}[|J_{ij}|]/3).

Choose

δ=3logn𝐄[|Jij|]=3lognn(p(1p)+q(1q))=o(1),\delta=3\sqrt{\frac{\log n}{\mathop{{\bf E}\/}[|J_{ij}|]}}=3\sqrt{\frac{\log n}{n(p(1-p)+q(1-q))}}=o(1),

where the last equality follows from n(p(1p)+q(1q))=Ω(log2n)n(p(1-p)+q(1-q))=\Omega(\log^{2}n), which is in turn due to the assumptions that p,q=Ω(log2n/n)p,q=\Omega(\log^{2}n/n) and p,qp,q are bounded away from 1. Applying a union bound over all i,jC0i,j\in C_{0}, we get that with probability at least 1o(1)1-o(1), the size of JijJ_{ij} satisfies

|Jij|n(p(1p)+q(1q))(1o(1)),i,jC0.|J_{ij}|\geq n(p(1-p)+q(1-q))(1-o(1)),\forall i,j\in C_{0}. (26)

Therefore it follows that, for large enough nn,

d(i,j)c4|Jij|n2(p+q)2=c4n(p(1p)+q(1q))(1o(1))n2(p+q)2=Ω(1max(p,q)n(p+q)).d_{\circ}(i,j)\geq\sqrt{\frac{c_{4}|J_{ij}|}{n^{2}(p+q)^{2}}}=\sqrt{\frac{c_{4}n(p(1-p)+q(1-q))(1-o(1))}{n^{2}(p+q)^{2}}}=\Omega\left(\sqrt{\frac{1-\max(p,q)}{n(p+q)}}\right).

We condition on the event that the inequality (26) holds for all i,jC0i,j\in C_{0}, which happens with probability at least 1o(1)1-o(1). Applying Sudakov’s minoration with the metric d(i,j)d_{\circ}(i,j), we get that, for large enough nn and for all h=1,2,,h=1,2,\ldots,\ell,

𝐄[maxiC0(h)jNiγij𝒘T𝒈j]c5d(i,j)log|C0(h)|c6ϵlognn(p+q)(1max(p,q))\mathop{{\bf E}\/}\left[\max_{i\in C_{0}^{(h)}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}\right]\geq c_{5}d_{\circ}(i,j)\sqrt{\log|C_{0}^{(h)}|}\geq c_{6}\sqrt{\epsilon}\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))} (27)

for some absolute constants c5,c6>0c_{5},c_{6}>0. The last inequality in the above follows from |C0(h)|=|C0|ϵ=Ω(nϵ)|C_{0}^{(h)}|=|C_{0}|^{\epsilon}=\Omega(n^{\epsilon}). In addition, note that since by assumption Ψ\Psi is independent from the node features, using (25) we have that jNiγij𝒘T𝒈j\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j} is Gaussian with variance O(1n(p+q))O(\frac{1}{n(p+q)}). We use Borell’s inequality ([2] Chapter 2) to get that for any t>0t>0 and large enough nn,

𝐏𝐫[maxiC0(h)jNiγij𝒘T𝒈j<𝐄[maxiC0(h)jNiγij𝒘T𝒈j]t]exp(c7t2n(p+q)).\operatorname{{\bf Pr}}\left[\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<\mathop{{\bf E}\/}\left[\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}\right]-t\right]\leq\exp(-c_{7}t^{2}n(p+q)).

for some absolute constant c7>0c_{7}>0. By the lower bound of the expectation (27), we get that

𝐏𝐫[maxiC0(h)jNiγij𝒘T𝒈j<c6ϵlognn(p+q)(1max(p,q))t]exp(c7t2n(p+q)).\operatorname{{\bf Pr}}\left[\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<c_{6}\sqrt{\epsilon}\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}-t\right]\leq\exp(-c_{7}t^{2}n(p+q)).

Now, let M>0M>0 be any constant that also satisfies M<c6ϵ/c3M<c_{6}\sqrt{\epsilon}/c_{3}. Recall that we defined M~=c3M\tilde{M}=c_{3}M, and hence M~<c6ϵ\tilde{M}<c_{6}\sqrt{\epsilon}. Set

t=(c6ϵM~)lognn(p+q)(1max(p,q))=Ω(lognn(p+q)(1max(p,q))),t=(c_{6}\sqrt{\epsilon}-\tilde{M})\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}=\Omega\left(\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\right), (28)

then, combining with the events on which we have conditioned so far, we get

𝐏𝐫[maxiC0(h)jNiγij𝒘T𝒈jM~lognn(p+q)(1max(p,q))]=o(1).\operatorname{{\bf Pr}}\left[\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}\leq\tilde{M}\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\right]=o(1).

Recall that the above probability is the probability of correctly classifying all nodes in C0(h)C_{0}^{(h^{\prime})}. Since our choice of hh^{\prime} was arbitrary, this applies to every h{1,2,,}h\in\{1,2,\ldots,\ell\}. Let I0(h)I_{0}^{(h)} denote the indicator variable of the event that there is at least one node in C0(h)C_{0}^{(h)} that is misclassified; then

𝐄[h=1I0(h)]=𝐄[I0(1)]=(1o(1))=o().\mathop{{\bf E}\/}\left[\sum_{h=1}^{\ell}I_{0}^{(h)}\right]=\ell\cdot\mathop{{\bf E}\/}[I_{0}^{(1)}]=\ell\cdot(1-o(1))=\ell-o(\ell).

Apply the reverse Markov inequality, we get

𝐏𝐫[h=1I0(h)12]𝐄[h=1I0(h)]12=o()12=o(1).\operatorname{{\bf Pr}}\left[\sum_{h=1}^{\ell}I_{0}^{(h)}\leq\frac{1}{2}\ell\right]\leq\frac{\ell-\mathop{{\bf E}\/}\left[\sum_{h=1}^{\ell}I_{0}^{(h)}\right]}{\ell-\frac{1}{2}\ell}=\frac{o(\ell)}{\frac{1}{2}\ell}=o(1).

Therefore, with probability at least 1o(1)1-o(1), we have h=1I0(h)12=Ω(n1ϵ)\sum_{h=1}^{\ell}I_{0}^{(h)}\geq\frac{1}{2}\ell=\Omega(n^{1-\epsilon}). This implies the required result that the model misclassifies at least Ω(n1ϵ)\Omega(n^{1-\epsilon}) nodes.
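
To illustrate this negative result numerically, the following Python sketch (all constants are illustrative, w = μ/‖μ‖ is taken as a representative unit direction, and uniform averaging over neighbours is used as a proxy for the near-uniform attention coefficients in (25)) samples CSBM instances with ‖μ‖ of the order of the threshold in part 1 and counts the nodes misclassified by the sign of the convolved output; the count stays bounded away from zero and grows with n.

import numpy as np

rng = np.random.default_rng(2)
sigma, M = 1.0, 0.5                                  # M is an illustrative constant
for n in (1_000, 2_000, 4_000):
    p, q = 8 * np.log(n) ** 2 / n, 4 * np.log(n) ** 2 / n
    mu_norm = M * sigma * np.sqrt(np.log(n) / (n * (p + q)) * (1 - max(p, q))) * (p + q) / abs(p - q)
    eps = (np.arange(n) >= n // 2).astype(float)
    z = (2 * eps - 1) * mu_norm + sigma * rng.standard_normal(n)   # w^T X_i with w = mu / ||mu||
    P = np.where(eps[:, None] == eps[None, :], p, q)
    A = rng.random((n, n)) < P
    A = np.triu(A, 1); A = A | A.T
    np.fill_diagonal(A, True)
    out = (A @ z) / A.sum(axis=1)                    # uniform graph convolution as a proxy for (25)
    print(n, int(np.sum((out > 0) != (eps > 0))))    # number of misclassified nodes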

The proof of part 2 is similar to the proof of part 1. Let us assume that pqp\geq q since the result when p<qp<q can be proved analogously. We condition on the events 𝓔1,𝓔2,𝓔3\boldsymbol{\mathcal{E}}_{1},\boldsymbol{\mathcal{E}}_{2},\boldsymbol{\mathcal{E}}_{3} defined in Definition 17, which simultaneously hold with probability at least 1o(1)1-o(1) by Lemma 18. Fix any 𝒘d\boldsymbol{w}\in\mathbb{R}^{d} such that 𝒘=1\|\boldsymbol{w}\|=1. Because t=ω(1)t=\omega(1), by the definition of Ψ~\tilde{\Psi} in (13) and the attention coefficients in (1) we have that

γij={2np(1±o(1)),if (i,j) is an intra-class edge,o(1n(p+q)),if (i,j) is an inter-class edge.\gamma_{ij}=\left\{\begin{array}[]{ll}\frac{2}{np}(1\pm o(1)),&\mbox{if $(i,j)$ is an intra-class edge},\\ o(\frac{1}{n(p+q)}),&\mbox{if $(i,j)$ is an inter-class edge}.\end{array}\right. (29)

Write 𝐗i=(2ϵi1)𝝁+σ𝒈i\mathbf{X}_{i}=(2{\epsilon}_{i}-1)\boldsymbol{\mu}+\sigma\boldsymbol{g}_{i} where 𝒈iN(0,𝐈)\boldsymbol{g}_{i}\sim N(0,\mathbf{I}), ϵi=0{\epsilon}_{i}=0 if iC0i\in C_{0} and ϵi=1{\epsilon}_{i}=1 if iC1i\in C_{1}. We consider the classification of nodes in C0C_{0} based on the model output jNiγij𝒘TXj\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}X_{j} for node ii and the decision boundary at 0. As before, let 0<ϵ<10<\epsilon<1 and fix any partition of C0C_{0} into a set of disjoint subsets C0=h=1C0(h)C_{0}=\bigcup_{h=1}^{\ell}C_{0}^{(h)} such that =|C0|1ϵ\ell=|C_{0}|^{1-\epsilon} and |C0(h)|=|C0|ϵ|C_{0}^{(h)}|=|C_{0}|^{\epsilon} for every h=1,2,,h=1,2,\ldots,\ell. We proceed to show that, with high probability, for each one of more than half of the subsets C0(h)C_{0}^{(h)} where h=1,2,,h=1,2,\ldots,\ell, the model is going to misclassify at least one node, and consequently, the model is going to misclassify at least /2=Ω(n1ϵ)\ell/2=\Omega(n^{1-\epsilon}) nodes as required. Fix any h{1,2,,}h^{\prime}\in\{1,2,\ldots,\ell\}. Using (29) we get that, for large enough nn, the event that the model correctly classifies all nodes in C0(h)C_{0}^{(h^{\prime})} satisfies

{maxiC0(h)jNiγij𝒘T𝐗j<0}{c1𝒘T𝝁+σmaxiC0(h)jNiγij𝒘T𝒈j<0}\begin{split}\Bigg{\{}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\mathbf{X}_{j}<0\Bigg{\}}\subseteq\Bigg{\{}c_{1}\boldsymbol{w}^{T}\boldsymbol{\mu}+\sigma\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<0\Bigg{\}}\end{split}

for some absolute constant c1>0c_{1}>0, and hence the probability that the model correctly classifies all nodes in C_{0}^{(h^{\prime})} satisfies, for large enough nn,

\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\mathbf{X}_{j}<0\Bigg{)}\leq\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<c_{1}\frac{|\boldsymbol{w}^{T}\boldsymbol{\mu}|}{\sigma}\Bigg{)}
\leq\operatorname{{\bf Pr}}\Bigg{(}\max_{i\in C_{0}^{(h^{\prime})}}\sum_{j\in N_{i}}\gamma_{ij}\boldsymbol{w}^{T}\boldsymbol{g}_{j}<\tilde{M}\sqrt{\frac{\log n}{n(p+q)}(1-\max(p,q))}\Bigg{)}

where the last inequality follows from our assumption on 𝝁\|\boldsymbol{\mu}\| and we denote M~=defc1M>0\tilde{M}\stackrel{{\scriptstyle\rm def}}{{=}}c_{1}M^{\prime}>0. The rest of the proof of part 2 proceeds as the proof of part 1.

A.8 Proof of Theorem 14

We restate Theorem 14 for convenience. Recall that we write 𝝁=κσ\|\boldsymbol{\mu}\|=\kappa\sigma for some κ>0\kappa>0.

Theorem.

With probability at least 1o(1)1-o(1) over the data (𝐗,𝐀)CSBM(n,p,q,𝛍,σ2)(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2}), using the two-layer MLP attention architecture Ψ\Psi given in (3) and (4) with R=Ω(nlog2n/σ)R=\Omega(n\log^{2}n/\sigma), the graph attention convolution output satisfies

hi=jNiγij𝒘~T𝐗j>0if and only if𝒘~T𝐗i>0,i[n],\displaystyle h_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0\;\mbox{if and only if}\;\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0,\;\forall i\in[n],
hi=jNiγij𝒘~T𝐗j<0if and only if𝒘~T𝐗i<0,i[n].\displaystyle h_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<0\;\mbox{if and only if}\;\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}<0,\;\forall i\in[n].

We will condition on the following two events concerning the values of 𝒘~T𝐗i\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i} for i[n]i\in[n]. Both events hold with probability at least 1o(1)1-o(1). First, for ϵ>0\epsilon>0 consider the event

{𝒘~T𝐗i[ϵσmax{1,κ},ϵσmax{1,κ}]for alli[n]}.\left\{\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\not\in[-\epsilon\sigma\max\{1,\kappa\},\epsilon\sigma\max\{1,\kappa\}]\ \mbox{for all}\ i\in[n]\right\}.

Since 𝒘~=𝝁/𝝁\tilde{\boldsymbol{w}}=\boldsymbol{\mu}/\|\boldsymbol{\mu}\| and 𝝁=κσ\|\boldsymbol{\mu}\|=\kappa\sigma, each 𝒘~T𝐗i\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i} follows a normal distribution with mean (2\epsilon_{i}-1)\kappa\sigma and variance σ2\sigma^{2} (recall that ϵi{0,1}\epsilon_{i}\in\{0,1\} generates the node memberships in CSBM). Therefore, we have that

𝐏𝐫(ϵσmax{1,κ}𝒘~T𝐗iϵσmax{1,κ})\displaystyle\operatorname{{\bf Pr}}\left(-\epsilon\sigma\max\{1,\kappa\}\leq\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\leq\epsilon\sigma\max\{1,\kappa\}\right) =κϵmax{1,κ}κ+ϵmax{1,κ}12πex2/2𝑑x\displaystyle=\int_{\kappa-\epsilon\max\{1,\kappa\}}^{\kappa+\epsilon\max\{1,\kappa\}}\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}dx
{2πϵ,ifκ1,2πeκ2(1ϵ)2/2ϵκ,ifκ>1.\displaystyle\leq\left\{\begin{array}[]{ll}\sqrt{\frac{2}{\pi}}\epsilon,&\mbox{if}\ \kappa\leq 1,\\ \sqrt{\frac{2}{\pi}}e^{-\kappa^{2}(1-\epsilon)^{2}/2}\epsilon\kappa,&\mbox{if}\ \kappa>1.\end{array}\right.

Let us pick ϵ\epsilon such that

ϵ=1nlognifκ2logn,andϵ=14ifκ>2logn.\displaystyle\epsilon=\frac{1}{n\sqrt{\log n}}\;\ \mbox{if}\;\ \kappa\leq 2\sqrt{\log n},\quad\mbox{and}\quad\epsilon=\frac{1}{4}\;\ \mbox{if}\;\ \kappa>2\sqrt{\log n}.

This gives that

𝐏𝐫(𝒘~T𝐗i[ϵσmax{1,κ},ϵσmax{1,κ}]for alli[n])1no(1/n)=1o(1).\operatorname{{\bf Pr}}\left(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\not\in[-\epsilon\sigma\max\{1,\kappa\},\epsilon\sigma\max\{1,\kappa\}]\ \mbox{for all}\ i\in[n]\right)\geq 1-n\cdot o(1/n)=1-o(1).

Second, consider the event that |𝒘~T𝐗i𝐄[𝒘~T𝐗i]|10σlogn|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}-\mathop{{\bf E}\/}[\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}]|\leq 10\sigma\sqrt{\log n}. By Lemma 18 we know that it holds with probability at least 1o(1)1-o(1). Under these two events, we have that

ϵσmax{1,κ}𝒘~T𝐗iκσ+10σlogn\displaystyle\epsilon\sigma\max\{1,\kappa\}\leq\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\leq\kappa\sigma+10\sigma\sqrt{\log n}\quad if𝒘~T𝐗i>0,\displaystyle\mbox{if}\quad\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0,
κσ10σlogn𝒘~T𝐗iϵσmax{1,κ}\displaystyle-\kappa\sigma-10\sigma\sqrt{\log n}\leq\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\leq-\epsilon\sigma\max\{1,\kappa\}\quad if𝒘~T𝐗i<0.\displaystyle\mbox{if}\quad\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}<0.

Recall from (5) that

Ψ(𝐗i,𝐗j)={2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|,2R(1β)sign(𝒘~T𝐗i)𝒘~T𝐗j,if|𝒘~T𝐗i|<𝒘~T𝐗j<|𝒘~T𝐗i|,2R(1β)𝒘~T𝐗i,if𝒘~T𝐗j|𝒘~T𝐗i|.\Psi(\mathbf{X}_{i},\mathbf{X}_{j})=\left\{\begin{array}[]{ll}-2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\leq-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\operatorname{sign}(\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i})\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j},&\mbox{if}~{}-\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|<\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}<\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|,\\ \hskip 8.53581pt2R(1-\beta)\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i},&\mbox{if}~{}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\geq\left|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\right|.\end{array}\right.
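
A minimal Python sketch of this mechanism is given below (n, p, q, κ and β are illustrative, R is set to the order n log²n/σ required by the theorem, and the softmax is computed in a numerically stable way); it evaluates the piecewise Ψ above on a sampled instance and counts the nodes for which the sign of the convolved output disagrees with the sign of w̃ᵀX_i. With high probability this count is zero, in line with Theorem 14.

import numpy as np

rng = np.random.default_rng(3)
n, p, q, sigma, kappa, beta = 500, 0.10, 0.05, 1.0, 2.0, 0.2   # illustrative values
R = n * np.log(n) ** 2 / sigma                                 # R = Omega(n log^2 n / sigma)
eps = (np.arange(n) >= n // 2).astype(float)
s = (2 * eps - 1) * kappa * sigma + sigma * rng.standard_normal(n)   # s_i = w_tilde^T X_i
P = np.where(eps[:, None] == eps[None, :], p, q)
A = rng.random((n, n)) < P
A = np.triu(A, 1); A = A | A.T
np.fill_diagonal(A, True)

def Psi(si, sj):                                   # the piecewise attention recalled from (5)
    if sj <= -abs(si):
        return -2 * R * (1 - beta) * si
    if sj >= abs(si):
        return 2 * R * (1 - beta) * si
    return 2 * R * (1 - beta) * np.sign(si) * sj

mismatches = 0
for i in range(n):
    nbrs = np.flatnonzero(A[i])
    e = np.array([Psi(s[i], s[j]) for j in nbrs])
    g = np.exp(e - e.max()); g /= g.sum()          # stable softmax giving the coefficients gamma_ij
    h = g @ s[nbrs]
    mismatches += int(np.sign(h) != np.sign(s[i]))
print(mismatches)                                  # expected to be 0 with high probability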

Consider an arbitrary node ii where 𝒘~T𝐗i>0\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0. We will show that h_{i}^{\prime}=\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0. By the definition of the attention coefficients in (1), we know that

jNiγij𝒘~T𝐗j>0jNiexp(Ψ(𝐗i,𝐗j))𝒘~T𝐗j>0.\sum_{j\in N_{i}}\gamma_{ij}\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0\iff\sum_{j\in N_{i}}\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0.

Using the above expression for Ψ\Psi, we have that

jNiexp(Ψ(𝐗i,𝐗j))𝒘~T𝐗j=exp(Ψ(𝐗i,𝐗i))𝒘~T𝐗i+jNijiexp(Ψ(𝐗i,𝐗j))𝒘~T𝐗jexp(Ψ(𝐗i,𝐗i))𝒘~T𝐗i(n1)exp(Ψ(𝐗i,𝐗i))(κσ+10σlogn)e2R(1β)ϵσmax{1,κ}ϵσmax{1,κ}(n1)e2R(1β)ϵσmax{1,κ}(κσ+10σlogn),\begin{split}&\sum_{j\in N_{i}}\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\\ =~{}&\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{i}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}+\sum_{\begin{subarray}{c}j\in N_{i}\\ j\neq i\end{subarray}}\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}\\ \geq~{}&\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{i}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}-(n-1)\exp(\Psi(\mathbf{X}_{i},-\mathbf{X}_{i}))(\kappa\sigma+10\sigma\sqrt{\log n})\\ \geq~{}&e^{2R(1-\beta)\epsilon\sigma\max\{1,\kappa\}}\epsilon\sigma\max\{1,\kappa\}-(n-1)e^{-2R(1-\beta)\epsilon\sigma\max\{1,\kappa\}}(\kappa\sigma+10\sigma\sqrt{\log n}),\end{split} (30)

where the second-to-last inequality follows from the event that $|\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}|\leq\kappa\sigma+10\sigma\sqrt{\log n}$ for all $j$ together with $|N_{i}|\leq n$, and the last inequality follows from the fact that the function

\[f(x)=xe^{2R(1-\beta)x}-(n-1)(\kappa\sigma+10\sigma\sqrt{\log n})e^{-2R(1-\beta)x}\]

is increasing in $x$ for $x\geq 0$; indeed, $f^{\prime}(x)=\bigl(1+2R(1-\beta)x\bigr)e^{2R(1-\beta)x}+2R(1-\beta)(n-1)(\kappa\sigma+10\sigma\sqrt{\log n})e^{-2R(1-\beta)x}>0$ on this range. To see that $h_{i}^{\prime}=\sum_{j\in N_{i}}\exp(\Psi(\mathbf{X}_{i},\mathbf{X}_{j}))\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{j}>0$, consider the following two cases separately.

  • If $\kappa\leq 2\sqrt{\log n}$, recall that we picked $\epsilon=1/(n\sqrt{\log n})$. Since $R=\Omega(n\log^{2}n/\sigma)$, for large enough $n$ we have that

    \[\begin{aligned}
    &4R(1-\beta)\sigma>\frac{n\sqrt{\log n}}{\max\{1,\kappa\}}\left(\log n+\log\left(\frac{n\sqrt{\log n}}{\max\{1,\kappa\}}\right)+\log(\kappa+10\sqrt{\log n})\right)\\
    \iff\;&4R(1-\beta)\sigma>\frac{1}{\epsilon\max\{1,\kappa\}}\left(\log n+\log\left(\frac{1}{\epsilon\max\{1,\kappa\}}\right)+\log(\kappa+10\sqrt{\log n})\right)\\
    \iff\;&e^{4R(1-\beta)\sigma\epsilon\max\{1,\kappa\}}>\frac{n(\kappa+10\sqrt{\log n})}{\epsilon\max\{1,\kappa\}}\\
    \iff\;&e^{2R(1-\beta)\sigma\epsilon\max\{1,\kappa\}}\sigma\epsilon\max\{1,\kappa\}>ne^{-2R(1-\beta)\sigma\epsilon\max\{1,\kappa\}}(\kappa\sigma+10\sigma\sqrt{\log n}).
    \end{aligned}\]

    By (30), this means that $h_{i}^{\prime}>0$.

  • If $\kappa>2\sqrt{\log n}$, then $\epsilon=1/4$. Since $R=\Omega(n\log^{2}n/\sigma)$, for large enough $n$ we have that

    \[\begin{aligned}
    &4R(1-\beta)\sigma>\frac{\log n+\log(\kappa+10\sqrt{\log n})-\log(\kappa/4)}{\kappa/4}\\
    \iff\;&4R(1-\beta)\sigma>\frac{\log n+\log(\kappa+10\sqrt{\log n})-\log(\epsilon\kappa)}{\epsilon\kappa}\\
    \iff\;&e^{4R(1-\beta)\sigma\epsilon\kappa}>\frac{n(\kappa+10\sqrt{\log n})}{\epsilon\kappa}\\
    \iff\;&e^{2R(1-\beta)\sigma\epsilon\kappa}\sigma\epsilon\kappa>ne^{-2R(1-\beta)\sigma\epsilon\kappa}(\kappa\sigma+10\sigma\sqrt{\log n}).
    \end{aligned}\]

    By (30), this means that $h_{i}^{\prime}>0$, since $\max\{1,\kappa\}=\kappa$ in this case.

This shows that $h_{i}^{\prime}>0$ whenever $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0$. A symmetric argument shows that $h_{i}^{\prime}<0$ whenever $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}<0$, which completes the proof; the inequalities used in the two cases above are also checked numerically in the sketch below.
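As a sanity check (not part of the argument), the following sketch evaluates, in log space to avoid overflow, both sides of the final inequality in each of the two cases above for one illustrative choice of parameters; the constants, including the value of $\beta$ and the constant hidden in $R=\Omega(n\log^{2}n/\sigma)$, are assumptions made purely for illustration.

\begin{verbatim}
import numpy as np

def inequality_gap(n, kappa, sigma, beta, R):
    # log(LHS) - log(RHS) of
    #   exp(2R(1-beta)*sigma*eps*max(1,kappa)) * sigma*eps*max(1,kappa)
    #     > n * exp(-2R(1-beta)*sigma*eps*max(1,kappa)) * (kappa*sigma + 10*sigma*sqrt(log n)),
    # with eps chosen as in the proof.  A positive gap means the inequality holds.
    eps = 1.0 / (n * np.sqrt(np.log(n))) if kappa <= 2 * np.sqrt(np.log(n)) else 0.25
    m = max(1.0, kappa)
    t = 2 * R * (1 - beta) * sigma * eps * m
    lhs_log = t + np.log(sigma * eps * m)
    rhs_log = np.log(n) - t + np.log(kappa * sigma + 10 * sigma * np.sqrt(np.log(n)))
    return lhs_log - rhs_log

n, sigma, beta = 1000, 1.0, 0.5         # illustrative values only
R = n * np.log(n) ** 2 / sigma          # one admissible R = Omega(n log^2 n / sigma)
print(inequality_gap(n, kappa=1.0, sigma=sigma, beta=beta, R=R) > 0)  # kappa <= 2 sqrt(log n)
print(inequality_gap(n, kappa=6.0, sigma=sigma, beta=beta, R=R) > 0)  # kappa  > 2 sqrt(log n)
\end{verbatim}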

A.9 Proof of Corollary 16

We restate Corollary 16 for convenience. Recall that we write $\|\boldsymbol{\mu}\|=\kappa\sigma$ for some $\kappa>0$.

Corollary.

With probability at least $1-o(1)$ over the data $(\mathbf{X},\mathbf{A})\sim\textsf{CSBM}(n,p,q,\boldsymbol{\mu},\sigma^{2})$, using the two-layer MLP attention architecture $\Psi$ given in (3) and (4) with $R=\Omega(n\log^{2}n/\sigma)$, one has that

  • (Perfect classification) If $\kappa\geq\sqrt{2\log n}$, then all nodes are correctly classified;

  • (Almost perfect classification) If $\kappa=\omega(1)$, then at least a $1-o(1)$ fraction of all nodes are correctly classified;

  • (Partial classification) If $\kappa=O(1)$, then at least a $\Phi(\kappa)-o(1)$ fraction of all nodes are correctly classified.

Fix $i\in[n]$ and write $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}=(2\epsilon_{i}-1)\|\boldsymbol{\mu}\|+(\sigma/\|\boldsymbol{\mu}\|)\,\boldsymbol{\mu}^{T}\boldsymbol{g}_{i}$, where $\boldsymbol{g}_{i}\sim N(0,\mathbf{I})$ and we recall that $\epsilon_{i}\in\{0,1\}$ encodes the class membership of node $i$. Then $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}$ follows a normal distribution with mean $(2\epsilon_{i}-1)\kappa\sigma$ and standard deviation $\sigma$. Part 1 of the corollary is exactly Proposition 7, whose proof is given in the main text, so we consider the remaining regimes of $\kappa$. For both classes $\epsilon_{i}=0$ and $\epsilon_{i}=1$, the probability of correct classification is $\Phi(\kappa)$ (using $0$ as the decision boundary). Therefore, by applying the additive Chernoff (Hoeffding) bound with deviation $\sqrt{n\log n}$, which yields a failure probability of at most $\exp(-2\log n)=n^{-2}\leq 2/n$, we have that

\[\begin{aligned}
&\mathop{{\bf Pr}\/}\Bigl[\mbox{at most $n\Phi(\kappa)-\sqrt{n\log n}$ nodes are correctly classified}\Bigr]\\
=\;&\mathop{{\bf Pr}\/}\left[\sum_{i\in[n]}\boldsymbol{1}_{\{\text{node $i$ is correctly classified}\}}\leq n\Phi(\kappa)-\sqrt{n\log n}\right]\leq\frac{2}{n}.
\end{aligned}\]

The proof is complete by noticing that $n\Phi(\kappa)-\sqrt{n\log n}=n(1-o(1))$ when $\kappa=\omega(1)$ and $n\Phi(\kappa)-\sqrt{n\log n}=n(\Phi(\kappa)-o(1))$ when $\kappa=O(1)$.
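As an illustration of the per-node success probability $\Phi(\kappa)$ used above, the following sketch draws $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}\sim N(\kappa\sigma,\sigma^{2})$ for a class-$1$ node and compares the empirical rate of $\tilde{\boldsymbol{w}}^{T}\mathbf{X}_{i}>0$ with $\Phi(\kappa)$; the parameter values and sample size are arbitrary illustrative choices.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
kappa, sigma, trials = 1.5, 2.0, 200_000   # illustrative values only

# Scores of class-1 nodes: mean kappa*sigma, standard deviation sigma.
scores = rng.normal(loc=kappa * sigma, scale=sigma, size=trials)

empirical = np.mean(scores > 0)            # fraction above the decision boundary 0
print(empirical, norm.cdf(kappa))          # both close to Phi(kappa), about 0.933 here
\end{verbatim}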
