
A Practical, Progressively-Expressive GNN

Lingxiao Zhao
Carnegie Mellon University
[email protected]
Louis Härtel
RWTH Aachen University
[email protected]
Neil Shah
Snap Inc.
[email protected]
Leman Akoglu
Carnegie Mellon University
[email protected]
Abstract

Message passing neural networks (MPNNs) have become a dominant flavor of graph neural networks (GNNs) in recent years. Yet, MPNNs come with notable limitations; namely, they are at most as powerful as the 1-dimensional Weisfeiler-Leman (1-WL) test in distinguishing graphs in a graph isomorphism testing framework. To this end, researchers have drawn inspiration from the $k$-WL hierarchy to develop more expressive GNNs. However, current $k$-WL-equivalent GNNs are not practical for even small values of $k$, as $k$-WL becomes combinatorially more complex as $k$ grows. At the same time, several works have found great empirical success in graph learning tasks without highly expressive models, implying that chasing expressiveness with a “coarse-grained ruler” of expressivity like $k$-WL is often unneeded in practical tasks. To truly understand the expressiveness-complexity tradeoff, one desires a more “fine-grained ruler,” which can more gradually increase expressiveness. Our work puts forth such a proposal: namely, we first propose the $(k,c)(\leq)$-SetWL hierarchy with greatly reduced complexity from $k$-WL, achieved by moving from $k$-tuples of nodes to sets with ${\leq}k$ nodes defined over ${\leq}c$ connected components in the induced original graph. We show favorable theoretical results for this model in relation to $k$-WL, and concretize it via $(k,c)(\leq)$-SetGNN, which is as expressive as $(k,c)(\leq)$-SetWL. Our model is practical and progressively-expressive, increasing in power with $k$ and $c$. We demonstrate effectiveness on several benchmark datasets, achieving several state-of-the-art results with runtime and memory usage applicable to practical graphs. We open source our implementation at https://github.com/LingxiaoShawn/KCSetGNN.

1 Introduction

In recent years, graph neural networks (GNNs) have gained considerable attention [58, 60] for their ability to tackle various node-level [35, 56], link-level [66, 51] and graph-level [9, 40] learning tasks, given their ability to learn rich representations for complex graph-structured data. The common template for designing GNNs follows the message passing paradigm; these so-called message-passing neural networks (MPNNs) are built by stacking layers which encompass feature transformation and aggregation operations over the input graph [39, 25]. Despite their advantages, MPNNs have several limitations including oversmoothing [67, 12, 47], oversquashing [2, 55], inability to distinguish node identities [63] and positions [62], and limited expressive power [61].

Since Xu et al.’s [61] seminal work showed that MPNNs are at most as powerful as the first-order Weisfeiler-Leman (1-WL) test in the graph isomorphism (GI) testing framework, there have been several follow-up works on improving the understanding of GNN expressiveness [3, 15]. In response, the community proposed many GNN models to overcome such limitations [52, 4, 68]. Several of these aim to design powerful higher-order GNNs which are increasingly expressive [43, 41, 33, 9], drawing inspiration from the $k$-WL hierarchy [54].

A key limitation of such higher-order models that reach beyond 1-WL is their poor scalability; in fact, these models can only be applied to small graphs with small $k$ in practice [40, 41, 43] due to the combinatorial growth in complexity. On the other hand, it remains an open question whether practical graph learning tasks indeed need such complex and extremely expressive GNN models. Historically, Babai et al. [5] showed that almost all graphs on $n$ nodes can be distinguished by 1-WL. In other contexts like node classification, researchers have encountered superior generalization performance with graph-augmented multi-layer perceptron (GA-MLP) models [50, 37] compared to MPNNs, despite the former being strictly less expressive than the latter [13]. Considering that increased model complexity has negative implications in terms of overfitting and generalization [29], it is worth re-evaluating continued efforts to pursue maximally expressive GNNs in practice. Ideally, one could study these models’ generalization performance on various real-world tasks by increasing $k$ (expressiveness) in the $k$-WL hierarchy. Yet, given the impracticality of this too-coarse “ruler” of expressiveness, which becomes infeasible beyond $k>3$, one desires a more fine-grained “ruler”, that is, a new hierarchy whose expressiveness grows more gradually. Such a hierarchy could enable us to gradually build more expressive models which admit improved scaling, and avoid unnecessary leaps in model complexity for tasks which do not require them, guarding against overfitting.

Figure 1: Main steps of $(k,c)(\leq)$-SetGNN. Given a graph and $(k,c)$, we build the $(k,c)$-bipartite super-graph (middle) containing sets with up to $k$ nodes and $c$ connected components in the induced subgraph, on which a base GNN assigns initial “colors”. Bidirectional bipartite GNN layers with forward-backward message passing learn set embeddings, which are pooled into the graph embedding. The size of the super-graph, and accordingly its expressiveness, grows progressively with increasing $k$ and $c$. The 2-bipartite message passing generalizes normal GNNs, edge GNNs, and line graphs (see Appendix A.12).

Present Work. In this paper, we propose such a hierarchy, and an associated practical, progressively-expressive GNN model, called $(k,c)(\leq)$-SetGNN, whose expressiveness can be modulated through $k$ and $c$. In a nutshell, we take inspiration from $k$-WL, yet achieve practicality through three design ideas which simplify key bottlenecks of scaling $k$-WL. First, we move away from the $k$-tuples of the original $k$-WL to $k$-multisets (unordered), and then to $k(\leq)$-sets. We demonstrate that these steps drastically reduce the number of nodes in the $k$-WL super-graph while retaining significant expressiveness (Secs. 3.2-3.3). Second, by considering the underlying sparsity of an input graph, we reduce scope to $(k,c)(\leq)$-sets that consist of ${\leq}k$ nodes whose induced subgraph is comprised of at most $c\leq k$ connected components. This also yields a massive reduction in the number of nodes while improving practicality; i.e., small values of $c$ on sparse graphs allow one to increase $k$ well beyond 3 in practice (Sec. 3.4). Third, we design a super-graph architecture that consists of a sequence of $k{-}1$ bipartite graphs over which we learn embeddings for our $(k,c)(\leq)$-sets using bidirectional message passing. These embeddings can be pooled to yield a final graph embedding (Secs. 4.1-4.2). We also speed up initializing “colors” for the $(k,c)(\leq)$-sets for $c>1$, substantially reducing computation and memory (Sec. 4.3). Fig. 1 overviews our proposed framework. Experimentally, our $(k,c)(\leq)$-SetGNN outperforms existing state-of-the-art GNNs on simulated expressiveness tasks as well as real-world graph-level tasks, achieving new bests on substructure counting and ZINC-12K, respectively. We show that generalization performance reflects increasing expressiveness by $k$ and $c$. Our proposed scalable designs allow us to train models with e.g. $k{=}10$ or $c{=}3$ with practical running time and memory requirements.

2 Related Work

MPNN limitations. Given the understanding that MPNNs have expressiveness upper-bounded by 1-WL [61], several researchers investigated what else MPNNs cannot learn. To this end, Chen et al. [15] showed that MPNNs cannot count induced connected substructures of 3 or more nodes, and, along with Arvind et al. [3], showed that MPNNs can only count star-shaped patterns. Loukas [38] further proved several results regarding decision problems on graphs (e.g. subgraph verification, cycle detection), finding that MPNNs cannot solve these problems unless strict conditions of depth and width are met. Moreover, Garg et al. [22] showed that many standard MPNNs cannot compute properties such as longest or shortest cycles, diameters, or the existence of cliques.

Improving expressivity. Several works aim to improve the expressiveness limitations of MPNNs. One approach is to inject features into the MPNN aggregation, motivated by Loukas [38], who showed that MPNNs can be universal approximators when nodes are sufficiently distinguishable. Sato et al. [53] show that injecting random features can better enable MPNNs to solve problems like minimum dominating set and maximum matching. You et al. [63] inject cycle counts as node features, motivated by the limitation that MPNNs cannot count cycles. Others proposed utilizing subgraph isomorphism counting to empower MPNNs with substructure information they cannot learn [10, 9]. Earlier work by You et al. [62] adds positional features to distinguish nodes which naïve MPNN embeddings would not. Recently, several works also propose utilizing subgraph aggregation; Zhang et al. [65], Zhao et al. [68] and Bevilacqua et al. [8] propose subgraph variants of the WL test which are no less powerful than 3-WL. A recent follow-up work [21] shows that rooted-subgraph-based extensions with 1-WL as the kernel are bounded by 3-WL. The community is also exploring several distinct avenues for overcoming these limitations: Murphy et al. [46] propose a relational pooling mechanism which sums over all permutations of a permutation-sensitive function to achieve above-1-WL expressiveness. Balcilar et al. [6] generalize spatial and spectral MPNNs and show that instances of spectral MPNNs are more powerful than 1-WL. Azizian et al. [4] and Geerts et al. [24] unify expressivity and approximation-ability results for existing GNNs.

$k$-WL-inspired GNNs. The $k$-WL test captures higher-order interactions in graph data by considering all $k$-tuples, i.e., size-$k$ ordered multisets defined over the set of nodes. While highly expressive, it does not scale to practical graphs beyond a very small $k$ ($k=3$ pushes the practical limit even for small graphs). However, several works propose designs which can achieve $k$-WL expressivity in theory and strive to make them practical. Maron et al. [41] propose a general class of invariant graph networks, $k$-IGNs, with exactly $k$-WL expressivity [23], but these are not scalable. Morris et al. have several works on $k$-WL and its variants, such as $k$-LWL [42], $k$-GNN [43], and $\delta$-$k$-WL-GNN [44]. Both $k$-LWL and $k$-GNN use a variant of $k$-WL that only considers $k$-sets; they are strictly weaker than $k$-WL but much more practical. In our paper we argue that $k(\leq)$-sets should be used with additional designs to retain the best expressivity while remaining practical. The $\delta$-$k$-WL-GNN works with $k$-tuples but sparsifies the connections among tuples by considering locality. Qian et al. [48] extend the $k$-tuple approach to subgraphs, but with the same scalability bottleneck as $k$-WL. A recent concurrent work, SpeqNet [45], proposes to reduce the number of tuples by restricting the number of connected components of the tuples’ induced subgraphs; this idea is independently explored in our paper in Sec. 3.4. None of the four variants proposed by Morris et al. is realized beyond $k=3$ in their experiments, while we manage to reach $k=10$. Interestingly, a recent work [34] links graph transformers with $k$-IGNs, which have better (or equal) expressivity, but is still not scalable.

3 A practical progressively-expressive isomorphism test: $(k,c)(\leq)$-SetWL

We motivate our method $(k,c)(\leq)$-SetGNN from the GI testing perspective by introducing $(k,c)(\leq)$-SetWL. We first introduce notation and background on $k$-WL (Sec. 3.1). Next, we show how to increase the practicality of $k$-WL without sacrificing much of its expressivity, by removing node ordering (Sec. 3.2), removing node repetitions (Sec. 3.3), and leveraging graph sparsity (Sec. 3.4). We then give a complexity analysis (Sec. 3.5). We close the section by extending the idea to $k$-FWL, which is as expressive as $(k{+}1)$-WL (Sec. 3.6). All proofs are included in Appendix A.4.

Notation: Let $G=(V(G),E(G),l_G)$ be an undirected, colored graph with nodes $V(G)$, edges $E(G)$, and a color-labeling function $l_G:V(G)\rightarrow C$, where $C=\{c_1,\ldots,c_d\}$ denotes a set of $d$ distinct colors. Let $[n]=\{1,2,3,\ldots,n\}$. Let $(\cdot)$ denote a tuple (ordered multiset), $\{\!\{\cdot\}\!\}$ denote a multiset (a set allowing repetition), and $\{\cdot\}$ denote a set. We define $\vec{v}=(v_1,\ldots,v_k)$ as a $k$-tuple, $\tilde{v}=\{\!\{v_1,\ldots,v_k\}\!\}$ as a $k$-multiset, and $\hat{v}=\{v_1,\ldots,v_k\}$ as a $k$-set. Let $\vec{v}[x/i]=(v_1,\ldots,v_{i-1},x,v_{i+1},\ldots,v_k)$ denote replacing the $i$-th element of $\vec{v}$ with $x$; $\tilde{v}[x/i]$ and $\hat{v}[x/i]$ are defined analogously (multisets and sets are assumed to be represented in a canonical ordering). When $v_i\in V(G)$, let $G[\vec{v}]$, $G[\tilde{v}]$, $G[\hat{v}]$ denote the subgraph of $G$ induced by the nodes in $\vec{v},\tilde{v},\hat{v}$ respectively (keeping repetitions). An isomorphism between two graphs $G$ and $G'$ is a bijection $p:V(G)\rightarrow V(G')$ satisfying $(u,v)\in E(G)\Longleftrightarrow(p(u),p(v))\in E(G')$ for all $u,v\in V(G)$, and $l_G(u)=l_{G'}(p(u))$ for all $u\in V(G)$. Two graphs $G,G'$ are isomorphic, denoted $G\cong G'$, if there exists an isomorphism between them.
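To fix intuition, here is a tiny sketch (ours, not from the paper) of the replacement operation $\vec{v}[x/i]$ for tuples and multisets, using sorted order as the canonical ordering assumed above:

```python
# Sketch of v[x/i] for a k-tuple and a k-multiset; positions i are 1-based,
# and multisets use sorted order as the canonical ordering.
def tuple_replace(v, x, i):
    """v[x/i] for a k-tuple v: position i is fixed, order is preserved."""
    return v[:i - 1] + (x,) + v[i:]

def multiset_replace(v, x, i):
    """v[x/i] for a k-multiset: replace the i-th element in canonical order."""
    w = sorted(v)
    w[i - 1] = x
    return tuple(sorted(w))

print(tuple_replace((3, 1, 1), 7, 2))     # (3, 7, 1)
print(multiset_replace((3, 1, 1), 7, 2))  # (1, 3, 7): 2nd-smallest replaced
```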

3.1 Preliminaries: the $k$-Weisfeiler-Leman ($k$-WL) Graph Isomorphism Test

The 1-dimensional Weisfeiler-Leman (1-WL) test, also known as the color refinement algorithm [49], is a widely celebrated approach to (approximately) test GI. Although extremely fast and effective for most graphs (1-WL provides canonical forms for all but an $n^{-1/7}$ fraction of $n$-vertex graphs [5]), it fails to distinguish members of many important graph families, such as regular graphs. The more powerful $k$-WL algorithm first appeared in [57], and extends the coloring of vertices (or 1-tuples) to that of $k$-tuples. $k$-WL is progressively expressive with increasing $k$, and can distinguish any finite set of graphs given a sufficiently large $k$. $k$-WL has many interesting connections to logic, games, and linear equations [11, 28]. Another variant of $k$-WL is $k$-FWL (Folklore WL), where $(k{+}1)$-WL is equally expressive as $k$-FWL [11, 26]. Our work focuses on $k$-WL.

$k$-WL iteratively recolors all $n^k$ $k$-tuples defined on a graph $G$ with $n$ nodes. At iteration 0, each $k$-tuple $\vec{v}=(v_1,\ldots,v_k)\in V(G)^k$ is initialized with a color given by its atomic type $at_k(G,\vec{v})$. Assuming $G$ has $d$ colors, $at_k(G,\vec{v})\in\{0,1\}^{2\binom{k}{2}+kd}$ is an ordered vector encoding. The first $\binom{k}{2}$ entries indicate whether $v_i=v_j$, for all $1\leq i<j\leq k$, which is the node-repetition information. The second $\binom{k}{2}$ entries indicate whether $(v_i,v_j)\in E(G)$. The last $kd$ entries one-hot encode the initial color of $v_i$, for all $1\leq i\leq k$. Importantly, $at_k(G,\vec{v})=at_k(G',\vec{v}')$ if and only if $v_i\mapsto v'_i$ is an isomorphism from $G[\vec{v}]$ to $G'[\vec{v}']$. Let $wl_k^{(t)}(G,\vec{v})$ denote the color of $k$-tuple $\vec{v}$ on graph $G$ at the $t$-th iteration of $k$-WL, with colors initialized as $wl_k^{(0)}(G,\vec{v})=at_k(G,\vec{v})$.

At the $t$-th iteration, $k$-WL updates the color of each $\vec{v}\in V(G)^k$ according to

$$wl_k^{(t+1)}(G,\vec{v})=\text{HASH}\Big(wl_k^{(t)}(G,\vec{v}),\ \{\!\{wl_k^{(t)}(G,\vec{v}[x/1])\,|\,x\in V(G)\}\!\},\ \ldots,\ \{\!\{wl_k^{(t)}(G,\vec{v}[x/k])\,|\,x\in V(G)\}\!\}\Big) \quad (1)$$

Let $gwl_k^{(t)}(G)$ denote the encoding of $G$ at the $t$-th iteration of $k$-WL. Then,

$$gwl_k^{(t)}(G)=\text{HASH}\Big(\{\!\{wl_k^{(t)}(G,\vec{v})\,|\,\vec{v}\in V(G)^k\}\!\}\Big) \quad (2)$$

Two $k$-tuples $\vec{v}$ and $\vec{u}$ are connected by an edge if $|\{i\in[k]\,|\,v_i=u_i\}|=k-1$, or informally, if they share $k{-}1$ entries. $k$-WL thus defines a super-graph $S_{k\text{-wl}}(G)$ whose nodes are all $k$-tuples in $G$, with edges defined as above. Eq. (1) defines the rule of color refinement on $S_{k\text{-wl}}(G)$. Intuitively, $k$-WL is akin to (but more powerful than, as it orders subgroups of neighbors) running the 1-WL algorithm on the super-graph $S_{k\text{-wl}}(G)$. As $t\rightarrow\infty$, the color $wl_k^{(t+1)}(G,\vec{v})$ converges to a stable value, denoted $wl_k^{(\infty)}(G,\vec{v})$, with the corresponding stable graph color denoted $gwl_k^{(\infty)}(G)$. For two non-isomorphic graphs $G,G'$, $k$-WL can successfully distinguish them if $gwl_k^{(\infty)}(G)\neq gwl_k^{(\infty)}(G')$. The expressivity of $wl_k^{(t)}$ can be exactly characterized by first-order logic with counting quantifiers: let $\mathbf{C}_k^t$ denote all first-order formulas with at most $k$ variables and $t$-depth counting quantifiers; then $wl_k^{(t)}(G,\vec{v})=wl_k^{(t)}(G',\vec{v}')\Longleftrightarrow\forall\phi\in\mathbf{C}_k^t,\ \phi(G,\vec{v})=\phi(G',\vec{v}')$. Additionally, there is a $t$-step bijective pebble game that is equivalent to $t$-iteration $k$-WL in expressivity; see Appendix A.2 for the pebble game characterization of $k$-WL.

Despite its power, $k$-WL uses all $n^k$ tuples and has $O(kn^k)$ complexity at each iteration.
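For concreteness, the following is a brute-force sketch (ours, not from the paper) of Eqs. (1)-(2); Python's built-in `hash` stands in for an injective HASH, `edges` is assumed to contain both directions of each undirected edge, and only tiny graphs are feasible since all $n^k$ tuples are materialized:

```python
from itertools import product

def atomic_type(edges, colors, v):
    """at_k(G, v): node-repetition pattern, induced adjacency, entry colors."""
    k = len(v)
    rep = tuple(v[i] == v[j] for i in range(k) for j in range(i + 1, k))
    adj = tuple((v[i], v[j]) in edges for i in range(k) for j in range(i + 1, k))
    return (rep, adj, tuple(colors[x] for x in v))

def kwl(n, edges, colors, k, iters):
    """Refine the colors of all n^k tuples for `iters` rounds (Eq. (1))."""
    V = range(n)
    wl = {v: atomic_type(edges, colors, v) for v in product(V, repeat=k)}
    for _ in range(iters):
        wl = {v: hash((wl[v],
                       tuple(  # for each position i, the multiset over v[x/i]
                           tuple(sorted(hash(wl[v[:i] + (x,) + v[i + 1:]]) for x in V))
                           for i in range(k))))
              for v in wl}
    # Graph-level encoding (Eq. (2)): the multiset of all tuple colors.
    return sorted(wl.values())
```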

3.2 From $k$-WL to $k$-MultisetWL: Removing Ordering

Our first proposed adaptation to $k$-WL is to remove the ordering in each $k$-tuple, i.e., changing $k$-tuples to $k$-multisets. This reduces the number of supernodes to consider by a factor of up to $k!$.

Let $\tilde{v}=\{\!\{v_1,\ldots,v_k\}\!\}$ be the multiset corresponding to tuple $\vec{v}=(v_1,\ldots,v_k)$. We introduce a canonical order function on $G$, $o_G:V(G)\rightarrow[n]$, and a corresponding indexing function $o_G^{-1}(\tilde{v},i)$, which returns the $i$-th element of $v_1,\ldots,v_k$ sorted according to $o_G$. Let $mwl_k^{(t)}(G,\tilde{v})$ denote the color of the $k$-multiset $\tilde{v}$ at the $t$-th iteration of $k$-MultisetWL, formally defined next.

At $t=0$, we initialize the color of $\tilde{v}$ as $mwl_k^{(0)}(G,\tilde{v})=\text{HASH}\big(\{\!\{at_k(G,p(\tilde{v}))\,|\,p\in\text{perm}[k]\}\!\}\big)$, where $\text{perm}[k]$ denotes the set of all permutation mappings of $k$ elements (this function should also account for repeated elements; we omit this for clarity of presentation). It can be shown that $mwl_k^{(0)}(G,\tilde{v})=mwl_k^{(0)}(G',\tilde{v}')$ if and only if $G[\tilde{v}]$ and $G'[\tilde{v}']$ are isomorphic.

At the $t$-th iteration, $k$-MultisetWL updates the color of every $k$-multiset by

$$mwl_k^{(t+1)}(G,\tilde{v})=\text{HASH}\Big(mwl_k^{(t)}(G,\tilde{v}),\ \{\!\{\ \{\!\{mwl_k^{(t)}(G,\tilde{v}[x/o_G^{-1}(\tilde{v},1)])\,|\,x\in V(G)\}\!\},\ \ldots,\ \{\!\{mwl_k^{(t)}(G,\tilde{v}[x/o_G^{-1}(\tilde{v},k)])\,|\,x\in V(G)\}\!\}\ \}\!\}\Big) \quad (3)$$

where $\tilde{v}[x/o_G^{-1}(\tilde{v},i)]$ denotes replacing the $i$-th (ordered by $o_G$) element of the multiset with $x\in V(G)$. Let $S_{k\text{-mwl}}(G)$ denote the super-graph defined by $k$-MultisetWL. Similar to Eq. (2), the graph-level encoding is $gmwl_k^{(t)}(G)=\text{HASH}\big(\{\!\{mwl_k^{(t)}(G,\tilde{v})\,|\,\tilde{v}\in V(S_{k\text{-mwl}}(G))\}\!\}\big)$.

Interestingly, although $k$-MultisetWL has significantly fewer node groups than $k$-WL, we show that it is no less powerful than $(k{-}1)$-WL in distinguishing graphs, while being upper bounded by $k$-WL in distinguishing both node groups and graphs.

Theorem 1.

Let $k\geq 1$ and define $wl_k^{(t)}(G,\tilde{v}):=\{\!\{wl_k^{(t)}(G,p(\tilde{v}))\,|\,p\in\text{perm}[k]\}\!\}$. For all $t\in\mathbb{N}$ and all graphs $G,G'$: $k$-MultisetWL is upper bounded by $k$-WL in distinguishing multisets $G,\tilde{v}$ and $G',\tilde{v}'$ at the $t$-th iteration, i.e., $wl_k^{(t)}(G,\tilde{v})=wl_k^{(t)}(G',\tilde{v}')\Longrightarrow mwl_k^{(t)}(G,\tilde{v})=mwl_k^{(t)}(G',\tilde{v}')$.

Theorem 2.

$k$-MultisetWL is no less powerful than $(k{-}1)$-WL in distinguishing graphs: for any $k\geq 3$ there exist graphs that can be distinguished by $k$-MultisetWL but not by $(k{-}1)$-WL.

Theorem 2 is proved using a variant of a series of CFI [11] graphs which cannot be distinguished by $k$-WL. This theorem shows that $k$-MultisetWL is indeed very powerful, and that finding counterexamples of $k$-WL-distinguishable graphs that cannot be distinguished by $k$-MultisetWL is very hard. Hence we conjecture that $k$-MultisetWL may have the same expressivity as $k$-WL in distinguishing undirected graphs, with an attempted proof in Appendix A.4.10. Additionally, the theorem also implies that $k$-MultisetWL is strictly more powerful than $(k{-}1)$-MultisetWL.

We next give a pebble game characterization of $k$-MultisetWL, named the doubly bijective $k$-pebble game, presented in Appendix A.3. The game is used in the proof of Theorem 2.

Theorem 3.

$k$-MultisetWL has the same expressivity as the doubly bijective $k$-pebble game.

3.3 From $k$-MultisetWL to $k(\leq)$-SetWL: Removing Repetition

Next, we propose further removing repetition inside any $k$-multiset, i.e., transforming $k$-multisets to $k(\leq)$-sets. We assume elements of a $k$-multiset $\tilde{v}$ and a $k(\leq)$-set $\hat{v}$ in $G$ are sorted based on the ordering function $o_G$, and omit $o_G$ for clarity. Let $s(\cdot)$ transform a multiset to a set by removing repeats, and let $r(\cdot)$ return a tuple with the number of repeats for each distinct element of a multiset. Specifically, let $\hat{v}=s(\tilde{v})$ and $\hat{n}=r(\tilde{v})$; then $m:=|\hat{v}|=|\hat{n}|\in[k]$ denotes the number of distinct elements in the $k$-multiset $\tilde{v}$, and for all $i\in[m]$, $\hat{v}_i$ is the $i$-th distinct element with $\hat{n}_i$ repetitions. Clearly there is an injective mapping between $\tilde{v}$ and $(\hat{v},\hat{n})$; let $f$ be the inverse mapping such that $\tilde{v}=f(s(\tilde{v}),r(\tilde{v}))$. Equivalently, each $m$-set $\hat{v}$ can be mapped to a multiset of $k$-multisets: $\hat{v}\leftrightarrow\{\!\{\tilde{v}=f(\hat{v},\hat{n})\,|\,\sum_{i=1}^m \hat{n}_i=k,\ \hat{n}_i\geq 1\ \forall i\}\!\}$. Based on this relationship, we extend the connections among $k$-multisets to $k(\leq)$-sets: given $m_1,m_2\in[k]$, an $m_1$-set $\hat{v}$ is connected to an $m_2$-set $\hat{u}$ if and only if there exist $\hat{n}_v,\hat{n}_u$ such that $f(\hat{v},\hat{n}_v)$ is connected with $f(\hat{u},\hat{n}_u)$ in $k$-MultisetWL. Let $S_{k\text{-swl}}(G)$ denote the super-graph defined on $G$ by $k(\leq)$-SetWL. It can be shown that this connectivity condition is equivalent to either (1) $(|m_1-m_2|=1)\land(|\hat{v}\cap\hat{u}|=\min(m_1,m_2))$ or (2) $(m_1=m_2)\land(|\hat{v}\cap\hat{u}|=m_1-1)$ being true. Notice that $S_{k\text{-swl}}(G)$ contains a sequence of $k{-}1$ bipartite graphs, each reflecting the connections between the $(m{-}1)$-sets and the $m$-sets. It also contains $k{-}1$ subgraphs, i.e., the connections among the $m$-sets for $m=2,\ldots,k$. Later we show that these $k{-}1$ subgraphs can be ignored without affecting $k(\leq)$-SetWL.
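The mappings $s(\cdot)$, $r(\cdot)$, $f(\cdot,\cdot)$ are simple to realize; below is a small sketch (ours), with a helper enumerating the repetition vectors $\hat{n}$ (there are $\binom{k-1}{m-1}$ of them) that map one $m$-set back to $k$-multisets:

```python
from collections import Counter

def s(multiset):
    """s(.): drop repeats, keep canonical (sorted) order."""
    return tuple(sorted(set(multiset)))

def r(multiset):
    """r(.): repetition count of each distinct element, in canonical order."""
    cnt = Counter(multiset)
    return tuple(cnt[x] for x in sorted(cnt))

def f(set_v, n_hat):
    """Inverse mapping: rebuild the multiset from (set, repetition counts)."""
    return tuple(sorted(x for x, c in zip(set_v, n_hat) for _ in range(c)))

def repetition_vectors(k, m):
    """All n_hat with m positive entries summing to k: C(k-1, m-1) of them."""
    if m == 1:
        yield (k,)
        return
    for first in range(1, k - m + 2):
        for rest in repetition_vectors(k - first, m - 1):
            yield (first,) + rest

v = (1, 1, 2, 5)                             # a 4-multiset
assert f(s(v), r(v)) == v                    # (s, r) and f are mutually inverse
print(len(list(repetition_vectors(4, 2))))   # 3 = C(3, 1)
```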

Let $swl_k^{(t)}(G,\hat{v})$ denote the color of an $m$-set $\hat{v}$ at the $t$-th iteration of $k(\leq)$-SetWL, which we now formally define. At $t=0$, we initialize the color of an $m$-set $\hat{v}$ ($m\in[k]$) as:

$$swl_k^{(0)}(G,\hat{v})=\text{HASH}\big(\{\!\{mwl_k^{(0)}(G,f(\hat{v},\hat{n}))\,|\,\hat{n}_1+\ldots+\hat{n}_m=k,\ \hat{n}_i\geq 1\ \forall i\}\!\}\big) \quad (4)$$

Clearly $swl_k^{(0)}(G,\hat{v})=swl_k^{(0)}(G',\hat{v}')$ if and only if $G[\hat{v}]$ and $G'[\hat{v}']$ are isomorphic. At the $t$-th iteration, $k(\leq)$-SetWL updates the color of every $m$-set $\hat{v}$ by

$$swl_k^{(t+1)}(G,\hat{v})=\text{HASH}\Big(swl_k^{(t)}(G,\hat{v}),\ \{\!\{swl_k^{(t)}(G,\hat{v}\cup\{x\})\,|\,x\in V(G)\setminus\hat{v}\}\!\},\ \{\!\{swl_k^{(t)}(G,\hat{v}\setminus x)\,|\,x\in\hat{v}\}\!\},\ \{\!\{\ \{\!\{swl_k^{(t)}(G,\hat{v}[x/o_G^{-1}(\hat{v},1)])\,|\,x\in V(G)\setminus\hat{v}\}\!\},\ \ldots,\ \{\!\{swl_k^{(t)}(G,\hat{v}[x/o_G^{-1}(\hat{v},m)])\,|\,x\in V(G)\setminus\hat{v}\}\!\}\ \}\!\}\Big) \quad (5)$$

Notice that when $m=1$ and $m=k$, the third and second parts of the hashing input are empty multisets, respectively. Similar to Eq. (2), we formulate the graph-level encoding as $gswl_k^{(t)}(G)=\text{HASH}\big(\{\!\{swl_k^{(t)}(G,\hat{v})\,|\,\hat{v}\in V(S_{k\text{-swl}}(G))\}\!\}\big)$.

To characterize the relationship between their expressivities, we first extend $k$-MultisetWL to sets by defining the color of an $m$-set $\hat{v}$ under $k$-MultisetWL as $mwl_k^{(t)}(G,\hat{v}):=\{\!\{mwl_k^{(t)}(G,f(\hat{v},\hat{n}))\,|\,\sum_{i=1}^m \hat{n}_i=k,\ \hat{n}_i\geq 1\ \forall i\}\!\}$. We prove that $k$-MultisetWL is at least as expressive as $k(\leq)$-SetWL in terms of separating node sets and graphs.

Theorem 4.

Let $k\geq 1$; then for all $t\in\mathbb{N}$ and all graphs $G,G'$: $mwl_k^{(t)}(G,\hat{v})=mwl_k^{(t)}(G',\hat{v}')\Longrightarrow swl_k^{(t)}(G,\hat{v})=swl_k^{(t)}(G',\hat{v}')$.

We also conjecture that $k$-MultisetWL and $k(\leq)$-SetWL could be equally expressive, and leave this to future work. As a single $m$-set corresponds to $\binom{k-1}{m-1}$ $k$-multisets, moving from $k$-MultisetWL to $k(\leq)$-SetWL further reduces the computational cost greatly.

3.4 From $k(\leq)$-SetWL to $(k,c)(\leq)$-SetWL: Accounting for Sparsity

Notice that for two arbitrary graphs $G$ and $G'$ with an equal number of nodes, the number of $k(\leq)$-sets and the connections among them in $k(\leq)$-SetWL are exactly the same, regardless of whether the graphs are dense or sparse. We next propose to account for the sparsity of a graph $G$ to further reduce the complexity of $k(\leq)$-SetWL. As the graph structure is encoded inside every $m$-set, when the graph becomes sparser, there are more sparse $m$-sets with a potentially large number of disconnected components. Based on the hypothesis that the subgraph induced by a set (of nodes) with fewer disconnected components naturally contains more structural information, we propose to restrict the $k(\leq)$-sets to $(k,c)(\leq)$-sets: all sets with at most $k$ nodes and at most $c$ connected components in the induced subgraph. Let $S_{k,c\text{-swl}}(G)$ denote the super-graph defined by $(k,c)(\leq)$-SetWL; then $S_{k,c\text{-swl}}(G)=S_{k\text{-swl}}(G)[\{\hat{v}\,|\,\#\text{components}(G[\hat{v}])\leq c\}]$, which is an induced subgraph of the super-graph defined by $k(\leq)$-SetWL. Fortunately, $S_{k,c\text{-swl}}(G)$ can be efficiently and recursively constructed from $S_{(k-1,c)\text{-swl}}(G)$; we include the construction algorithm in Appendix A.5. $(k,c)(\leq)$-SetWL is defined analogously to $k(\leq)$-SetWL (Eq. (4) and Eq. (5)), while removing all colors of sets that do not exist in $S_{k,c\text{-swl}}(G)$.
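To illustrate the object being constructed (though not the exact recursive algorithm of Appendix A.5), here is a brute-force sketch (ours, using networkx) that enumerates the $(k,c)(\leq)$-sets by growing valid sets one node at a time:

```python
import networkx as nx

def kc_sets(G, k, c):
    """All node sets with <= k nodes whose induced subgraph has <= c connected
    components. Every valid m-set contains a valid (m-1)-set (drop a non-cut
    vertex of some component), so growing one node at a time is complete."""
    frontier = {frozenset([v]) for v in G.nodes}   # 1-sets: one component each
    found = set(frontier)
    for _ in range(k - 1):
        nxt = set()
        for S in frontier:
            for v in set(G.nodes) - S:
                T = S | {v}
                if nx.number_connected_components(G.subgraph(T)) <= c:
                    nxt.add(frozenset(T))
        found |= nxt
        frontier = nxt
    return found

# On a 6-cycle with k=3, c=1: 6 nodes + 6 edges + 6 connected triples.
print(len(kc_sets(nx.cycle_graph(6), 3, 1)))  # 18
```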

$(k,c)(\leq)$-SetWL is progressively expressive with increasing $k$ and $c$, and when $c=k$, $(k,c)(\leq)$-SetWL becomes identical to $k(\leq)$-SetWL, as all $k(\leq)$-sets are then considered. Let $swl_{k,c}^{(t)}(G,\hat{v})$ denote the color of a $(k,c)(\leq)$-set $\hat{v}$ at the $t$-th iteration of $(k,c)(\leq)$-SetWL; then $gswl_{k,c}^{(t)}(G)=\text{HASH}\big(\{\!\{swl_{k,c}^{(t)}(G,\hat{v})\,|\,\hat{v}\in V(S_{k,c\text{-swl}}(G))\}\!\}\big)$.

Theorem 5.

Let $k\geq 1$; then for all $t\in\mathbb{N}$ and all graphs $G,G'$:

  • (1) when $1\leq c_1<c_2\leq k$, if $G,G'$ cannot be distinguished by $(k,c_2)(\leq)$-SetWL, they cannot be distinguished by $(k,c_1)(\leq)$-SetWL;

  • (2) when $k_1<k_2$, for all $c\leq k_1$, if $G,G'$ cannot be distinguished by $(k_2,c)(\leq)$-SetWL, they cannot be distinguished by $(k_1,c)(\leq)$-SetWL.

3.5 Complexity Analysis

All color refinement algorithms described above run on a super-graph; thus, their complexity at each iteration is linear in the number of supernodes and edges of the super-graph. Instead of using big-$\mathcal{O}$ notation, which ignores constant factors, we compare the exact numbers of supernodes and edges. Let $G$ be the input graph with $n$ nodes and average degree $d$. For $k$-WL, there are $n^k$ supernodes and each has $n\cdot k$ neighbors; hence $S_{k\text{-wl}}(G)$ has $n^k$ supernodes and $kn^{k+1}/2$ edges. For $m\in[k]$, there are $\binom{n}{m}$ $m$-sets and each connects to $m$ $(m{-}1)$-sets. So $S_{k\text{-swl}}$ has $\sum_{i=1}^k \binom{n}{i}\leq\binom{n}{k}\frac{n-k+1}{n-2k+1}$ supernodes and $\sum_{i=2}^k i\binom{n}{i}=n\sum_{i=1}^{k-1}\binom{n-1}{i}\leq n\binom{n-1}{k-1}\frac{n-k+1}{n-2k+2}$ edges (derivation in Appendix). Here we ignore edges within $m$-sets for any $m\in[k]$, as they can be reconstructed from the bipartite connections between $(m{-}1)$-sets and $m$-sets, described in detail in Sec. 4.1. Consider, e.g., $n=30,k=5$; we get $\frac{|V(S_{k\text{-wl}}(G))|}{|V(S_{k\text{-swl}}(G))|}=139$ and $\frac{|E(S_{k\text{-wl}}(G))|}{|E(S_{k\text{-swl}}(G))|}=2182$. Directly analyzing the savings from restricting the number of components is not possible without assuming a graph family. In Sec. 5.3 we measure the scalability of $(k,c)(\leq)$-SetGNN with different numbers of components directly on sparse graphs, where $(k,c)(\leq)$-SetWL shares similar scalability.
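The $n{=}30$, $k{=}5$ ratios above can be verified directly (a quick numeric check, ours):

```python
from math import comb

n, k = 30, 5
V_wl, E_wl = n**k, k * n**(k + 1) // 2                # k-WL supernodes / edges
V_set = sum(comb(n, i) for i in range(1, k + 1))      # all m-sets, m = 1..k
E_set = sum(i * comb(n, i) for i in range(2, k + 1))  # bipartite edges
print(round(V_wl / V_set))   # 139: supernode ratio
print(round(E_wl / E_set))   # 2182: edge ratio
```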

3.6 Set version of $k$-FWL

$k$-FWL is a stronger GI algorithm with the same expressivity as $(k{+}1)$-WL [26]. In this section we demonstrate how to likewise extend the set treatment to $k$-FWL, obtaining $k(\leq)$-SetFWL. Let $fwl_k^{(t)}(G,\vec{v})$ denote the color of $k$-tuple $\vec{v}$ at the $t$-th iteration of $k$-FWL. $k$-FWL is initialized the same as $k$-WL, i.e., $fwl_k^{(0)}(G,\vec{v})=wl_k^{(0)}(G,\vec{v})$. At the $t$-th iteration, $k$-FWL updates colors with

$$fwl_k^{(t+1)}(G,\vec{v})=\text{HASH}\Big(fwl_k^{(t)}(G,\vec{v}),\ \{\!\{\big(fwl_k^{(t)}(G,\vec{v}[x/1]),\ldots,fwl_k^{(t)}(G,\vec{v}[x/k])\big)\,|\,x\in V(G)\}\!\}\Big) \quad (6)$$

Let $sfwl_k^{(t)}(G,\hat{v})$ denote the color of an $m$-set $\hat{v}$ at the $t$-th iteration of $k(\leq)$-SetFWL. At the $t$-th iteration it updates with

$$sfwl_k^{(t+1)}(G,\hat{v})=\text{HASH}\Big(sfwl_k^{(t)}(G,\hat{v}),\ \{\!\{\ \{\!\{sfwl_k^{(t)}(G,\hat{v}[x/1]),\ldots,sfwl_k^{(t)}(G,\hat{v}[x/m])\}\!\}\,|\,x\in V(G)\}\!\}\Big) \quad (7)$$

$k(\leq)$-SetFWL should have better expressivity than $k(\leq)$-SetWL. We show in the next section that $k(\leq)$-SetWL can be further improved with less computation through an intermediate step, while this is nontrivial for $k(\leq)$-SetFWL. We leave the study of $k(\leq)$-SetFWL to future work.

4 A practical progressively-expressive GNN: $(k,c)(\leq)$-SetGNN

In this section we transform $(k,c)(\leq)$-SetWL into a GNN model by replacing the HASH function in Eq. (5) with a combination of MLP and DeepSet [64], given that they are universal function approximators for vectors and sets, respectively [31, 64]. After the transformation, we propose two additional improvements (Sec. 4.2 and Sec. 4.3) to further improve scalability. We work with vector-attributed graphs: let $G=(V(G),E(G),X)$ be an undirected graph with node features $\mathbf{x}_i\in\mathbb{R}^d$ for all $i\in V(G)$.

4.1 From $(k,c)(\leq)$-SetWL to $(k,c)(\leq)$-SetGNN

$(k,c)(\leq)$-SetWL defines a super-graph $S_{k,c\text{-swl}}$ which aligns with Eq. (5). We first rewrite Eq. (5) to reflect its connection to $S_{k,c\text{-swl}}$. For a supernode $\hat{v}$ in $S_{k,c\text{-swl}}$, let $\mathcal{N}^G_{\text{left}}(\hat{v})=\{\hat{u}\,|\,\hat{u}\in S_{k,c\text{-swl}},\ \hat{u}\leftrightarrow\hat{v}\text{ and }|\hat{u}|=|\hat{v}|-1\}$ and $\mathcal{N}^G_{\text{right}}(\hat{v})=\{\hat{u}\,|\,\hat{u}\in S_{k,c\text{-swl}},\ \hat{u}\leftrightarrow\hat{v}\text{ and }|\hat{u}|=|\hat{v}|+1\}$. Then we can rewrite Eq. (5) for $(k,c)(\leq)$-SetWL as

$$swl_{k,c}^{(t+1)}(G,\hat{v})=\Big(swl_{k,c}^{(t)}(G,\hat{v}),\ swl_{k,c}^{(t+\frac{1}{2})}(G,\hat{v}),\ \{\!\{swl_{k,c}^{(t)}(G,\hat{u})\,|\,\hat{u}\in\mathcal{N}^G_{\text{left}}(\hat{v})\}\!\},\ \{\!\{swl_{k,c}^{(t+\frac{1}{2})}(G,\hat{u})\,|\,\hat{u}\in\mathcal{N}^G_{\text{left}}(\hat{v})\}\!\}\Big) \quad (8)$$

where $swl_{k,c}^{(t+\frac{1}{2})}(G,\hat{v}):=\{\!\{swl_{k,c}^{(t)}(G,\hat{u})\,|\,\hat{u}\in\mathcal{N}^G_{\text{right}}(\hat{v})\}\!\}$. Notice that we omit HASH and apply it implicitly. Eq. (8) essentially splits the computation of Eq. (5) into two steps and avoids repeated computation by caching the explicit $t{+}\frac{1}{2}$ step. It also implies that the connections among $m$-sets for any $m\in[k]$ can be reconstructed from the bipartite graph between $m$-sets and $(m{-}1)$-sets.

Next we formulate $(k,c)(\leq)$-SetGNN formally. Let $h^{(t)}(\hat{v})\in\mathbb{R}^{d_t}$ denote the vector representation of supernode $\hat{v}$ at the $t$-th iteration of $(k,c)(\leq)$-SetGNN. For any input graph $G$, the representations of supernodes are initialized by

$$h^{(0)}(\hat{v})=\text{BaseGNN}(G[\hat{v}]) \quad (9)$$

where BaseGNN can be any GNN model. Theoretically, BaseGNN should be chosen to encode non-isomorphic induced subgraphs distinctly, mimicking HASH. Based on empirical tests of several GNNs on all 11,117 possible 8-node non-isomorphic graphs [6], and given that GIN [61] is simple and fast with a nearly $100\%$ separation rate, we use GIN as our BaseGNN in experiments.

$(k,c)(\leq)$-SetGNN iteratively updates the representations of all supernodes by

$$h^{(t+\frac{1}{2})}(\hat{v})=\sum_{\hat{u}\in\mathcal{N}^G_{\text{right}}(\hat{v})}\text{MLP}^{(t+\frac{1}{2})}\big(h^{(t)}(\hat{u})\big) \quad (10)$$
$$h^{(t+1)}(\hat{v})=\text{MLP}^{(t)}\Big(h^{(t)}(\hat{v}),\ h^{(t+\frac{1}{2})}(\hat{v}),\ \sum_{\hat{u}\in\mathcal{N}^G_{\text{left}}(\hat{v})}\text{MLP}^{(t)}_A\big(h^{(t)}(\hat{u})\big),\ \sum_{\hat{u}\in\mathcal{N}^G_{\text{left}}(\hat{v})}\text{MLP}^{(t)}_B\big(h^{(t+\frac{1}{2})}(\hat{u})\big)\Big) \quad (11)$$

Then, after $T$ iterations, we compute the graph-level encoding as

$$h^{(T)}(G)=\text{POOL}\big(\{\!\{h^{(T)}(\hat{v})\,|\,\hat{v}\in V(S_{k,c\text{-swl}})\}\!\}\big) \quad (12)$$

where POOL can be chosen as summation. We visualize these steps in Figure 1. Under mild conditions, $(k,c)(\leq)$-SetGNN has the same expressivity as $(k,c)(\leq)$-SetWL.
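To make Eqs. (10)-(12) concrete, below is a minimal PyTorch sketch (ours, not the released implementation) of one layer over the precomputed bipartite super-graph; the tensor layout and names such as `right_edges` are our assumptions:

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))

class SetGNNLayer(nn.Module):
    """One (k,c)(<=)-SetGNN layer (Eqs. (10)-(11)) with sum aggregation."""
    def __init__(self, d):
        super().__init__()
        self.f_half = mlp(d, d)                 # MLP^{(t+1/2)} in Eq. (10)
        self.f_A, self.f_B = mlp(d, d), mlp(d, d)
        self.f_out = mlp(4 * d, d)              # outer MLP^{(t)} in Eq. (11)

    def forward(self, h, right_edges):
        # h: [num_supernodes, d]. right_edges: [2, E] index pairs (small, large)
        # where |large set| = |small set| + 1, over all k-1 bipartite blocks.
        small, large = right_edges
        h_half = torch.zeros_like(h)            # Eq. (10): right -> left
        h_half.index_add_(0, small, self.f_half(h[large]))
        agg_A = torch.zeros_like(h)             # Eq. (11): left -> right
        agg_A.index_add_(0, large, self.f_A(h[small]))
        agg_B = torch.zeros_like(h)
        agg_B.index_add_(0, large, self.f_B(h_half[small]))
        return self.f_out(torch.cat([h, h_half, agg_A, agg_B], dim=-1))

# Graph encoding (Eq. (12)): sum-pool the final supernode embeddings,
# e.g. h_G = h.sum(dim=0) after T layers.
```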

Theorem 6.

When (i) BaseGNN can distinguish any non-isomorphic graphs with at most $k$ nodes, (ii) all MLPs have sufficient depth and width, and (iii) POOL is an injective function, then for any $t\in\mathbb{N}$, the $t$-layer $(k,c)(\leq)$-SetGNN is as expressive as $(k,c)(\leq)$-SetWL at the $t$-th iteration.

The following facts can be derived easily from Theorem 5 and Theorem 6.

Corollary 6.1.

$(k,c)(\leq)$-SetGNN is progressively-expressive with increasing $k$ and $c$, that is:

  • (1) when $c_1>c_2$, $(k,c_1)(\leq)$-SetGNN is more expressive than $(k,c_2)(\leq)$-SetGNN, and

  • (2) when $k_1>k_2$, $(k_1,c)(\leq)$-SetGNN is more expressive than $(k_2,c)(\leq)$-SetGNN.

4.2 Bidirectional Sequential Message Passing

The $t$-th layer of $(k,c)(\leq)$-SetGNN (Eq. (10) and Eq. (11)) essentially propagates information back and forth on the super-graph $S_{k,c\text{-swl}}(G)$, which is a sequence of $k{-}1$ bipartite graphs (see the middle of Figure 1), in parallel for all supernodes. We propose to change this to bidirectional blockwise sequential message passing, which we call $(k,c)(\leq)$-SetGNN$^*$, defined as follows.

$$\text{for }m=k{-}1\text{ down to }1,\ \forall m\text{-set }\hat{v}:\quad h^{(t+\frac{1}{2})}(\hat{v})=\text{MLP}_{m,1}^{(t)}\Big(h^{(t)}(\hat{v}),\sum_{\hat{u}\in\mathcal{N}^G_{\text{right}}(\hat{v})}\text{MLP}_{m,2}^{(t)}\big(h^{(t+\frac{1}{2})}(\hat{u})\big)\Big) \quad (13)$$
$$\text{for }m=2\text{ to }k,\ \forall m\text{-set }\hat{v}:\quad h^{(t+1)}(\hat{v})=\text{MLP}_{m,1}^{(t+\frac{1}{2})}\Big(h^{(t+\frac{1}{2})}(\hat{v}),\sum_{\hat{u}\in\mathcal{N}^G_{\text{left}}(\hat{v})}\text{MLP}_{m,2}^{(t+\frac{1}{2})}\big(h^{(t+1)}(\hat{u})\big)\Big) \quad (14)$$

Notice that $(k,c)(\leq)$-SetGNN$^*$ has lower memory usage: $(k,c)(\leq)$-SetGNN loads the complete super-graph (all $k{-}1$ bipartite graphs) at once, while $(k,c)(\leq)$-SetGNN$^*$ loads one of the $k{-}1$ bipartite graphs at a time, which is beneficial on limited-size GPUs. What is more, for a small, finite $t$ it is even more expressive than $(k,c)(\leq)$-SetGNN. We provide implementations of both parallel and sequential message passing in the official GitHub repository, but only report the performance of $(k,c)(\leq)$-SetGNN$^*$ given its efficiency and better expressivity.
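A sketch (ours) of one bidirectional sequential sweep (Eqs. (13)-(14)); `bipartite[m]` is an assumed per-block edge structure, and `down_blocks[m]`/`up_blocks[m]` are assumed per-size two-input MLPs that concatenate their arguments:

```python
import torch

def block_aggregate(src_h, edges, n_out, to_small):
    """Sum-pool messages across one bipartite block; edges = (small_idx, large_idx)."""
    small, large = edges
    out = torch.zeros(n_out, src_h.size(-1))
    if to_small:                                   # messages land on m-sets
        return out.index_add_(0, small, src_h[large])
    return out.index_add_(0, large, src_h[small])  # messages land on (m+1)-sets

def setgnn_star_layer(h, bipartite, k, down_blocks, up_blocks):
    """h[m]: embeddings of all m-sets. One backward then one forward sweep,
    each block reusing the just-updated neighboring block (Eqs. (13)-(14))."""
    h_half = {k: h[k]}                             # k-sets have no right neighbors
    for m in range(k - 1, 0, -1):                  # Eq. (13): k-1 down to 1
        msg = block_aggregate(h_half[m + 1], bipartite[m], len(h[m]), True)
        h_half[m] = down_blocks[m](h[m], msg)
    h_new = {1: h_half[1]}
    for m in range(2, k + 1):                      # Eq. (14): 2 up to k
        msg = block_aggregate(h_new[m - 1], bipartite[m - 1], len(h[m]), False)
        h_new[m] = up_blocks[m](h_half[m], msg)
    return h_new
```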

Theorem 7.

For any $t\in\mathbb{N}$, the $t$-layer $(k,c)(\leq)$-SetGNN$^*$ is more expressive than the $t$-layer $(k,c)(\leq)$-SetGNN. As $t\to\infty$, $(k,c)(\leq)$-SetGNN$^*$ is as expressive as $(k,c)(\leq)$-SetGNN.

4.3 Improving Supernode Initialization

Next we describe how to improve the supernode initialization (Eq. (9)) and substantially reduce computational and memory overhead for $c>1$, without losing any expressivity. We exploit the fact that a graph with $c$ components can be viewed as a set of $c$ connected components. Formally,

Theorem 8.

Let $G$ be a graph with $c$ connected components $C_1,\ldots,C_c$, and $G'$ be a graph also with $c$ connected components $C'_1,\ldots,C'_c$; then $G$ and $G'$ are isomorphic if and only if there exists $p:[c]\rightarrow[c]$ such that for all $i\in[c]$, $C_i$ and $C'_{p(i)}$ are isomorphic.

The theorem implies that we only need to apply Eq. (9) to supernodes with a single component; for any $c$-component set $\hat{v}$ with components $\hat{u}_1,\ldots,\hat{u}_c$, we can obtain its encoding as $h^{(0)}(\hat{v})=\text{DeepSet}(\{h^{(0)}(\hat{u}_1),\ldots,h^{(0)}(\hat{u}_c)\})$ without passing its induced subgraph to BaseGNN. This eliminates the heavy computation of passing a large number of induced subgraphs through the BaseGNN. We give the algorithm for building the connections between $\hat{v}$ and its single components in Appendix A.6.
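A sketch (ours) of this initialization: BaseGNN embeds only single-component supernodes, and a DeepSet composes the rest; `comp_index` is an assumed lookup from each multi-component supernode to the row ids of its component supernodes:

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    """Permutation-invariant set encoder: rho(sum_i phi(x_i))."""
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(d, d), nn.ReLU())

    def forward(self, xs):                        # xs: [num_components, d]
        return self.rho(self.phi(xs).sum(dim=0))  # sum makes it order-invariant

def init_supernodes(single_h, comp_index, deepset):
    """single_h: [num_single_component_sets, d] from BaseGNN (Eq. (9));
    comp_index: dict supernode_id -> list of component row ids in single_h."""
    return {v: deepset(single_h[torch.tensor(ids)]) for v, ids in comp_index.items()}
```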

5 Experiments

We design experiments to answer the following questions. Q1. Performance: How does $(k,c)(\leq)$-SetGNN compare to SOTA expressive GNNs? Q2. Varying $k$ and $c$: How does the progressively increasing expressiveness reflect on generalization performance? Q3. Computational requirements: Is $(k,c)(\leq)$-SetGNN feasible on practical graphs w.r.t. running time and memory usage?

5.1 Setup

Datasets.  To inspect the expressive power, we use four different types of simulation datasets: 1) EXP [1] contains 600 pairs of graphs on which 1- and 2-WL fail, which we split such that the two graphs in each pair are assigned to different classes; 2) SR25 [6] has 15 strongly regular graphs (on which 3-WL fails) with 25 nodes each, which we transform into a 15-way classification task; 3) substructure counting (i.e. triangle, tailed triangle, star and 4-cycle) tasks on a random graph dataset [15]; 4) graph property regression (i.e. connectedness, diameter, radius) tasks on a random graph dataset [17]. We also evaluate performance on two real-world graph learning tasks: 5) ZINC-12K [19] and 6) QM9 [59] for molecular property regression. See Table 6 in Appendix A.7 for detailed dataset statistics.

Baselines.  We use GCN [36], GIN [61], PNA [17], PPGN [40], PF-GNN [18], and GNN-AK [68] as baselines on the simulation datasets. On ZINC-12K we also reference CIN [9] directly from the literature. Most baseline results are taken from [68]. Finally, we compare to GINE [32] on QM9. Hyperparameters and model configurations are described in Appendix A.8.

5.2 Results

Theorem 2 shows that $k$-MultisetWL is able to distinguish CFI($k$) graphs. This also holds for $k(\leq)$-SetWL, as the proof does not use repetitions. We implemented the construction of CFI($k$) graphs for any $k$. Empirically, we found that $(k,c)(\leq)$-SetGNN is able to distinguish CFI($k$) for $k=3,4,5$ ($k\geq 6$ runs out of memory). This empirically verifies its theoretical expressivity.

Table 1 shows the performance results on all the simulation datasets. Models with low expressivity such as GCN, GIN and PNA underperform on EXP and SR25, while the 3-WL-equivalent PPGN excels only on EXP. Notably, $(k,c)(\leq)$-SetGNN achieves 100% discriminating power on EXP with any $k>2$ and $c>1$, and on SR25 with any $k>3$ and $c>0$. $(k,c)(\leq)$-SetGNN also outperforms the baselines on all substructure counting tasks, as well as on two out of three graph property prediction tasks (except Radius) with a significant gap, for relatively small values of $k$ and $c$.

Table 1: Simulation data performances. For $(k,c)(\leq)$-SetGNN, the $(k,c)$ values that achieve the reported performance are shown in parentheses. (ACC: accuracy, MA[S]E: mean abs.[sq.] error)
Method EXP (ACC) SR25 (ACC) Counting Substructures (MAE) Graph Properties ($\log_{10}$(MSE))
Triangle Tailed Tri. Star 4-Cycle IsConnected Diameter Radius
GCN 50% 6.67% 0.4186 0.3248 0.1798 0.2822 -1.7057 -2.4705 -3.9316
GIN 50% 6.67% 0.3569 0.2373 0.0224 0.2185 -1.9239 -3.3079 -4.7584
PNA 50% 6.67% 0.3532 0.2648 0.1278 0.2430 -1.9395 -3.4382 -4.9470
PPGN 100% 6.67% 0.0089 0.0096 0.0148 0.0090 -1.9804 -3.6147 -5.0878
GIN-AK+ 100% 6.67% 0.0123 0.0112 0.0150 0.0126 -2.7513 -3.9687 -5.1846
PNA-AK+ 100% 6.67% 0.0118 0.0138 0.0166 0.0132 -2.6189 -3.9011 -5.2026
$(k,c)(\leq)$-SetGNN 100% 100% 0.0073 0.0075 0.0134 0.0075 -5.4667 -4.0800 -5.1603
(≥3, ≥2) (≥4, ≥1) (3, 2) (4, 1) (3, 2) (4, 1) (4, 2) (4, 1) (2, 2)
Table 2: Train and Test performances on substructure counting tasks by varying $k$ and $c$. Notice the orders-of-magnitude drop in Test MAE between bolded entries per task.
Counting Substructures (MAE)
Triangle Tailed Tri. Star 4-Cycle
k c Train Test Train Test Train Test Train Test
2 1 0.9941 ± 0.2623 1.1409 ± 0.1224 1.1506 ± 0.2542 0.8695 ± 0.0781 1.5348 ± 2.0697 2.3454 ± 0.8198 1.2159 ± 0.0292 0.8361 ± 0.1171
3 1 0.0311 ± 0.0025 0.0088 ± 0.0001 0.0303 ± 0.0108 0.0085 ± 0.0018 0.0559 ± 0.0019 0.0151 ± 0.0006 0.1351 ± 0.0058 0.1893 ± 0.0030
4 1 0.0321 ± 0.0008 0.0151 ± 0.0074 0.0307 ± 0.0085 0.0075 ± 0.0012 0.0687 ± 0.0104 0.0339 ± 0.0009 0.0349 ± 0.0007 0.0075 ± 0.0002
5 1 0.0302 ± 0.0070 0.0208 ± 0.0042 0.0553 ± 0.0009 0.0189 ± 0.0024 0.0565 ± 0.0078 0.0263 ± 0.0023 0.0377 ± 0.0057 0.0175 ± 0.0036
6 1 0.0344 ± 0.0024 0.0247 ± 0.0085 0.0357 ± 0.0017 0.0171 ± 0.0000 0.0560 ± 0.0000 0.0168 ± 0.0022 0.0356 ± 0.0014 0.0163 ± 0.0064
2 2 0.3452 ± 0.0329 0.4029 ± 0.0053 0.2723 ± 0.0157 0.2898 ± 0.0055 0.0466 ± 0.0025 0.0242 ± 0.0006 0.2369 ± 0.0123 0.2512 ± 0.0029
3 2 0.0234 ± 0.0030 0.0073 ± 0.0009 0.0296 ± 0.0074 0.0100 ± 0.0009 0.0640 ± 0.0003 0.0134 ± 0.0006 0.0484 ± 0.0135 0.0194 ± 0.0065
4 2 0.0587 ± 0.0356 0.0131 ± 0.0010 0.0438 ± 0.0140 0.0094 ± 0.0002 0.0488 ± 0.0008 0.0209 ± 0.0063 0.0464 ± 0.0037 0.0110 ± 0.0020

In Table 2, we show the train and test MAEs on substructure counting tasks for individual values of $k$ and $c$. As expected, performances improve with increasing $k$ when $c$ is fixed, and vice versa. Notably, orders-of-magnitude improvements in test error occur moving from $k{=}2$ to $3$ for the triangle tasks as well as the star task, while a similarly large drop is obtained at $k{=}4$ for the 4-cycle task; this is expected, as triangles and 4-cycles have 3 and 4 nodes respectively. Similar observations hold for the graph property tasks as well (see Appendix A.9).

Table 3: SetGNN achieves SOTA on ZINC-12K.
Method MAE
GatedGCN 0.363 ± 0.009
GCN 0.321 ± 0.009
PNA 0.188 ± 0.004
DGN 0.168 ± 0.003
GIN 0.163 ± 0.004
GINE 0.157 ± 0.004
HIMP 0.151 ± 0.006
PNA 0.140 ± 0.006
GSN 0.115 ± 0.012
PF-GNN 0.122 ± 0.010
GIN-AK+ 0.080 ± 0.001
CIN 0.079 ± 0.006
$(k,c)(\leq)$-SetGNN 0.0750 ± 0.0027

In addition to the simulation datasets, we evaluate our $(k,c)(\leq)$-SetGNN on real-world data; Table 3 shows our performance on ZINC-12K. Our method achieves a new state-of-the-art performance, with a mean absolute error (MAE) of 0.0750, using $k{=}5$ and $c{=}2$.

In Table 4 we show the test and validation MAE along with the training loss for varying $k$ and $c$ on ZINC-12K. For a fixed $c$, validation MAE and training loss both first decay and later increase with increasing $k$, potentially owing to the difficulty of fitting with too many sets (i.e. supernodes) and edges in the super-graph.

Similar results are shown for QM9 in Table 5. For comparison we also show GINE performances using both 4 and 6 layers, both of which are significantly worse than $(k,c)(\leq)$-SetGNN.

Table 4: $(k,c)(\leq)$-SetGNN performances on ZINC-12K by varying ($k$,$c$). Test MAE at lowest Val. MAE, and lowest Test MAE.
$k$ $c$ Train loss Val. MAE Test MAE
2 1 0.1381 ± 0.0240 0.2429 ± 0.0071 0.2345 ± 0.0131
3 1 0.1172 ± 0.0063 0.2298 ± 0.0060 0.2252 ± 0.0030
4 1 0.0693 ± 0.0111 0.1645 ± 0.0052 0.1636 ± 0.0052
5 1 0.0643 ± 0.0019 0.1593 ± 0.0051 0.1447 ± 0.0013
6 1 0.0519 ± 0.0064 0.0994 ± 0.0093 0.0843 ± 0.0048
7 1 0.0543 ± 0.0048 0.0965 ± 0.0061 0.0747 ± 0.0022
8 1 0.0564 ± 0.0152 0.0961 ± 0.0043 0.0732 ± 0.0037
9 1 0.0817 ± 0.0274 0.0909 ± 0.0094 0.0824 ± 0.0056
10 1 0.0894 ± 0.0266 0.1060 ± 0.0157 0.0950 ± 0.0102
2 2 0.1783 ± 0.0602 0.2913 ± 0.0102 0.2948 ± 0.0210
3 2 0.0640 ± 0.0072 0.1668 ± 0.0078 0.1391 ± 0.0102
4 2 0.0499 ± 0.0043 0.1029 ± 0.0033 0.0836 ± 0.0010
5 2 0.0483 ± 0.0017 0.0899 ± 0.0056 0.0750 ± 0.0027
6 2 0.0530 ± 0.0064 0.0927 ± 0.0050 0.0737 ± 0.0006
7 2 0.0547 ± 0.0036 0.0984 ± 0.0047 0.0784 ± 0.0043
3 3 0.0798 ± 0.0062 0.1881 ± 0.0076 0.1722 ± 0.0086
4 3 0.0565 ± 0.0059 0.1121 ± 0.0066 0.0869 ± 0.0026
5 3 0.0671 ± 0.0156 0.1091 ± 0.0097 0.0920 ± 0.0054
Table 5: $(k,c)(\leq)$-SetGNN performances on QM9 by varying ($k$,$c$). Test MAE at lowest Val. MAE, and lowest Test MAE. All variances are ${\leq}0.002$ and thus omitted.
$k$ $c$ Train loss Val. MAE Test MAE
2 1 0.0376 ± 0.0005 0.0387 ± 0.0007 0.0389 ± 0.0008
3 1 0.0308 ± 0.0010 0.0386 ± 0.0017 0.0379 ± 0.0010
4 1 0.0338 ± 0.0003 0.0371 ± 0.0005 0.0370 ± 0.0006
5 1 0.0299 ± 0.0017 0.0343 ± 0.0008 0.0341 ± 0.0009
6 1 0.0226 ± 0.0004 0.0296 ± 0.0007 0.0293 ± 0.0007
7 1 0.0208 ± 0.0005 0.0289 ± 0.0007 0.0269 ± 0.0003
2 2 0.0367 ± 0.0007 0.0398 ± 0.0004 0.0398 ± 0.0004
3 2 0.0282 ± 0.0013 0.0358 ± 0.0009 0.0356 ± 0.0007
4 2 0.0219 ± 0.0004 0.0280 ± 0.0008 0.0278 ± 0.0008
5 2 0.0175 ± 0.0003 0.0267± 0.0005 0.0251± 0.0006
3 3 0.0391 ± 0.0107 0.0428 ± 0.0057 0.0425 ± 0.0052
4 3 0.0219 ± 0.0011 0.0301 ± 0.0010 0.0286 ± 0.0004
GINE ($L{=}4$) 0.0507 ± 0.0014 0.0478 ± 0.0003 0.0479 ± 0.0004
GINE ($L{=}6$) 0.0440 ± 0.0009 0.0440 ± 0.0009 0.0451 ± 0.0009

5.3 Computational requirements

(a) Memory usage. (b) Training time.
Figure 2: $(k,c)(\leq)$-SetGNN’s footprint scales practically with both $k$ and $c$ in memory (a) and running time (b) – results on ZINC-12K.

We next investigate how increasing $k$ and $c$ changes the computational footprint of $(k,c)(\leq)$-SetGNN in practice. Fig. 2 shows that increasing these parameters expectedly increases both memory consumption (in MB, in (a)) and runtime (in seconds per epoch, in (b)). Notably, since larger $k$ and $c$ increase our model’s expressivity, suitable choices of $k$ and $c$ allow us to practically realize these increasingly expressive models on commodity hardware. With conservative values of $c$ (e.g. $c{=}1$), we are able to pass messages between sets of $k$ (e.g. 10) nodes, far beyond what $k$-WL-style higher-order models can achieve ($k\leq 3$).

6 Conclusion

Our work is motivated by the impracticality of higher-order GNN models based on the $k$-WL hierarchy, which makes it challenging to study how much expressiveness real-world tasks truly necessitate. To this end, we proposed $(k,c)(\leq)$-SetWL, a more practical and progressively-expressive hierarchy with theoretical connections to $k$-WL and drastically lowered complexity. We also designed and implemented a practical model, $(k,c)(\leq)$-SetGNN($^*$), whose expressiveness is gradually increased by larger $k$ and $c$. Our model achieves strong performance, including several new best results on graph-level tasks like ZINC-12K and expressiveness-relevant tasks like substructure counting, while being practically trainable on commodity hardware.

References

  • Abboud et al. [2021] R. Abboud, İ. İ. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the Thirtieth International Joint Conference on Artifical Intelligence (IJCAI), 2021.
  • Alon and Yahav [2021] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=i80OPhOCVH2.
  • Arvind et al. [2020] V. Arvind, F. Fuhlbrück, J. Köbler, and O. Verbitsky. On weisfeiler-leman invariance: Subgraph counts and related graph properties. Journal of Computer and System Sciences, 113:42–59, 2020.
  • Azizian and Lelarge [2020] W. Azizian and M. Lelarge. Expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646, 2020.
  • Babai et al. [1980] L. Babai, P. Erdos, and S. M. Selkow. Random graph isomorphism. SIAM Journal on Computing, 9(3):628–635, 1980.
  • Balcilar et al. [2021] M. Balcilar, P. Héroux, B. Gauzere, P. Vasseur, S. Adam, and P. Honeine. Breaking the limits of message passing graph neural networks. In International Conference on Machine Learning, pages 599–608. PMLR, 2021.
  • Battaglia et al. [2018] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Bevilacqua et al. [2021] B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.
  • Bodnar et al. [2021] C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and lehman go cellular: Cw networks. Advances in Neural Information Processing Systems, 34:2625–2640, 2021.
  • Bouritsas et al. [2022] G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Cai et al. [1992] J.-Y. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
  • Chen et al. [2020a] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3438–3445, 2020a.
  • Chen et al. [2020b] L. Chen, Z. Chen, and J. Bruna. On graph neural networks versus graph-augmented mlps. arXiv preprint arXiv:2010.15116, 2020b.
  • Chen et al. [2019] Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1g0Z3A9Fm.
  • Chen et al. [2020c] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? Advances in neural information processing systems, 33:10383–10395, 2020c.
  • Choudhary and DeCost [2021] K. Choudhary and B. DeCost. Atomistic line graph neural network for improved materials property predictions. npj Computational Materials, 7(1):1–8, 2021.
  • Corso et al. [2020] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, volume 33, pages 13260–13271, 2020.
  • Dupty et al. [2022] M. H. Dupty, Y. Dong, and W. S. Lee. PF-GNN: Differentiable particle filtering based approximation of universal graph representations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oh4TirnfSem.
  • Dwivedi et al. [2020] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
  • Fey and Lenssen [2019] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Frasca et al. [2022] F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron. Understanding and extending subgraph gnns by rethinking their symmetries. arXiv preprint arXiv:2206.11140, 2022.
  • Garg et al. [2020] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.
  • Geerts [2020] F. Geerts. The expressive power of kth-order invariant graph networks. arXiv preprint arXiv:2007.12035, 2020.
  • Geerts and Reutter [2022] F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=wIzUeM3TAU.
  • Gilmer et al. [2017] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017.
  • Grohe [2021] M. Grohe. The logic of graph neural networks. In 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1–17. IEEE, 2021.
  • Grohe [2022] M. Grohe. Multiset WL. Personal communication, 2022.
  • Grohe and Otto [2015] M. Grohe and M. Otto. Pebble games and linear equations. The Journal of Symbolic Logic, 80(3):797–844, 2015.
  • Hawkins [2004] D. M. Hawkins. The problem of overfitting. Journal of chemical information and computer sciences, 44(1):1–12, 2004.
  • Hella [1996] L. Hella. Logical hierarchies in ptime. Information and Computation, 129(1):1–19, 1996.
  • Hornik et al. [1989] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Hu et al. [2020] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJlWWJSFDH.
  • Keriven and Peyré [2019] N. Keriven and G. Peyré. Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Kim et al. [2022] J. Kim, T. D. Nguyen, S. Min, S. Cho, M. Lee, H. Lee, and S. Hong. Pure transformers are powerful graph learners. arXiv, abs/2207.02505, 2022. URL https://arxiv.org/abs/2207.02505.
  • Kipf and Welling [2016] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Kipf and Welling [2017] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. [2018] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Loukas [2020] A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l2bp4YwS.
  • Ma et al. [2021] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
  • Maron et al. [2019a] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. Advances in neural information processing systems, 32, 2019a.
  • Maron et al. [2019b] H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the universality of invariant networks. In International conference on machine learning, pages 4363–4371. PMLR, 2019b.
  • Morris et al. [2017] C. Morris, K. Kersting, and P. Mutzel. Glocalized weisfeiler-lehman graph kernels: Global-local feature maps of graphs. In 2017 IEEE International Conference on Data Mining (ICDM), pages 327–336. IEEE, 2017.
  • Morris et al. [2019] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4602–4609, 2019.
  • Morris et al. [2020] C. Morris, G. Rattan, and P. Mutzel. Weisfeiler and leman go sparse: Towards scalable higher-order graph embeddings. Advances in Neural Information Processing Systems, 33:21824–21840, 2020.
  • Morris et al. [2022] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. Speqnets: Sparsity-aware permutation-equivariant graph networks. arXiv preprint arXiv:2203.13913, 2022.
  • Murphy et al. [2019] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Relational pooling for graph representations. In International Conference on Machine Learning, pages 4663–4673. PMLR, 2019.
  • Oono and Suzuki [2019] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.
  • Qian et al. [2022] C. Qian, G. Rattan, F. Geerts, C. Morris, and M. Niepert. Ordered subgraph aggregation networks. arXiv preprint arXiv:2206.11168, 2022.
  • Read and Corneil [1977] R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of graph theory, 1(4):339–363, 1977.
  • Rossi et al. [2020] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti. Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.
  • Sankar et al. [2021] A. Sankar, Y. Liu, J. Yu, and N. Shah. Graph neural networks for friend ranking in large-scale social platforms. In Proceedings of the Web Conference 2021, pages 2535–2546, 2021.
  • Sato [2020] R. Sato. A survey on the expressive power of graph neural networks. CoRR, abs/2003.04078, 2020. URL http://dblp.uni-trier.de/db/journals/corr/corr2003.html#abs-2003-04078.
  • Sato et al. [2021] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 333–341. SIAM, 2021.
  • Shervashidze et al. [2011] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.
  • Topping et al. [2021] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.
  • Veličković et al. [2017] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • Weisfeiler [1976] B. Weisfeiler. On construction and identification of graphs. In LECTURE NOTES IN MATHEMATICS. Citeseer, 1976.
  • Wu et al. [2020a] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui. Graph neural networks in recommender systems: a survey. ACM Computing Surveys (CSUR), 2020a.
  • Wu et al. [2018] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • Wu et al. [2020b] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020b.
  • Xu et al. [2019] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How Powerful are Graph Neural Networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR ’19, pages 1–17, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • You et al. [2019] J. You, R. Ying, and J. Leskovec. Position-aware graph neural networks. In International Conference on Machine Learning, pages 7134–7143. PMLR, 2019.
  • You et al. [2021] J. You, J. Gomes-Selman, R. Ying, and J. Leskovec. Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320, 2021.
  • Zaheer et al. [2017] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. Advances in neural information processing systems, 30, 2017.
  • Zhang and Li [2021] M. Zhang and P. Li. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.
  • Zhang et al. [2020] M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin. Revisiting graph neural networks for link prediction. 2020.
  • Zhao and Akoglu [2020] L. Zhao and L. Akoglu. Pairnorm: Tackling oversmoothing in {gnn}s. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.
  • Zhao et al. [2022] L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any gnn with local structure awareness. ICLR, 2022.

Appendix A Appendix

A.1 Visualization of the Supergraph Connection of 2-SetWL

Figure 3: Comparison between 2-SetWL and 2-WL in their supergraph connection.

A.2 Bijective k-Pebble Game for k-WL

The pebble game characterization of $k$-FWL appeared in [11]; we use the pebble game defined in [28] for $k$-WL. Let $G,G'$ be graphs and $\overrightarrow{\boldsymbol{v}}_{0},\overrightarrow{\boldsymbol{v}}'_{0}$ be $k$-tuples. The bijective $k$-pebble game $\text{BP}_{k}(G,\overrightarrow{\boldsymbol{v}}_{0},G',\overrightarrow{\boldsymbol{v}}'_{0})$ is defined as follows. The game is played by two players, Player I and Player II. It proceeds in rounds, starting from the initial position $(\overrightarrow{\boldsymbol{v}}_{0},\overrightarrow{\boldsymbol{v}}'_{0})$ and continuing to a new position $(\overrightarrow{\boldsymbol{v}}_{t},\overrightarrow{\boldsymbol{v}}'_{t})$ as long as, after $t$ rounds, $\overrightarrow{\boldsymbol{v}}_{t}[i]\mapsto\overrightarrow{\boldsymbol{v}}'_{t}[i]$ defines an isomorphism between $G[\overrightarrow{\boldsymbol{v}}_{t}]$ and $G'[\overrightarrow{\boldsymbol{v}}'_{t}]$. If the game has not ended after $t$ rounds, Player II wins the $t$-step $\text{BP}_{k}(G,\overrightarrow{\boldsymbol{v}}_{0},G',\overrightarrow{\boldsymbol{v}}'_{0})$; otherwise Player I wins.

The $t$-th round is played as follows.

  1. Player I picks up the $i$-th pair of pebbles, for some $i\in[k]$.

  2. Player II chooses a bijection $f:V(G)\to V(G')$.

  3. Player I chooses $x\in V(G)$.

  4. The new position is $(\overrightarrow{\boldsymbol{v}}_{t}[x/i],\overrightarrow{\boldsymbol{v}}'_{t}[f(x)/i])$. If $G[\overrightarrow{\boldsymbol{v}}_{t}[x/i]]$ and $G'[\overrightarrow{\boldsymbol{v}}'_{t}[f(x)/i]]$ are still isomorphic, the game continues; otherwise the game ends and Player II loses.

The game $\text{BP}_{k}(G,\overrightarrow{\boldsymbol{v}}_{0},G',\overrightarrow{\boldsymbol{v}}'_{0})$ has the same expressivity as $k$-WL in distinguishing $(G,\overrightarrow{\boldsymbol{v}}_{0})$ and $(G',\overrightarrow{\boldsymbol{v}}'_{0})$; a minimal sketch of one round appears below.
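The following minimal sketch (plain Python with networkx; the helper names are ours, not from [28]) implements one round of $\text{BP}_{k}$: the substitution $\overrightarrow{\boldsymbol{v}}_{t}[x/i]$ and the check that $\overrightarrow{\boldsymbol{v}}[i]\mapsto\overrightarrow{\boldsymbol{v}}'[i]$ defines an isomorphism of the pebbled induced subgraphs.

import networkx as nx

def substitute(tup, x, i):
    # Return the k-tuple tup with its i-th entry replaced by x.
    t = list(tup)
    t[i] = x
    return tuple(t)

def position_ok(G, v, H, w):
    # Does v[i] -> w[i] define an isomorphism between G[v] and H[w]?
    for a, b in zip(v, w):
        for c, d in zip(v, w):
            if (a == c) != (b == d):                  # equalities must agree
                return False
            if G.has_edge(a, c) != H.has_edge(b, d):  # adjacencies must agree
                return False
    return True

# One round: Player I picks i, Player II a bijection f, Player I picks x.
G = H = nx.cycle_graph(6)
v, w, i, x = (0, 1, 2), (3, 4, 5), 0, 5
f = {u: (u + 3) % 6 for u in G}  # rotation by 3 is an automorphism of C6
print(position_ok(G, substitute(v, x, i), H, substitute(w, f[x], i)))  # True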

Theorem 9.

$k$-WL cannot distinguish $(G,\overrightarrow{\boldsymbol{v}}_{0})$ and $(G',\overrightarrow{\boldsymbol{v}}'_{0})$ at step $t$, i.e., $\bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}}_{0})=\bm{wl}_{k}^{(t)}(G',\overrightarrow{\boldsymbol{v}}'_{0})$, if and only if Player II has a winning strategy for the $t$-step $\text{BP}_{k}(G,\overrightarrow{\boldsymbol{v}}_{0},G',\overrightarrow{\boldsymbol{v}}'_{0})$.

There is an extension of the pebble game, $\text{BP}_{k}(G,G')$, without a specified starting position. At the beginning of the game, Player II is first asked to provide a bijection $g:V(G)^{k}\to V(G')^{k}$; Player I then chooses $\overrightarrow{\boldsymbol{v}}_{0}\in V(G)^{k}$, and the two players play $\text{BP}_{k}(G,\overrightarrow{\boldsymbol{v}}_{0},G',g(\overrightarrow{\boldsymbol{v}}_{0}))$.

Theorem 10.

$k$-WL cannot distinguish $G$ and $G'$ at step $t$, i.e., $\bm{gwl}_{k}^{(t)}(G)=\bm{gwl}_{k}^{(t)}(G')$, if and only if Player II has a winning strategy for the $t$-step $\text{BP}_{k}(G,G')$.

The proofs of Theorem 9 and Theorem 10 can be found in Hella's work [30].

A.3 Doubly Bijective k-Pebble Game for k-MultisetWL

Let $G,G'$ be graphs and $\tilde{\boldsymbol{v}}_{0},\tilde{\boldsymbol{v}}'_{0}$ be $k$-multisets. We adapt the $\text{BP}_{k}$ game to $k$-MultisetWL and call the result the doubly bijective $k$-pebble game, $\text{DBP}_{k}(G,\tilde{\boldsymbol{v}}_{0},G',\tilde{\boldsymbol{v}}'_{0})$. A similar version of the pebble game for $k$-MultisetWL was suggested by Grohe [27].

$\text{DBP}_{k}(G,\tilde{\boldsymbol{v}}_{0},G',\tilde{\boldsymbol{v}}'_{0})$ is defined as follows. The game starts at position $(\tilde{\boldsymbol{v}}_{0},\tilde{\boldsymbol{v}}'_{0})$.

Let the current position be $(\tilde{\boldsymbol{v}}_{t},\tilde{\boldsymbol{v}}'_{t})$. The $t$-th round is played as follows.

  1. Player II chooses a bijection $h:\tilde{\boldsymbol{v}}_{t}\to\tilde{\boldsymbol{v}}'_{t}$.

  2. Player I chooses $y\in\tilde{\boldsymbol{v}}_{t}$.

  3. Player II chooses a bijection $f:V(G)\to V(G')$.

  4. Player I chooses $x\in V(G)$.

  5. The new position is $(\tilde{\boldsymbol{v}}_{t}[x/\text{idx}_{\tilde{\boldsymbol{v}}_{t}}(y)],\tilde{\boldsymbol{v}}'_{t}[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'_{t}}(h(y))])$ and the game continues if $G[\tilde{\boldsymbol{v}}_{t}[x/\text{idx}_{\tilde{\boldsymbol{v}}_{t}}(y)]]$ and $G'[\tilde{\boldsymbol{v}}'_{t}[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'_{t}}(h(y))]]$ are isomorphic. Otherwise the game ends and Player II loses.

To extend the game to the setting with no starting position, we define the corresponding game $\text{DBP}_{k}(G,G')$. Similar to $\text{BP}_{k}(G,G')$, at the beginning Player II is asked to choose a bijection $g:\text{Multiset}(V(G)^{k})\to\text{Multiset}(V(G')^{k})$; Player I then picks a $\tilde{\boldsymbol{v}}_{0}$ and the game $\text{DBP}_{k}(G,\tilde{\boldsymbol{v}}_{0},G',g(\tilde{\boldsymbol{v}}_{0}))$ starts.

Theorem 3.

$k$-MultisetWL has the same expressivity as the doubly bijective $k$-pebble game.

Proof.

Specifically, given graph $G$ with $k$-multiset $\tilde{\boldsymbol{v}}$ and graph $G'$ with $k$-multiset $\tilde{\boldsymbol{v}}'$, we prove that $\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}')$ $\Longleftrightarrow$ Player II has a winning strategy for the $t$-step game $\text{DBP}_{k}(G,\tilde{\boldsymbol{v}},G',\tilde{\boldsymbol{v}}')$.

We prove this by induction on $t$. When $t=0$ the statement clearly holds, as $\bm{mwl}_{k}^{(0)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(0)}(G',\tilde{\boldsymbol{v}}')$ is equivalent to $G[\tilde{\boldsymbol{v}}]$ and $G'[\tilde{\boldsymbol{v}}']$ being isomorphic, which implies that Player II can start the game without losing. Now assume the statement holds for all $t\leq l$. For step $t=l+1$, we first prove the left $\Longrightarrow$ right direction.

Suppose $\bm{mwl}_{k}^{(l+1)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(l+1)}(G',\tilde{\boldsymbol{v}}')$. By Eq. (3) this is equivalent to:
(1) $\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}')$;
(2) $\exists$ a bijection $h:\tilde{\boldsymbol{v}}\to\tilde{\boldsymbol{v}}'$ such that $\forall y\in\tilde{\boldsymbol{v}}$, $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}'[x/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])\mid x\in V(G')\}\mskip-5.0mu\}$.

Now start the $\text{DBP}_{k}$ game at position $(\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')$. By (2) there exists an $h$ satisfying (2); in the first round, Player II picks this $h$ as the bijection. Next, Player I chooses some $y\in\tilde{\boldsymbol{v}}$. According to (2), for this $h$ and $y$ we have $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}'[x/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])\mid x\in V(G')\}\mskip-5.0mu\}$. This implies that there exists a bijection $f:V(G)\to V(G')$ such that $\forall x\in V(G)$, $\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])=\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}'[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])$; let Player II pick this $f$. Player I now chooses some $x\in V(G)$. Equality of the $l$-step colors implies equality of the $0$-step colors, hence $G[\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)]]$ and $G'[\tilde{\boldsymbol{v}}'[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))]]$ are isomorphic, so the game does not end. The next round starts at position $(\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)],\tilde{\boldsymbol{v}}'[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])$, whose $l$-step colors agree. By the inductive hypothesis, Player II has a strategy to play $\text{DBP}_{k}$ from this new position for $l$ rounds; hence Player II has a winning strategy to play $\text{DBP}_{k}$ from the original position $(\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')$ for $l+1$ rounds.

Next we prove right $\Longrightarrow$ left by showing that $\bm{mwl}_{k}^{(l+1)}(G,\tilde{\boldsymbol{v}})\neq\bm{mwl}_{k}^{(l+1)}(G',\tilde{\boldsymbol{v}}')$ $\Longrightarrow$ Player I has a winning strategy for the $(l+1)$-step game $\text{DBP}_{k}(G,\tilde{\boldsymbol{v}},G',\tilde{\boldsymbol{v}}')$.

$\bm{mwl}_{k}^{(l+1)}(G,\tilde{\boldsymbol{v}})\neq\bm{mwl}_{k}^{(l+1)}(G',\tilde{\boldsymbol{v}}')$ implies:
(1) $\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}})\neq\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}')$;
or (2) for any bijection $h:\tilde{\boldsymbol{v}}\to\tilde{\boldsymbol{v}}'$, $\exists y\in\tilde{\boldsymbol{v}}$ such that $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\mid x\in V(G)\}\mskip-5.0mu\}\neq\{\mskip-5.0mu\{\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}'[x/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])\mid x\in V(G')\}\mskip-5.0mu\}$.

If (1) holds, then by induction Player I has a winning strategy within $l$ steps, hence also within $l+1$ steps. If (2) holds, then in the first round, after Player II picks a bijection $h$, Player I can choose the specific $y\in\tilde{\boldsymbol{v}}$ for which the two multisets in (2) differ. Then no matter which bijection $f:V(G)\to V(G')$ Player II chooses, Player I can always choose an $x\in V(G)$ such that $\bm{mwl}_{k}^{(l)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\neq\bm{mwl}_{k}^{(l)}(G',\tilde{\boldsymbol{v}}'[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])$. By induction, Player I has a winning strategy within $l$ steps for the $\text{DBP}_{k}$ game starting at position $(\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)],\tilde{\boldsymbol{v}}'[f(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(h(y))])$. Hence even if Player I does not win in the first round, they can still win within $l+1$ rounds.

Combining both directions, the equivalence between $t$-step $\text{DBP}_{k}$ and $t$-step $k$-MultisetWL holds for any $t$ and any $k$-multisets. ∎

Theorem 11.

$k$-MultisetWL cannot distinguish $G$ and $G'$ at step $t$, i.e., $\bm{gmwl}_{k}^{(t)}(G)=\bm{gmwl}_{k}^{(t)}(G')$, if and only if Player II has a winning strategy for the $t$-step $\text{DBP}_{k}(G,G')$.

Proof.

The proof is straightforward using the argument inside the proof of Theorem 3. We only need to show that pooling all $k$-multiset representations is equivalent to Player II providing a bijection $g:\text{Multiset}(V(G)^{k})\to\text{Multiset}(V(G')^{k})$ in the first step of the game; we omit the details given their simplicity. ∎

A.4 Proofs of Theorems

A.4.1 Bound of Summation of Binomial Coefficients

Derivation of:

$$\sum_{i=1}^{k}\binom{n}{i}\leq\binom{n}{k}\,\frac{n-k+1}{n-2k+1} \qquad (15)$$

Proof.

$$\frac{\sum_{i=1}^{k}\binom{n}{i}}{\binom{n}{k}}=\frac{\binom{n}{k}+\binom{n}{k-1}+\binom{n}{k-2}+\cdots}{\binom{n}{k}}=1+\frac{k}{n-k+1}+\frac{k(k-1)}{(n-k+1)(n-k+2)}+\cdots\leq 1+\frac{k}{n-k+1}+\Big(\frac{k}{n-k+1}\Big)^{2}+\cdots\leq\frac{n-k+1}{n-2k+1}.$$

The final step bounds the series by the geometric series with ratio $\frac{k}{n-k+1}$, which sums to $\frac{n-k+1}{n-2k+1}$ when $n>2k-1$. ∎
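The bound is easy to sanity-check numerically; the snippet below (plain Python) does so for a few $(n,k)$ pairs, assuming $n>2k-1$ so the right-hand side is positive.

from math import comb

for n, k in [(20, 5), (50, 10), (100, 3)]:
    lhs = sum(comb(n, i) for i in range(1, k + 1))
    rhs = comb(n, k) * (n - k + 1) / (n - 2 * k + 1)
    assert lhs <= rhs, (n, k)  # Eq. (15) holds on these instances
    print(n, k, lhs, round(rhs, 1))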

A.4.2 Proof of Theorem 1

Theorem 1.

Let $k\geq 1$ and $\bm{wl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}):=\{\mskip-5.0mu\{\bm{wl}_{k}^{(t)}(G,p(\tilde{\boldsymbol{v}}))\mid p\in\text{perm}[k]\}\mskip-5.0mu\}$. For all $t\in\mathbb{N}$ and all graphs $G,G'$: $k$-MultisetWL is upper bounded by $k$-WL in distinguishing multisets $(G,\tilde{\boldsymbol{v}})$ and $(G',\tilde{\boldsymbol{v}}')$ at the $t$-th iteration, i.e., $\bm{wl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}')\Longrightarrow\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}')$.

Proof.

By induction on $t$. When $t=0$ the statement clearly holds, as both sides are equivalent to $G[\tilde{\boldsymbol{v}}]$ and $G'[\tilde{\boldsymbol{v}}']$ being isomorphic. Assume the statement is true up to step $t$. For the $t+1$ case, by definition the left side is equivalent to $\{\mskip-5.0mu\{\bm{wl}_{k}^{(t+1)}(G,p(\tilde{\boldsymbol{v}}))\mid p\in\text{perm}[k]\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{wl}_{k}^{(t+1)}(G',p(\tilde{\boldsymbol{v}}'))\mid p\in\text{perm}[k]\}\mskip-5.0mu\}$. Let $\overrightarrow{\boldsymbol{v}}$ be the ordered version of $\tilde{\boldsymbol{v}}$ following a canonical ordering over $G$; then there exists a bijective mapping $f$ between $\tilde{\boldsymbol{v}}$ and $\tilde{\boldsymbol{v}}'$ such that $\bm{wl}_{k}^{(t+1)}(G,\overrightarrow{\boldsymbol{v}})=\bm{wl}_{k}^{(t+1)}(G',f(\overrightarrow{\boldsymbol{v}}))$, where $f(\overrightarrow{\boldsymbol{v}}):=(f(\overrightarrow{\boldsymbol{v}}_{1}),\dots,f(\overrightarrow{\boldsymbol{v}}_{k}))$. By Theorem 5.2 of [11], for any $t$, $\bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G',f(\overrightarrow{\boldsymbol{v}}))$ is equivalent to Player II having a winning strategy for the $t$-step pebble game with initial configuration $(G,\overrightarrow{\boldsymbol{v}})$ and $(G',f(\overrightarrow{\boldsymbol{v}}))$ (please refer to [11] for the description of the pebble game). Notice that applying any permutation to the pebble game's initial configuration does not change whether Player II has a winning strategy; hence $\forall p\in\text{perm}[k]$, $\bm{wl}_{k}^{(t+1)}(G,p(\overrightarrow{\boldsymbol{v}}))=\bm{wl}_{k}^{(t+1)}(G',p(f(\overrightarrow{\boldsymbol{v}})))$. Now applying Eq. (1), we get (1) $\bm{wl}_{k}^{(t)}(G,p(\overrightarrow{\boldsymbol{v}}))=\bm{wl}_{k}^{(t)}(G',p(f(\overrightarrow{\boldsymbol{v}})))$, and (2) $\forall i\in[k]$, $\{\mskip-5.0mu\{\bm{wl}_{k}^{(t)}(G,p(\overrightarrow{\boldsymbol{v}}[x/i]))\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{wl}_{k}^{(t)}(G',p(f(\overrightarrow{\boldsymbol{v}})[x/i]))\mid x\in V(G')\}\mskip-5.0mu\}$. We rewrite (2) as: $\forall i\in[k]$, $\exists$ a bijection $h_{i}:V(G)\to V(G')$ such that $\forall x\in V(G)$, $\bm{wl}_{k}^{(t)}(G,p(\overrightarrow{\boldsymbol{v}}[x/i]))=\bm{wl}_{k}^{(t)}(G',p(f(\overrightarrow{\boldsymbol{v}})[h_{i}(x)/i]))$. As (1) and (2) hold for any $p\in\text{perm}[k]$, applying the induction hypothesis to both gives (a) $\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}')$, and (b) $\forall i\in[k]$, $\exists$ a bijection $h_{i}:V(G)\to V(G')$ such that $\forall x\in V(G)$, $\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/i])=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}'[h_{i}(x)/g(i)])$, where $g:[k]\to[k]$ is the index mapping corresponding to $f$.
We rewrite (b) as: $\exists g:[k]\to[k]$ such that $\forall i\in[k]$, $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/i])\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}'[x/g(i)])\mid x\in V(G')\}\mskip-5.0mu\}$. Combining (a) and (b) and using Eq. (3), we obtain $\bm{mwl}_{k}^{(t+1)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t+1)}(G',\tilde{\boldsymbol{v}}')$. Thus the statement holds for any $t$. ∎

A.4.3 Proof of Theorem 2

CFI Graphs and Their Properties

Cai, Fürer and Immerman [11] designed a construction of pairs of graphs CFI$(k)$ such that for any $k$, $k$-WL cannot distinguish the pair CFI$(k)$. We use the CFI construction to prove the theorem, in the variant proposed in Grohe et al.'s work [28]. Let $\mathcal{K}$ be a complete graph with $k$ nodes. We first construct an enlarged graph $\mathcal{X}(\mathcal{K})$ from the base graph $\mathcal{K}$ by mapping each node and edge to a group of vertices and connecting all $(|V(\mathcal{K})|+|E(\mathcal{K})|)$ groups of vertices following certain rules; we say "node" for the base graph and "vertex" for the enlarged graph to distinguish the two. Let $vw$ denote the edge connecting nodes $v$ and $w$. For a node $v\in V(\mathcal{K})$, let $E(v):=\{vw\mid vw\in E(\mathcal{K})\}$ denote the set of all edges adjacent to $v$ in $\mathcal{K}$.

For every node $v\in V(\mathcal{K})$, we map $v$ to a group of vertices $S_{v}:=\{v^{X}\mid X\subseteq E(v)\}$ of size $|S_{v}|=2^{\deg(v)}=2^{k-1}$. For every edge $e\in E(\mathcal{K})$, the construction maps $e$ to two vertices $S_{e}:=\{e^{0},e^{1}\}$. Hence there are $2|E(\mathcal{K})|+|V(\mathcal{K})|\cdot 2^{k-1}=k(k-1)+k\,2^{k-1}$ vertices in the enlarged graph $\mathcal{X}(\mathcal{K})$, with $V(\mathcal{X}(\mathcal{K}))=(\cup_{v\in V(\mathcal{K})}S_{v})\cup(\cup_{e\in E(\mathcal{K})}S_{e})$. Let $V^{*}:=\cup_{v\in V(\mathcal{K})}S_{v}$ and $E^{*}:=\cup_{e\in E(\mathcal{K})}S_{e}$.

The edges of $\mathcal{X}(\mathcal{K})$ are defined as follows:

$$E(\mathcal{X}(\mathcal{K})):=\{v^{X}e^{1}\mid v\in V(\mathcal{K}),X\subseteq E(v),e\in X\}\ \cup\ \{v^{X}e^{0}\mid v\in V(\mathcal{K}),X\subseteq E(v),e\in E(v)\setminus X\}\ \cup\ \{e^{0}e^{1}\mid e\in E(\mathcal{K})\}.$$

Moreover, we color the vertices such that all vertices in $S_{v}$ have color $v$ for every $v\in V(\mathcal{K})$, and all vertices in $S_{e}$ have color $e$ for every $e\in E(\mathcal{K})$. See Figure 4 (top right) for a visualization of the transformation from the base graph $\mathcal{K}$ to $\mathcal{X}(\mathcal{K})$ with $k=3$.

Figure 4: Visualization of the CFI$(k)$ construction for $k=3$.

There are several important properties of the automorphisms of $\mathcal{X}(\mathcal{K})$. Let $h\in\text{Aut}(\mathcal{X}(\mathcal{K}))$ be an automorphism; then:

  1. $h(S_{v})=S_{v}$ and $h(S_{e})=S_{e}$ for all $v\in V(\mathcal{K})$ and $e\in E(\mathcal{K})$.

  2. For every subset $F\subseteq E(\mathcal{K})$, there is exactly one automorphism $h_{F}$ that flips precisely all edges in $F$, i.e., $h_{F}(e^{0})=e^{1}$ and $h_{F}(e^{1})=e^{0}$ if and only if $e\in F$. More specifically,

    • $h_{F}(e^{i})=e^{1-i}$ for all $e\in F$;

    • $h_{F}(e^{i})=e^{i}$ for all $e\notin F$;

    • $h_{F}(v^{X})=v^{Y}$ for all $v\in V(\mathcal{K})$, $X\subseteq E(v)$,
      where $Y:=X\triangle\big{(}F\cap E(v)\big{)}=\Big{(}X\setminus\big{(}F\cap E(v)\big{)}\Big{)}\cup\Big{(}\big{(}F\cap E(v)\big{)}\setminus X\Big{)}$.

Proof.

These properties are not hard to prove. Property 1 holds by the coloring of the vertices in $\mathcal{X}(\mathcal{K})$. For the second property: since $h_{F}$ flips precisely all edges in $F$, we have $h_{F}(e^{i})=e^{1-i}$ for all $e\in F$ and $h_{F}(e^{i})=e^{i}$ for all $e\notin F$. Now write $h_{F}(v^{X})=v^{Y}$ for all $v\in V(\mathcal{K})$, $X\subseteq E(v)$; we need to determine $Y$. Focus on a node $v$ without loss of generality, and partition $E(v)$ into two parts: $E(v)\cap F$ and $E(v)\setminus F$. For any $e\in E(v)\cap F$ we know $h_{F}(e^{i})=e^{1-i}$. Let $v\leftrightarrow w$ denote that $v$ and $w$ are connected in $\mathcal{X}(\mathcal{K})$. Then for all $X\subseteq E(v)$: $e\in X\Longleftrightarrow e^{0}\leftrightarrow v^{X}\Longleftrightarrow h_{F}(e^{0})\leftrightarrow h_{F}(v^{X})\Longleftrightarrow e^{1}\leftrightarrow v^{Y}\Longleftrightarrow e\notin Y$, and similarly $e\notin X\Longleftrightarrow e\in Y$. Hence for all $e\in E(v)\cap F$ and all $X\subseteq E(v)$, $e\in X\triangle Y$, which implies $E(v)\cap F\subseteq X\triangle Y$. Following the same logic, for all $e\in E(v)\setminus F$ and all $X\subseteq E(v)$, $e\in E(v)\setminus(X\triangle Y)$, i.e., $E(v)\setminus F\subseteq E(v)\setminus(X\triangle Y)$, which implies $E(v)\cap F\supseteq X\triangle Y$ since $X\triangle Y\subseteq E(v)$. Combining both directions, for all $v\in V(\mathcal{K})$ and $X\subseteq E(v)$, $E(v)\cap F=X\triangle Y$. Hence $Y=X\triangle(X\triangle Y)=X\triangle\big{(}F\cap E(v)\big{)}$. ∎

The proof also yields another important property: for all $v$ and $X\subseteq E(v)$, $X\triangle Y=E(v)\cap F$ is constant over the input $X\subseteq E(v)$; a small self-check follows.
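The symmetric-difference rule can be checked mechanically; the small self-check below (plain Python, with illustrative edge names of ours) verifies that with $Y=X\triangle(F\cap E(v))$, membership of $e$ in $X$ flips exactly when $e\in F$, matching how $h_{F}$ exchanges $e^{0}$ and $e^{1}$.

Ev = {"e1", "e2", "e3"}   # E(v) for some node v
F = {"e2", "e5"}          # the edges flipped by h_F
for X in [set(), {"e1"}, {"e1", "e2"}, set(Ev)]:
    Y = X ^ (F & Ev)      # Y = X symmetric-difference (F intersect E(v))
    for e in Ev:
        # v^X -- e^{[e in X]} maps consistently iff the flip matches F
        assert ((e in X) != (e in Y)) == (e in F)
print("h_F flip rule consistent")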

We are now ready to construct non-isomorphic variants of the enlarged graph $\mathcal{X}(\mathcal{K})$. Let $T$ be a subset of $V(\mathcal{K})$. We define an induced subgraph $\mathcal{X}_{T}(\mathcal{K})$ of $\mathcal{X}(\mathcal{K})$ via the new node groups

$$S_{v}^{T}:=\begin{cases}\{v^{X}\in S_{v}\mid |X|\equiv 0\bmod 2\}&\text{if }v\notin T\\ \{v^{X}\in S_{v}\mid |X|\equiv 1\bmod 2\}&\text{if }v\in T\end{cases}$$

The induced subgraph is then defined as $\mathcal{X}_{T}(\mathcal{K}):=\mathcal{X}(\mathcal{K})[(\cup_{v\in V(\mathcal{K})}S^{T}_{v})\cup E^{*}]$. Figure 4 shows $\mathcal{X}_{\emptyset}(\mathcal{K})$ in the bottom left (labeled $\mathcal{X}_{0}(\mathcal{K})$) and $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$ in the bottom right (labeled $\mathcal{X}_{1}(\mathcal{K})$).

Lemma 1.

For all $T,U\subseteq V(\mathcal{K})$, $\mathcal{X}_{T}(\mathcal{K})\cong\mathcal{X}_{U}(\mathcal{K})$ if and only if $|T|\equiv|U|\bmod 2$. Moreover, if they are isomorphic, one isomorphism between $\mathcal{X}_{T}(\mathcal{K})$ and $\mathcal{X}_{U}(\mathcal{K})$ is $h_{F}$ with $F=E(\mathcal{K}[T\triangle U])$, where $E(\mathcal{K}[T\triangle U])$ denotes the set of all edges $\{v_{i},v_{j}\}\subseteq T\triangle U$.

Notice that $h_{F}$ is an automorphism of $\mathcal{X}(\mathcal{K})$, but restricting its domain to $(\cup_{v\in V(\mathcal{K})}S^{T}_{v})\cup E^{*}$ makes it an isomorphism between $\mathcal{X}_{T}(\mathcal{K})$ and $\mathcal{X}_{U}(\mathcal{K})$.

The proof of the lemma can be found in [11] and [28]. With this lemma we know that $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$ and $\mathcal{X}_{\emptyset}(\mathcal{K})$ are not isomorphic; a small computational check is sketched below. In the next part we show that $\mathcal{X}_{\emptyset}(\mathcal{K})$ and $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$ cannot be distinguished by $(k-1)$-WL but can be distinguished by $k$-MultisetWL, thus proving Theorem 2.
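As a concrete check of Lemma 1, the sketch below (assuming networkx; builder and names are ours, not from [11] or [28]) constructs the colored graphs $\mathcal{X}_{T}(\mathcal{K})$ for $k=3$ directly via the parity filter and runs an exact color-preserving isomorphism test: the parity of $|T|$ decides isomorphism.

from itertools import chain, combinations
import networkx as nx

def powerset(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def build_cfi(k, T):
    K = nx.complete_graph(k)                      # base graph K on k nodes
    G = nx.Graph()
    for e in K.edges():
        e = tuple(sorted(e))
        for i in (0, 1):                          # S_e = {e^0, e^1}
            G.add_node(("E", e, i), color=("e", e))
        G.add_edge(("E", e, 0), ("E", e, 1))      # the edge e^0 -- e^1
    for v in K.nodes():
        Ev = [tuple(sorted((v, w))) for w in K.neighbors(v)]
        for X in powerset(Ev):
            if len(X) % 2 != (1 if v in T else 0):
                continue                          # parity filter defining X_T
            G.add_node(("V", v, frozenset(X)), color=("v", v))
            for e in Ev:                          # v^X -- e^1 if e in X else e^0
                G.add_edge(("V", v, frozenset(X)), ("E", e, 1 if e in X else 0))
    return G

match = lambda a, b: a["color"] == b["color"]     # respect the vertex colors
G0, G1, G2 = build_cfi(3, set()), build_cfi(3, {0}), build_cfi(3, {0, 1})
print(nx.is_isomorphic(G0, G1, node_match=match))  # False: |T| parities differ
print(nx.is_isomorphic(G0, G2, node_match=match))  # True: both |T| even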

Main Proof

Theorem 2.

$k$-MultisetWL is no less powerful than $(k-1)$-WL in distinguishing graphs: for $k\geq 3$ there is a pair of graphs that can be distinguished by $k$-MultisetWL but not by $(k-1)$-WL.

Proof.

To prove the theorem, we show that for any $k\geq 3$, $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$ and $\mathcal{X}_{\emptyset}(\mathcal{K})$, defined previously, are two non-isomorphic graphs that can be distinguished by $k$-MultisetWL but not by $(k-1)$-WL. It is well known that these two graphs cannot be distinguished by $(k-1)$-WL; see Theorem 5.17 in [28] for a proof. To show that they can be distinguished by $k$-MultisetWL, by Theorem 11 it suffices to show that Player I has a winning strategy for the doubly bijective $k$-pebble game $\text{DBP}_{k}(\mathcal{X}_{\emptyset}(\mathcal{K}),\mathcal{X}_{\{v_{1}\}}(\mathcal{K}))$.

At the start of the game $\text{DBP}_{k}(\mathcal{X}_{\emptyset}(\mathcal{K}),\mathcal{X}_{\{v_{1}\}}(\mathcal{K}))$, Player II is asked to provide a bijection from all $k$-multisets of $\mathcal{X}_{\emptyset}(\mathcal{K})$ to all $k$-multisets of $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$. Whichever bijection Player II chooses, Player I can pick $\tilde{\boldsymbol{v}}_{0}=\{\mskip-5.0mu\{v_{1}^{\emptyset},v_{2}^{\emptyset},\dots,v_{k}^{\emptyset}\}\mskip-5.0mu\}$, with corresponding position $\tilde{\boldsymbol{v}}'_{0}=\{\mskip-5.0mu\{v_{1}^{X_{1}},v_{2}^{X_{2}},\dots,v_{k}^{X_{k}}\}\mskip-5.0mu\}$. A position $(\tilde{\boldsymbol{x}},\tilde{\boldsymbol{y}}):=(\{\mskip-5.0mu\{x_{1},\dots,x_{k}\}\mskip-5.0mu\},\{\mskip-5.0mu\{y_{1},\dots,y_{k}\}\mskip-5.0mu\})$ is called consistent if there exists an $F\subseteq E(\mathcal{K})$ such that $h_{F}(\tilde{\boldsymbol{x}})=\{\mskip-5.0mu\{h_{F}(x_{1}),\dots,h_{F}(x_{k})\}\mskip-5.0mu\}=\tilde{\boldsymbol{y}}$. One can show that Player I can easily win the game $\text{DBP}_{k}(\mathcal{X}_{\emptyset}(\mathcal{K}),\mathcal{X}_{\{v_{1}\}}(\mathcal{K}))$ after one additional step if the current position $(\tilde{\boldsymbol{x}},\tilde{\boldsymbol{y}})$ is not consistent, even if $\mathcal{X}_{\emptyset}(\mathcal{K})[\tilde{\boldsymbol{x}}]$ and $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})[\tilde{\boldsymbol{y}}]$ are isomorphic [28].

We claim that for any $k$-multiset $\tilde{\boldsymbol{v}}'_{0}=\{\mskip-5.0mu\{v_{1}^{X_{1}},v_{2}^{X_{2}},\dots,v_{k}^{X_{k}}\}\mskip-5.0mu\}$ the position $(\tilde{\boldsymbol{v}}_{0},\tilde{\boldsymbol{v}}'_{0})$ cannot be consistent, and thus Player II loses $\text{DBP}_{k}(\mathcal{X}_{\emptyset}(\mathcal{K}),\mathcal{X}_{\{v_{1}\}}(\mathcal{K}))$ after the first round. Assume there exists a subset $F\subseteq E(\mathcal{K})$ such that $h_{F}(\tilde{\boldsymbol{v}}_{0})=\tilde{\boldsymbol{v}}'_{0}$. By property 1 of the automorphisms of $\mathcal{X}(\mathcal{K})$ we have $h_{F}(S_{v_{1}})=S_{v_{1}}$, and since $\tilde{\boldsymbol{v}}_{0}$ contains exactly one vertex of each color, it also holds that $h_{F}(v_{1}^{\emptyset})=v_{1}^{X_{1}}$. It follows from property 2 that $X_{1}=\emptyset\triangle\big{(}F\cap E(v_{1})\big{)}=F\cap E(v_{1})$. By the definition of $\mathcal{X}_{\{v_{1}\}}(\mathcal{K})$, $|X_{1}|\equiv 1\bmod 2$, so $F$ flips an odd number of neighbors of $v_{1}$, i.e., $|F\cap E(v_{1})|\equiv 1\bmod 2$. Similarly, $h_{F}(S_{v_{i}})=S_{v_{i}}$ and $|X_{i}|\equiv 0\bmod 2$ for all $i\geq 2$, hence $|F\cap E(v_{i})|\equiv 0\bmod 2$. A simple handshake argument then gives an $m\geq 0$ such that

$$2m=\sum_{v\in V(\mathcal{K})}|F\cap E(v)|=\sum_{i\in[k]\setminus\{1\}}|F\cap E(v_{i})|+|F\cap E(v_{1})|\equiv 1\bmod 2,$$

which is a contradiction. Hence Player I has a winning strategy for $\text{DBP}_{k}(\mathcal{X}_{\emptyset}(\mathcal{K}),\mathcal{X}_{\{v_{1}\}}(\mathcal{K}))$. ∎
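The handshake contradiction can also be verified by brute force for small $k$: the snippet below (plain Python) enumerates all $F\subseteq E(\mathcal{K})$ and confirms that none has $|F\cap E(v_{1})|$ odd while every $|F\cap E(v_{i})|$, $i\geq 2$, is even.

from itertools import combinations

for k in (3, 4):
    nodes = range(k)
    edges = list(combinations(nodes, 2))  # E(K) for the complete graph
    for bits in range(2 ** len(edges)):
        F = {e for j, e in enumerate(edges) if bits >> j & 1}
        degF = [sum(v in e for e in F) for v in nodes]  # |F ∩ E(v)| per node
        assert not (degF[0] % 2 == 1 and all(d % 2 == 0 for d in degF[1:]))
    print("no consistent F exists for k =", k)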

A.4.4 Proof of Theorem 3

See Appendix A.3.

A.4.5 Proof of Theorem 4

Theorem 4.

Let $k\geq 1$. Then for all $t\in\mathbb{N}$ and all graphs $G,G'$: $\bm{mwl}_{k}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}')\Longrightarrow\bm{swl}_{k}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}')$.

Proof.

Notation: let $s(\cdot)$ transform a multiset into a set by removing repeats, and let $r(\cdot)$ return a tuple with the number of repeats of each distinct element in a multiset. Let $f$ be the inverse mapping such that $\tilde{\boldsymbol{v}}=f(s(\tilde{\boldsymbol{v}}),r(\tilde{\boldsymbol{v}}))$.
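A tiny illustration of this notation (plain Python, representing multisets as sorted tuples; the helper names mirror the symbols above):

from collections import Counter

def s(m):  # drop repeats
    return tuple(sorted(set(m)))

def r(m):  # multiplicity of each distinct element, in the order of s(m)
    return tuple(Counter(m)[x] for x in s(m))

def f(keys, counts):  # inverse: rebuild the multiset from s(.) and r(.)
    return tuple(x for x, n in zip(keys, counts) for _ in range(n))

m = (1, 1, 2, 3, 3, 3)
assert f(s(m), r(m)) == m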

Define $F^{(t+1)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}'):=\{\text{injective }f:\tilde{\boldsymbol{v}}\to\tilde{\boldsymbol{v}}'\mid f\in F^{(t)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')$, and $\forall y\in\tilde{\boldsymbol{v}}$, $\exists h_{y}:V(G)\to V(G')$, $\forall x$, $f\in F^{(t)}(G,G',\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)],\tilde{\boldsymbol{v}}'[h_{y}(x)/\text{idx}_{\tilde{\boldsymbol{v}}'}(f(y))])\}$. Let $F^{(0)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}'):=\{f\mid f\text{ is an isomorphism from }G[\tilde{\boldsymbol{v}}]\text{ to }G'[\tilde{\boldsymbol{v}}']\}$.

Lemma 2.

For all $t$: $\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{v}}')$ $\Longleftrightarrow$ $\forall h\in F^{(t)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')$ and $\forall\hat{\boldsymbol{n}}$ with $\sum_{i=1}^{m}\hat{\boldsymbol{n}}_{i}=k$ and $\hat{\boldsymbol{n}}_{i}\geq 1$ for all $i$: $\bm{mwl}_{k}^{(t)}(G,f(s(\tilde{\boldsymbol{v}}),\hat{\boldsymbol{n}}))=\bm{mwl}_{k}^{(t)}(G',f(h(s(\tilde{\boldsymbol{v}})),\hat{\boldsymbol{n}}))$.

When $t=0$, the statement is true by the definition of $\bm{swl}_{k}^{(0)}$. Now hypothesize that the statement is true for all steps $\leq t$. For the $t+1$ case, the left side implies that there exist a $\tilde{\boldsymbol{v}}$ with $s(\tilde{\boldsymbol{v}})=\hat{\boldsymbol{v}}$ and a $\tilde{\boldsymbol{v}}'$ with $s(\tilde{\boldsymbol{v}}')=\hat{\boldsymbol{v}}'$ such that $\bm{mwl}_{k}^{(t+1)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t+1)}(G',\tilde{\boldsymbol{v}}')$. Moreover, for a mapping $h\in F^{(t+1)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')$, we have for all $\hat{\boldsymbol{n}}$ with $\sum_{i=1}^{m}\hat{\boldsymbol{n}}_{i}=k$ and $\hat{\boldsymbol{n}}_{i}\geq 1$: $\bm{mwl}_{k}^{(t+1)}(G,f(s(\tilde{\boldsymbol{v}}),\hat{\boldsymbol{n}}))=\bm{mwl}_{k}^{(t+1)}(G',f(h(s(\tilde{\boldsymbol{v}})),\hat{\boldsymbol{n}}))$.

For a specific $\hat{\boldsymbol{n}}$, define $\tilde{\boldsymbol{u}}=f(s(\tilde{\boldsymbol{v}}),\hat{\boldsymbol{n}})$ and $\tilde{\boldsymbol{u}}'=f(h(s(\tilde{\boldsymbol{v}})),\hat{\boldsymbol{n}})=h(\tilde{\boldsymbol{u}})$. By Eq. (3):
(1) $\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{u}})=\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{u}}')$;
(2) $\forall f\in F^{(t+1)}(G,G',\tilde{\boldsymbol{u}},\tilde{\boldsymbol{u}}')$, $\forall y\in\tilde{\boldsymbol{u}}$, $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{u}}[x/\text{idx}_{\tilde{\boldsymbol{u}}}(y)])\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G',\tilde{\boldsymbol{u}}'[x/\text{idx}_{\tilde{\boldsymbol{u}}'}(f(y))])\mid x\in V(G')\}\mskip-5.0mu\}$.

Since $h\in F^{(t+1)}(G,G',\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}')=F^{(t+1)}(G,G',\tilde{\boldsymbol{u}},\tilde{\boldsymbol{u}}')$, we may instantiate (2) with $f=h$:
(2′) $\forall y\in\tilde{\boldsymbol{u}}$, $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{u}}[x/\text{idx}_{\tilde{\boldsymbol{u}}}(y)])\mid x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G',h(\tilde{\boldsymbol{u}})[x/\text{idx}_{\tilde{\boldsymbol{u}}'}(h(y))])\mid x\in V(G')\}\mskip-5.0mu\}$,
which can be split into two parts:
(2a) $\forall y\in\tilde{\boldsymbol{u}}$, $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{u}}[x/\text{idx}_{\tilde{\boldsymbol{u}}}(y)])\mid x\in s(\tilde{\boldsymbol{u}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G',h(\tilde{\boldsymbol{u}})[x/\text{idx}_{\tilde{\boldsymbol{u}}'}(h(y))])\mid x\in s(h(\tilde{\boldsymbol{u}}))\}\mskip-5.0mu\}$; and
(2b) $\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{u}}[x/\text{idx}_{\tilde{\boldsymbol{u}}}(y)])\mid x\in V(G)\setminus s(\tilde{\boldsymbol{u}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G',h(\tilde{\boldsymbol{u}})[x/\text{idx}_{\tilde{\boldsymbol{u}}'}(h(y))])\mid x\in V(G')\setminus s(h(\tilde{\boldsymbol{u}}))\}\mskip-5.0mu\}$.

Combining Lemma 2 and the induction hypothesis, and knowing that $s(\tilde{\boldsymbol{u}})=\hat{\boldsymbol{v}}$ and $s(\tilde{\boldsymbol{u}}')=\hat{\boldsymbol{v}}'$, we get:
(a) $\bm{swl}_{k}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}')$;
(b) $\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G,\hat{\boldsymbol{v}}\setminus\{y\})\mid y\in\hat{\boldsymbol{v}}\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}'\setminus\{h(y)\})\mid y\in\hat{\boldsymbol{v}}\}\mskip-5.0mu\}$, derived by picking $\hat{\boldsymbol{n}}$ with $\hat{\boldsymbol{n}}[y]=1$, so that $\tilde{\boldsymbol{u}}$ contains only one $y$ and $\tilde{\boldsymbol{u}}'$ only one $h(y)$;
(c) $\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G,\hat{\boldsymbol{v}}\cup\{x\})\mid x\in V(G)\setminus\hat{\boldsymbol{v}}\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}'\cup\{x\})\mid x\in V(G')\setminus\hat{\boldsymbol{v}}'\}\mskip-5.0mu\}$, derived by picking $\hat{\boldsymbol{n}}$ with $\hat{\boldsymbol{n}}[y]>1$;
(d) $\forall y\in\hat{\boldsymbol{v}}$, $\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G,\hat{\boldsymbol{v}}\setminus\{y\}\cup\{x\})\mid x\in V(G)\setminus\hat{\boldsymbol{v}}\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}_{k}^{(t)}(G',\hat{\boldsymbol{v}}'\setminus\{h(y)\}\cup\{x\})\mid x\in V(G')\setminus\hat{\boldsymbol{v}}'\}\mskip-5.0mu\}$, derived by picking $\hat{\boldsymbol{n}}$ with $\hat{\boldsymbol{n}}[y]=1$, so that $\tilde{\boldsymbol{u}}$ contains only one $y$ and $\tilde{\boldsymbol{u}}'$ only one $h(y)$.

Now combining (a), (b), (c), (d) and using Eq. (5), we get $\bm{swl}_{k}^{(t+1)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k}^{(t+1)}(G',\hat{\boldsymbol{v}}')$, which proves the statement. ∎

A.4.6 Proof of Theorem 5

Theorem 5.

Let $k\geq 1$. Then for all $t\in\mathbb{N}$ and all graphs $G,G'$:

  • (1) when $1\leq c_{1}<c_{2}\leq k$, if $G,G'$ cannot be distinguished by $(k,c_{2})(\leq)$-SetWL, then they cannot be distinguished by $(k,c_{1})(\leq)$-SetWL;

  • (2) when $k_{1}<k_{2}$, for all $c\leq k_{1}$, if $G,G'$ cannot be distinguished by $(k_{2},c)(\leq)$-SetWL, then they cannot be distinguished by $(k_{1},c)(\leq)$-SetWL.

Proof.

We prove (2) first; the proof of (1) follows the same argument. To aid understanding, we first restate the update rule of $(k,c)(\leq)$-SetWL, with a small sketch of one refinement step given after the equations.

$$\bm{swl}_{k,c}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{v}})=\{\mskip-5.0mu\{\bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{u}})\mid\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k,c,\text{right}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} \qquad (23)$$

$$\bm{swl}_{k,c}^{(t+1)}(G,\hat{\boldsymbol{v}})=\Big{(}\bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{v}}),\ \bm{swl}_{k,c}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{v}}),\ \{\mskip-5.0mu\{\bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{u}})\mid\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k,c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\},\ \{\mskip-5.0mu\{\bm{swl}_{k,c}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{u}})\mid\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k,c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}\Big{)} \qquad (24,25)$$
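A compact sketch (plain Python) of one refinement step of Eqs. (23)–(25): colors are compressed with hash, multisets are encoded as sorted tuples, and left/right are assumed precomputed supergraph adjacencies (subset and superset neighbors); this is an illustration of the update rule, not the released implementation.

def refine(color, left, right):
    # Eq. (23): half-step color from right (superset) neighbors.
    half = {S: hash((color[S], tuple(sorted(color[U] for U in right[S]))))
            for S in color}
    # Eqs. (24)-(25): full-step color from the old color, the half-step
    # color, and both multisets over left (subset) neighbors.
    return {S: hash((color[S], half[S],
                     tuple(sorted(color[U] for U in left[S])),
                     tuple(sorted(half[U] for U in left[S]))))
            for S in color}

# Toy supergraph: the set {1,2} has {1} and {2} as left neighbors.
color = {"1": 0, "2": 0, "12": 1}
left = {"1": [], "2": [], "12": ["1", "2"]}
right = {"1": ["12"], "2": ["12"], "12": []}
print(refine(color, left, right))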
Lemma 3.

For $k_{1}<k_{2}$, for any $t$ and any $\hat{\boldsymbol{v}},\hat{\boldsymbol{v}}'$ with at most $k_{1}$ nodes: $\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{2},c}^{(t)}(G',\hat{\boldsymbol{v}}')\Longrightarrow\bm{swl}_{k_{1},c}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{1},c}^{(t)}(G',\hat{\boldsymbol{v}}')$.

Proof of Lemma 3.

We prove this by induction. As the color initialization of $(k_{1},c)(\leq)$-SetWL and $(k_{2},c)(\leq)$-SetWL is the same, the statement holds for $t=0$. Assume it holds for all steps $\leq t$. For $t+1$, $\bm{swl}_{k_{2},c}^{(t+1)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{2},c}^{(t+1)}(G',\hat{\boldsymbol{v}}')$ implies:

(1) $\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{2},c}^{(t)}(G',\hat{\boldsymbol{v}}')$;
(2) $\bm{swl}_{k_{2},c}^{(t+1/2)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{2},c}^{(t+1/2)}(G',\hat{\boldsymbol{v}}')$;
(3) $\{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{u}})\mid\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G',\hat{\boldsymbol{u}}')\mid\hat{\boldsymbol{u}}'\in\mathcal{N}^{G'}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}}')\}\mskip-5.0mu\}$;
(4) $\{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t+1/2)}(G,\hat{\boldsymbol{u}})\mid\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t+1/2)}(G',\hat{\boldsymbol{u}}')\mid\hat{\boldsymbol{u}}'\in\mathcal{N}^{G'}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}}')\}\mskip-5.0mu\}$.

(1) + induction hypothesis \Longrightarrow (a) 𝒔𝒘𝒍k1,c(t)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t)}(G,\hat{\boldsymbol{v}}) = 𝒔𝒘𝒍k1,c(t)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})
(2) has two situations: |𝒗^|=|𝒗^|<k1|\hat{\boldsymbol{v}}|=|\hat{\boldsymbol{v}}^{\prime}|<k_{1} and |𝒗^|=|𝒗^|=k1|\hat{\boldsymbol{v}}|=|\hat{\boldsymbol{v}}^{\prime}|=k_{1}. For the first situation, 𝒩k2,c,rightG(𝒗^)=𝒩k1,c,rightG(𝒗^)\mathcal{N}^{G}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}})=\mathcal{N}^{G}_{k_{1},c,\text{right}}(\hat{\boldsymbol{v}}) and 𝒩k2,c,rightG(𝒗^)=𝒩k1,c,rightG(𝒗^)\mathcal{N}^{G^{\prime}}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}}^{\prime})=\mathcal{N}^{G^{\prime}}_{k_{1},c,\text{right}}(\hat{\boldsymbol{v}}^{\prime}), \exists a bijective mapping bb between 𝒩k2,c,rightG(𝒗^)\mathcal{N}^{G^{\prime}}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}}^{\prime}) and 𝒩k2,c,rightG(𝒗^)\mathcal{N}^{G}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}}) such that 𝒖^𝒩k2,c,rightG(𝒗^)\forall\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}}), 𝒔𝒘𝒍k2,c(t)(G,𝒖^)\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{u}}) = 𝒔𝒘𝒍k2,c(t)(G,b(𝒖^))\bm{swl}_{k_{2},c}^{(t)}(G^{\prime},b(\hat{\boldsymbol{u}})) and by induction 𝒔𝒘𝒍k1,c(t)(G,𝒖^)\bm{swl}_{k_{1},c}^{(t)}(G,\hat{\boldsymbol{u}}) = 𝒔𝒘𝒍k1,c(t)(G,b(𝒖^))\bm{swl}_{k_{1},c}^{(t)}(G^{\prime},b(\hat{\boldsymbol{u}})), hence 𝒔𝒘𝒍k1,c(t+1/2)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t+1/2)}(G,\hat{\boldsymbol{v}}) = 𝒔𝒘𝒍k1,c(t+1/2)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}). For the second situation, 𝒩k2,c,rightG(𝒗^)\mathcal{N}^{G^{\prime}}_{k_{2},c,\text{right}}(\hat{\boldsymbol{v}}^{\prime}) elements while 𝒩k1,c,rightG(𝒗^)=\mathcal{N}^{G^{\prime}}_{k_{1},c,\text{right}}(\hat{\boldsymbol{v}}^{\prime})=\emptyset, then clearly (b) 𝒔𝒘𝒍k1,c(t+1/2)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t+1/2)}(G,\hat{\boldsymbol{v}}) = 𝒔𝒘𝒍k1,c(t+1/2)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) (both are empty multisets).
(3) follows from the same argument as the first case of (2); with the induction hypothesis we obtain (c) \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{1},c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{k_{1},c,\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}
(4) implies that there exists a bijective mapping b between \mathcal{N}^{G}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}}) and \mathcal{N}^{G^{\prime}}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}}^{\prime}) such that for any \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{2},c,\text{left}}(\hat{\boldsymbol{v}}), \bm{swl}_{k_{2},c}^{(t+1/2)}(G,\hat{\boldsymbol{u}}) = \bm{swl}_{k_{2},c}^{(t+1/2)}(G^{\prime},b(\hat{\boldsymbol{u}})). Using the same argument as in (2) together with the induction hypothesis, this implies that for any \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{1},c,\text{left}}(\hat{\boldsymbol{v}}), \bm{swl}_{k_{1},c}^{(t+1/2)}(G,\hat{\boldsymbol{u}}) = \bm{swl}_{k_{1},c}^{(t+1/2)}(G^{\prime},b(\hat{\boldsymbol{u}})). Hence we get (d) \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t+1/2)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{k_{1},c,\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{k_{1},c,\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}.

Combining (a) (b) (c) (d) we get 𝒔𝒘𝒍k1,c(t+1)(G,𝒗^)=𝒔𝒘𝒍k1,c(t+1)(G,𝒗^)\bm{swl}_{k_{1},c}^{(t+1)}(G,\hat{\boldsymbol{v}})=\bm{swl}_{k_{1},c}^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}). ∎

With Lemma 3 we are ready to prove (2) in Theorem 3. When two graphs G and G^{\prime} cannot be distinguished by (k_{2},c)(\leq)-SetWL, we have \{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{v}})\ |\ \hat{\boldsymbol{v}}\in V(S_{k_{2},c\text{-swl}}(G))\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})\ |\ \hat{\boldsymbol{v}}^{\prime}\in V(S_{k_{2},c\text{-swl}}(G^{\prime}))\}\mskip-5.0mu\}. When two sets \hat{\boldsymbol{v}} and \hat{\boldsymbol{u}} differ in k (their number of nodes) or in cc (the number of components of their induced subgraphs), their colors cannot be the same (they already differ at t=0). Hence the above equality is equivalent to: \forall k\leq k_{2}, cc\leq c, \{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G,\hat{\boldsymbol{v}})\ |\ \hat{\boldsymbol{v}}\in V_{k,cc}(S_{k_{2},c\text{-swl}}(G))\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{swl}_{k_{2},c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})\ |\ \hat{\boldsymbol{v}}^{\prime}\in V_{k,cc}(S_{k_{2},c\text{-swl}}(G^{\prime}))\}\mskip-5.0mu\}. With Lemma 3 we then get that \forall k\leq k_{1}, cc\leq c, \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t)}(G,\hat{\boldsymbol{v}})\ |\ \hat{\boldsymbol{v}}\in V_{k,cc}(S_{k_{1},c\text{-swl}}(G))\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{swl}_{k_{1},c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})\ |\ \hat{\boldsymbol{v}}^{\prime}\in V_{k,cc}(S_{k_{1},c\text{-swl}}(G^{\prime}))\}\mskip-5.0mu\}. Hence the two graphs G and G^{\prime} cannot be distinguished by (k_{1},c)(\leq)-SetWL either. ∎

A.4.7 Proof of Theorem 6

Theorem 6.

When (i) the BaseGNN can distinguish any non-isomorphic graphs with at most k nodes, (ii) all MLPs have sufficient depth and width, and (iii) POOL is an injective function, then for any t\in\mathbb{N}, the t-layer (k,c)(\leq)-SetGNN is as expressive as (k,c)(\leq)-SetWL at the t-th iteration.

Proof.

For any G,\hat{\boldsymbol{v}}, let \bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{v}}) denote the color of \hat{\boldsymbol{v}} at the t-th iteration of (k,c)(\leq)-SetWL, and let h_{k,c}^{(t)}(G,\hat{\boldsymbol{v}}) denote the embedding of \hat{\boldsymbol{v}} at the t-th layer of (k,c)(\leq)-SetGNN. We prove the theorem by showing that \bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{v}}) = \bm{swl}_{k,c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) \Longleftrightarrow h_{k,c}^{(t)}(G,\hat{\boldsymbol{v}}) = h_{k,c}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}). We drop the subscript k,c where no confusion arises.

For easier reference, recall the update rule at iteration t of (k,c)(\leq)-SetWL:

𝒔𝒘𝒍(t+12)(G,𝒗^)=\displaystyle\bm{swl}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{v}})= {{𝒔𝒘𝒍k,c(t)(G,𝒖^)|𝒖^𝒩rightG(𝒗^)}}\displaystyle\{\mskip-5.0mu\{\bm{swl}_{k,c}^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} (26)
𝒔𝒘𝒍(t+1)(G,𝒗^)=\displaystyle\bm{swl}^{(t+1)}(G,\hat{\boldsymbol{v}})= (𝒔𝒘𝒍(t)(G,𝒗^),𝒔𝒘𝒍(t+12)(G,𝒗^),\displaystyle\bigg{(}\bm{swl}^{(t)}(G,\hat{\boldsymbol{v}}),\bm{swl}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{v}}), (27)
{{𝒔𝒘𝒍(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}},{{𝒔𝒘𝒍(t+12)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}})\displaystyle\{\mskip-5.0mu\{\bm{swl}^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\},\{\mskip-5.0mu\{\bm{swl}^{(t+\frac{1}{2})}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}\bigg{)} (28)

The corresponding update rule of (k,c)(\leq)-SetGNN is

h(t+12)(𝒗^)=\displaystyle h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}})= 𝒖^𝒩rightG(𝒗^)MLP(t+12)(h(t)(𝒖^))\displaystyle\sum_{\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}^{(t+\frac{1}{2})}(h^{(t)}(\hat{\boldsymbol{u}})) (29)
h(t+1)(𝒗^)=\displaystyle h^{(t+1)}(\hat{\boldsymbol{v}})= MLP(t)(h(t)(𝒗^),h(t+12)(𝒗^),\displaystyle\text{MLP}^{(t)}\Big{(}h^{(t)}(\hat{\boldsymbol{v}}),h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}}), (30)
𝒖^𝒩leftG(𝒗^)MLPA(t)(h(t)(𝒖^)),𝒖^𝒩leftG(𝒗^)MLPB(t)(h(t+12)(𝒖^)))\displaystyle\sum_{\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})}\text{MLP}^{(t)}_{A}(h^{(t)}(\hat{\boldsymbol{u}})),\sum_{\hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})}\text{MLP}^{(t)}_{B}(h^{(t+\frac{1}{2})}(\hat{\boldsymbol{u}}))\Big{)} (31)
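To make the update concrete, the following is a minimal PyTorch-style sketch of Eqs.(29)-(31). It is an illustration under our own naming (SetGNNLayer, left_nbrs, right_nbrs are assumptions, not the released implementation), with the supergraph given as per-set index lists of left (subset) and right (superset) neighbors.

import torch
import torch.nn as nn

def mlp(dim):
    # Two-layer MLP used for every transform in this sketch.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class SetGNNLayer(nn.Module):
    # One (k,c)(<=)-SetGNN layer following Eqs.(29)-(31): a half step that
    # pools right (superset) neighbors, then a full step combining the set's
    # own state, its half-step state, and two sums over left (subset) neighbors.
    def __init__(self, dim):
        super().__init__()
        self.mlp_half = mlp(dim)  # MLP^{(t+1/2)} in Eq.(29)
        self.mlp_a = mlp(dim)     # MLP_A^{(t)} in Eq.(31)
        self.mlp_b = mlp(dim)     # MLP_B^{(t)} in Eq.(31)
        self.mlp_out = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))  # MLP^{(t)} in Eq.(31)

    def forward(self, h, left_nbrs, right_nbrs):
        # h: [num_sets, dim]; left_nbrs/right_nbrs: lists of LongTensors
        # holding each set's neighbor indices in the supergraph.
        def pool(f, states, nbrs):
            return torch.stack([f(states[nb]).sum(0) if len(nb)
                                else torch.zeros_like(h[0]) for nb in nbrs])
        h_half = pool(self.mlp_half, h, right_nbrs)          # Eq.(29)
        agg_a = pool(self.mlp_a, h, left_nbrs)
        agg_b = pool(self.mlp_b, h_half, left_nbrs)
        return self.mlp_out(torch.cat([h, h_half, agg_a, agg_b], dim=-1))  # Eq.(31)

The per-set loop mirrors the multiset sums in the equations literally; an efficient implementation would batch these sums with sparse matrix products instead.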

At t=0, given a powerful enough BaseGNN satisfying condition (i), h^{(0)}(G,\hat{\boldsymbol{v}}) = h^{(0)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) \Longleftrightarrow G[\hat{\boldsymbol{v}}] and G^{\prime}[\hat{\boldsymbol{v}}^{\prime}] are isomorphic \Longleftrightarrow \bm{swl}^{(0)}(G,\hat{\boldsymbol{v}}) = \bm{swl}^{(0)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}).

Now assume the claim \bm{swl}^{(t)}(G,\hat{\boldsymbol{v}}) = \bm{swl}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) \Longleftrightarrow h^{(t)}(G,\hat{\boldsymbol{v}}) = h^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) holds for all iterations up to t (for any \hat{\boldsymbol{v}} and \hat{\boldsymbol{v}}^{\prime}). We prove it holds for iteration t+1. We first prove the forward direction. \bm{swl}^{(t+1)}(G,\hat{\boldsymbol{v}}) = \bm{swl}^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) implies that

(1) 𝒔𝒘𝒍(t)(G,𝒗^)\bm{swl}^{(t)}(G,\hat{\boldsymbol{v}}) = 𝒔𝒘𝒍(t)(G,𝒗^)\bm{swl}^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})
(2) 𝒔𝒘𝒍(t+1/2)(G,𝒗^)\bm{swl}^{(t+1/2)}(G,\hat{\boldsymbol{v}}) = 𝒔𝒘𝒍(t+1/2)(G,𝒗^)\bm{swl}^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})
(3) {{𝒔𝒘𝒍(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{\bm{swl}^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{𝒔𝒘𝒍(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{\bm{swl}^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}
(4) {{𝒔𝒘𝒍(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{\bm{swl}^{(t+1/2)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{𝒔𝒘𝒍(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{\bm{swl}^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}

(1) and (3) can be directly translated into relationships between h using the inductive hypothesis. Formally, we have
(a) h(t)(G,𝒗^)h^{(t)}(G,\hat{\boldsymbol{v}}) = h(t)(G,𝒗^)h^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) and
(c) {{h(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}

(2) and (4) need one additional step. Notice that (2) is equivalent to \{\mskip-5.0mu\{\bm{swl}^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{swl}^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{right}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}, and by the inductive hypothesis we know \{\mskip-5.0mu\{h^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}=\{\mskip-5.0mu\{h^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{right}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}. As Eq.(29) applies an MLP with summation, which is invariant to ordering, to the multiset \{\mskip-5.0mu\{h^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\}, the same multiset leads to the same output. Hence we know
(b) h(t+1/2)(G,𝒗^)h^{(t+1/2)}(G,\hat{\boldsymbol{v}}) = h(t+1/2)(G,𝒗^)h^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}).

For (4), we know that there is a bijective mapping g between \mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}}) and \mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime}) such that \bm{swl}^{(t+1/2)}(G,\hat{\boldsymbol{u}})=\bm{swl}^{(t+1/2)}(G^{\prime},g(\hat{\boldsymbol{u}})). Then, using the same argument as (2) \Longrightarrow (b), we get h^{(t+1/2)}(G,\hat{\boldsymbol{u}})=h^{(t+1/2)}(G^{\prime},g(\hat{\boldsymbol{u}})) for any \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}}), which is equivalent to
(d) {{h(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t+1/2)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}.

Combining (a)(b)(c)(d) and using the permutation invariance of the MLP with summation, we can derive that h^{(t+1)}(G,\hat{\boldsymbol{v}}) = h^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}).

Now for the backward direction. We first recall the key property of an MLP with summation, from DeepSets [64] and Lemma 5 of GIN [61].

Lemma 4 (Lemma 5 of [61]).

Assume \mathcal{X} is countable. There exists a function f:\mathcal{X}\rightarrow\mathbb{R}^{n} so that h(X)=\sum_{x\in X}f(x) is unique for each multiset X\subset\mathcal{X} of bounded size.
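As a concrete (hypothetical) instance of such a function f: take elements from the nonnegative integers and multisets with at most N elements. Then f(x)=(N+1)^{-x} turns the sum into a base-(N+1) digit-count encoding, so digits never carry and distinct multisets get distinct sums. A minimal self-checking sketch:

from collections import Counter
from fractions import Fraction

N = 5  # bound on the multiset size

def f(x):
    # Element x becomes a "digit" in base N+1; with at most N copies of
    # any element, digits never carry, so the encoding is injective.
    return Fraction(1, (N + 1) ** x)

def multiset_sum(ms):
    return sum(f(x) for x in ms)

a, b = [1, 2, 2, 7], [1, 2, 7, 7]
assert Counter(a) != Counter(b)
assert multiset_sum(a) != multiset_sum(b)             # distinct multisets differ
assert multiset_sum([2, 1, 7, 2]) == multiset_sum(a)  # order is irrelevant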

Using this lemma, we know that given enough depth and width, there exists an MLP such that \sum_{x\in X}\text{MLP}(x)=\sum_{y\in Y}\text{MLP}(y)\Longleftrightarrow X=Y, i.e., the two multisets X and Y are equal. Now, from h^{(t+1)}(G,\hat{\boldsymbol{v}}) = h^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) and Eq.(31), we know there exist MLPs inside Eq.(29) and Eq.(31) such that

(1) h(t)(G,𝒗^)h^{(t)}(G,\hat{\boldsymbol{v}}) = h(t)(G,𝒗^)h^{(t)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})
(2) h(t+1/2)(G,𝒗^)h^{(t+1/2)}(G,\hat{\boldsymbol{v}}) = h(t+1/2)(G,𝒗^)h^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime})
(3) {{h(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}
(4) {{h(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t+1/2)}(G,\hat{\boldsymbol{u}})\ |\ \hat{\boldsymbol{u}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t+1/2)(G,𝒖^)|𝒖^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t+1/2)}(G^{\prime},\hat{\boldsymbol{u}}^{\prime})\ |\ \hat{\boldsymbol{u}}^{\prime}\in\mathcal{N}^{G^{\prime}}_{\text{left}}(\hat{\boldsymbol{v}}^{\prime})\}\mskip-5.0mu\}

where (3) and (4) are derived using the lemma above. Hence, following the same argument as in the forward direction, we can prove that \bm{swl}^{(t+1)}(G,\hat{\boldsymbol{v}}) = \bm{swl}^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}).

Combining the forward and backward directions, we have \bm{swl}^{(t+1)}(G,\hat{\boldsymbol{v}}) = \bm{swl}^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}) \Longleftrightarrow h^{(t+1)}(G,\hat{\boldsymbol{v}}) = h^{(t+1)}(G^{\prime},\hat{\boldsymbol{v}}^{\prime}). Hence, by induction, the statement holds for any step t and any \hat{\boldsymbol{v}},\hat{\boldsymbol{v}}^{\prime}, which shows that (k,c)(\leq)-SetGNN and (k,c)(\leq)-SetWL have the same expressivity. ∎

A.4.8 Proof of Theorem 7

Theorem 7.

For any t\in\mathbb{N}, the t-layer (k,c)(\leq)-SetGNN^{*} is more expressive than the t-layer (k,c)(\leq)-SetGNN. As t\to\infty, (k,c)(\leq)-SetGNN^{*} is as expressive as (k,c)(\leq)-SetGNN.

Proof.

We first prove that the t-layer (k,c)(\leq)-SetGNN^{*} is more expressive than the t-layer (k,c)(\leq)-SetGNN, by showing that if an m-set \hat{\boldsymbol{u}} has the same representation as another m-set \hat{\boldsymbol{v}} in the t-layer (k,c)(\leq)-SetGNN^{*}, then they also have the same representation in the t-layer (k,c)(\leq)-SetGNN. Let h denote representations inside (k,c)(\leq)-SetGNN^{*}, and g representations inside (k,c)(\leq)-SetGNN.

To simplify the proof, we first simplify the formulations of (k,c)(\leq)-SetGNN^{*} in Eq.(13) and Eq.(14) by removing unnecessary superscripts and subscripts of the MLPs, without introducing ambiguity. We then add another superscript to the embeddings to indicate the step inside the one-layer bidirectional propagation; notice that one layer of bidirectional propagation contains 2k-2 intermediate sequential steps. The proof assumes that all MLPs are injective. We rewrite Eq.(13) and Eq.(14) as follows:

m-set 𝒗^,h(t+12)(𝒗^):=h(t+12,km)(𝒗^)=MLP(h(t)(𝒗^),𝒘^𝒩rightG(𝒗^)MLP(h(t+12,km1)(𝒘^)))\displaystyle\forall m\text{-set }\hat{\boldsymbol{v}},h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}}):=h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}})=\text{MLP}\Big{(}h^{(t)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}(h^{{(t+\frac{1}{2},k-m-1)}}(\hat{\boldsymbol{w}}))\Big{)} (32)
m-set 𝒗^,h(t+1)(𝒗^):=h(t+1,m1)(𝒗^)=MLP(h(t+12,km)(𝒗^),𝒘^𝒩leftG(𝒗^)MLP(h(t+1,m2)(𝒘^)))\displaystyle\forall m\text{-set }\hat{\boldsymbol{v}},h^{(t+1)}(\hat{\boldsymbol{v}}):=h^{(t+1,m-1)}(\hat{\boldsymbol{v}})=\text{MLP}\Big{(}h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})}\text{MLP}(h^{{(t+1,m-2)}}(\hat{\boldsymbol{w}}))\Big{)} (33)

where the boundary cases are h^{(t+\frac{1}{2},0)}(\hat{\boldsymbol{u}})=h^{(t)}(\hat{\boldsymbol{u}}) and h^{(t+1,0)}(\hat{\boldsymbol{u}})=h^{(t+\frac{1}{2})}(\hat{\boldsymbol{u}}). The notation h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}}):=h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}}) indicates that this representation is computed at the (k-m)-th step of layer t+\frac{1}{2}.

We prove the theorem by induction on t. Specifically, we want to prove that for any t, \hat{\boldsymbol{v}}, \hat{\boldsymbol{u}}, h^{(t)}(\hat{\boldsymbol{u}})=h^{(t)}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t)}(\hat{\boldsymbol{u}})=g^{(t)}(\hat{\boldsymbol{v}}) and h^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=h^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=g^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}}). The base case is easy to verify, as the initialization steps of the two models are defined identically.

Now assume that for all t\leq l we have, for any \hat{\boldsymbol{v}}, \hat{\boldsymbol{u}}, h^{(t)}(\hat{\boldsymbol{u}})=h^{(t)}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t)}(\hat{\boldsymbol{u}})=g^{(t)}(\hat{\boldsymbol{v}}) and h^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=h^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=g^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}}). We first prove that for t=l+1, h^{(l+\frac{1}{2})}(\hat{\boldsymbol{u}})=h^{(l+\frac{1}{2})}(\hat{\boldsymbol{v}})\Longrightarrow g^{(l+\frac{1}{2})}(\hat{\boldsymbol{u}})=g^{(l+\frac{1}{2})}(\hat{\boldsymbol{v}}).

First, Eq.(32) can be rewritten as

\displaystyle h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}}):=h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}})=\text{MLP}\Big{(}h^{(t)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}(h^{{(t+\frac{1}{2},k-m-1)}}(\hat{\boldsymbol{w}}))\Big{)} (34)
=MLP(h(t)(𝒗^),𝒘^𝒩rightG(𝒗^)MLP(MLP(h(t)(𝒘^)),𝒑^𝒩rightG(𝒘^)MLP(h(t+12,km2)(𝒑^))))\displaystyle=\text{MLP}\Big{(}h^{(t)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}(\text{MLP}(h^{(t)}(\hat{\boldsymbol{w}})),\sum_{\hat{\boldsymbol{p}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{w}})}\text{MLP}(h^{{(t+\frac{1}{2},k-m-2)}}(\hat{\boldsymbol{p}})))\Big{)} (35)

Then h^{(l+\frac{1}{2})}(\hat{\boldsymbol{v}})=h^{(l+\frac{1}{2})}(\hat{\boldsymbol{u}})\Longrightarrow\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}(h^{(l)}(\hat{\boldsymbol{w}}))=\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{u}})}\text{MLP}(h^{(l)}(\hat{\boldsymbol{w}})). Hence we can find a bijective mapping f between \mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}}) and \mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{u}}) such that \text{MLP}(h^{(l)}(\hat{\boldsymbol{w}}))=\text{MLP}(h^{(l)}(f(\hat{\boldsymbol{w}}))). Then, by the inductive hypothesis, \text{MLP}(g^{(l)}(\hat{\boldsymbol{w}}))=\text{MLP}(g^{(l)}(f(\hat{\boldsymbol{w}}))) for all \hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}}). This implies that \sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{v}})}\text{MLP}(g^{(l)}(\hat{\boldsymbol{w}}))=\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{right}}(\hat{\boldsymbol{u}})}\text{MLP}(g^{(l)}(\hat{\boldsymbol{w}})), or equivalently g^{(l+\frac{1}{2})}(\hat{\boldsymbol{v}})=g^{(l+\frac{1}{2})}(\hat{\boldsymbol{u}}).

Now assume that for all t\leq l we have, for any \hat{\boldsymbol{v}}, \hat{\boldsymbol{u}}, h^{(t)}(\hat{\boldsymbol{u}})=h^{(t)}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t)}(\hat{\boldsymbol{u}})=g^{(t)}(\hat{\boldsymbol{v}}), and for all t\leq l+1, h^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=h^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=g^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}}). We prove that for t=l+1, h^{(l+1)}(\hat{\boldsymbol{u}})=h^{(l+1)}(\hat{\boldsymbol{v}})\Longrightarrow g^{(l+1)}(\hat{\boldsymbol{u}})=g^{(l+1)}(\hat{\boldsymbol{v}}).

We first rewrite Eq.(33) as

h(t+1)(𝒗^):=h(t+1,m1)(𝒗^)=MLP(h(t+12,km)(𝒗^),𝒘^𝒩leftG(𝒗^)MLP(h(t+1,m2)(𝒘^)))\displaystyle h^{(t+1)}(\hat{\boldsymbol{v}}):=h^{(t+1,m-1)}(\hat{\boldsymbol{v}})=\text{MLP}\Big{(}h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})}\text{MLP}(h^{{(t+1,m-2)}}(\hat{\boldsymbol{w}}))\Big{)} (36)
=MLP(h(t+12,km)(𝒗^),𝒘^𝒩leftG(𝒗^)MLP(MLP(h(t+12,km+1)(𝒘^),𝒑^𝒩leftG(𝒘^)MLP(h(t+1,m3)(𝒑^)))))\displaystyle=\text{MLP}\Bigg{(}h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}}),\sum_{\hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})}\text{MLP}\Big{(}\text{MLP}\big{(}h^{(t+\frac{1}{2},k-m+1)}(\hat{\boldsymbol{w}}),\sum_{\hat{\boldsymbol{p}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{w}})}\text{MLP}(h^{{(t+1,m-3)}}(\hat{\boldsymbol{p}}))\big{)}\Big{)}\Bigg{)} (37)

Then h(l+1)(𝒖^)=h(l+1)(𝒗^)h^{(l+1)}(\hat{\boldsymbol{u}})=h^{(l+1)}(\hat{\boldsymbol{v}}) implies

1) h(t+12,km)(𝒗^)=h(t+12,km)(𝒖^)h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{v}})=h^{(t+\frac{1}{2},k-m)}(\hat{\boldsymbol{u}}) or equivalently h(t+12)(𝒗^)=h(t+12)(𝒖^)h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}})=h^{(t+\frac{1}{2})}(\hat{\boldsymbol{u}})
2) {{h(t+12)(𝒘^)|𝒘^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t+\frac{1}{2})}(\hat{\boldsymbol{w}})\ |\ \hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t+12)(𝒘^)|𝒘^𝒩leftG(𝒖^)}}\{\mskip-5.0mu\{h^{(t+\frac{1}{2})}(\hat{\boldsymbol{w}})\ |\ \hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{u}})\}\mskip-5.0mu\}

Also by Eq.32 we know that h(t+12)(𝒗^)=h(t+12)(𝒖^)h^{(t+\frac{1}{2})}(\hat{\boldsymbol{v}})=h^{(t+\frac{1}{2})}(\hat{\boldsymbol{u}}) implies

3) h(t)(𝒗^)=h(t)(𝒖^)h^{(t)}(\hat{\boldsymbol{v}})=h^{(t)}(\hat{\boldsymbol{u}}).

Combining 2) and 3), we know

4) {{h(t)(𝒘^)|𝒘^𝒩leftG(𝒗^)}}\{\mskip-5.0mu\{h^{(t)}(\hat{\boldsymbol{w}})\ |\ \hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{v}})\}\mskip-5.0mu\} = {{h(t)(𝒘^)|𝒘^𝒩leftG(𝒖^)}}\{\mskip-5.0mu\{h^{(t)}(\hat{\boldsymbol{w}})\ |\ \hat{\boldsymbol{w}}\in\mathcal{N}^{G}_{\text{left}}(\hat{\boldsymbol{u}})\}\mskip-5.0mu\}

Now, using our inductive hypothesis, we can transform 1) 2) 3) 4) by replacing h with g. Then, based on Eq.(11), we know g^{(l+1)}(\hat{\boldsymbol{u}})=g^{(l+1)}(\hat{\boldsymbol{v}}).

Combining the two proved inductive steps and applying them alternately, we know that for any t, \hat{\boldsymbol{v}}, \hat{\boldsymbol{u}}, h^{(t)}(\hat{\boldsymbol{u}})=h^{(t)}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t)}(\hat{\boldsymbol{u}})=g^{(t)}(\hat{\boldsymbol{v}}) and h^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=h^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}})\Longrightarrow g^{(t-\frac{1}{2})}(\hat{\boldsymbol{u}})=g^{(t-\frac{1}{2})}(\hat{\boldsymbol{v}}). Hence the t-layer (k,c)(\leq)-SetGNN^{*} is more expressive than the t-layer (k,c)(\leq)-SetGNN.

Now consider t\to\infty. All representations become stable, with h^{(t)}(\hat{\boldsymbol{v}})=h^{(t)}(\hat{\boldsymbol{u}})\Longleftrightarrow h^{(t+1)}(\hat{\boldsymbol{v}})=h^{(t+1)}(\hat{\boldsymbol{u}}) and h^{(t-1/2)}(\hat{\boldsymbol{v}})=h^{(t-1/2)}(\hat{\boldsymbol{u}})\Longleftrightarrow h^{(t+1/2)}(\hat{\boldsymbol{v}})=h^{(t+1/2)}(\hat{\boldsymbol{u}}). Notice that for a single set \hat{\boldsymbol{v}}, the information used from its neighbors is the same in both (k,c)(\leq)-SetGNN^{*} and (k,c)(\leq)-SetGNN. Hence the equilibrium equations for set \hat{\boldsymbol{v}} are the same in both models, and they end up with the same stable representations. ∎

A.4.9 Proof of Theorem 8

Theorem 8.

Let G be a graph with c connected components C_{1},...,C_{c}, and G^{\prime} be a graph also with c connected components C^{\prime}_{1},...,C^{\prime}_{c}. Then G and G^{\prime} are isomorphic if and only if there exists a permutation p:[c]\rightarrow[c] such that \forall i\in[c], C_{i} and C^{\prime}_{p(i)} are isomorphic.

Proof.

Right \Longrightarrow Left. Let h_{i}:V(C_{i})\rightarrow V(C^{\prime}_{p(i)}) be an isomorphism from C_{i} to C^{\prime}_{p(i)} for i\in[c]. Then we can create a new mapping h:V(G)\rightarrow V(G^{\prime}) such that for any x\in V(G), it first locates the component containing x, say C_{i}, and then applies h_{i} to x. Clearly h is an isomorphism between G and G^{\prime}, hence G and G^{\prime} are isomorphic.

Left \Longrightarrow Right. If G and G^{\prime} are isomorphic, then there exists an isomorphism h from G to G^{\prime}. For a specific component C_{i} of G, h maps V(C_{i}) to a node set inside V(G^{\prime}); call it B_{i}. For i\not=j, B_{i}\cap B_{j}=\emptyset and there are no edges e\in E(G^{\prime}) between B_{i} and B_{j}; otherwise there would be a corresponding edge (apply h^{-1}) between V(C_{i}) and V(C_{j}), which is impossible. Hence \{G^{\prime}[B_{i}]\}_{i=1}^{c} are pairwise-disconnected subgraphs. As h also preserves adjacency, for every i, C_{i} and G^{\prime}[B_{i}] are isomorphic. This proves the right side. ∎

A.4.10 Conjecture that k-MWL is equivalent to k-WL

Proof of t,wlk(t)(G,v~)=wlk(t)(G,v~)mwlk(t)(G,v~)=mwlk(t)(G,v~)\forall t,\bm{wl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime})\Longleftarrow\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime})

Proof.

We first take a closer look at the condition of 𝒎𝒘𝒍k(t+1)(G,𝒗~)=𝒎𝒘𝒍k(t+1)(G,𝒗~)\bm{mwl}_{k}^{(t+1)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t+1)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}). By Eq.(3) we know this is equivalent to
(1) \bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}) and (2) \bigg{\{}\mskip-10.0mu\bigg{\{}\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/o^{-1}_{G}(\tilde{\boldsymbol{v}},i)])\big{|}x\in V(G)\}\mskip-5.0mu\}\ |\ i=1,...,k\bigg{\}}\mskip-10.0mu\bigg{\}} = \bigg{\{}\mskip-10.0mu\bigg{\{}\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}[x/o^{-1}_{G^{\prime}}(\tilde{\boldsymbol{v}}^{\prime},i)])\big{|}x\in V(G^{\prime})\}\mskip-5.0mu\}\ |\ i=1,...,k\bigg{\}}\mskip-10.0mu\bigg{\}}. (2) can be rewritten as \bigg{\{}\mskip-10.0mu\bigg{\{}\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\big{|}x\in V(G)\}\mskip-5.0mu\}\ |\ y\in\tilde{\boldsymbol{v}}\bigg{\}}\mskip-10.0mu\bigg{\}} = \bigg{\{}\mskip-10.0mu\bigg{\{}\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}[x/\text{idx}_{\tilde{\boldsymbol{v}}^{\prime}}(y)])\big{|}x\in V(G^{\prime})\}\mskip-5.0mu\}\ |\ y\in\tilde{\boldsymbol{v}}^{\prime}\bigg{\}}\mskip-10.0mu\bigg{\}}, which is equivalent to (3) there exists a bijective f:V(G)\rightarrow V(G^{\prime}) such that \forall y\in\tilde{\boldsymbol{v}}, \{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])\big{|}x\in V(G)\}\mskip-5.0mu\}=\{\mskip-5.0mu\{\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}[x/\text{idx}_{\tilde{\boldsymbol{v}}^{\prime}}(f(y))])\big{|}x\in V(G^{\prime})\}\mskip-5.0mu\}.

Define F(t+1)(G,G,𝒗~,𝒗~):={ injective f:𝒗~𝒗~|fF(t)(G,G,𝒗~,𝒗~), AND ,y𝒗~,hy:V(G)V(G),x,fF(t)(G,G,𝒗~[x/idx𝒗~(y)],𝒗~[hy(x)/idx𝒗~(f(y))])}F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}):=\{\text{ injective }f:\tilde{\boldsymbol{v}}\rightarrow\tilde{\boldsymbol{v}}^{\prime}\ |\ f\in F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}),\text{ AND },\forall y\in\tilde{\boldsymbol{v}},\exists h_{y}:V(G)\rightarrow V(G^{\prime}),\forall x,f\in F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)],\tilde{\boldsymbol{v}}^{\prime}[h_{y}(x)/\text{idx}_{\tilde{\boldsymbol{v}}^{\prime}}(f(y))])\}. Let F(0)(G,G,𝒗~,𝒗~):={f|f is an isomorphism from G[𝒗~] to G[𝒗~]}F^{(0)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}):=\{f\ |\ f\text{ is an isomorphism from }G[\tilde{\boldsymbol{v}}]\text{ to }G^{\prime}[\tilde{\boldsymbol{v}}^{\prime}]\}.

Lemma 5.

t\forall t, 𝐦𝐰𝐥k(t)(G,𝐯~)=𝐦𝐰𝐥k(t)(G,𝐯~)\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}) if and only if F(t)(G,G,𝐯~,𝐯~)F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime})\not=\emptyset.

Notice that this lemma is only conjectured to be true; it still requires a proof.

Lemma 6.

\forall t,\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime})\Longrightarrow\forall f\in F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}),\bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G^{\prime},f(\overrightarrow{\boldsymbol{v}}))

Proof of Lemma 6: We prove it by induction on t. When t=0, F^{(0)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}) contains all isomorphisms between G[\tilde{\boldsymbol{v}}] and G^{\prime}[\tilde{\boldsymbol{v}}^{\prime}], hence the right side holds. Assume the statement holds for all iterations up to t. For the t+1 case, the left side implies (1) \bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}) and (2) F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime})\not=\emptyset, \forall f\in F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}), \forall y\in\tilde{\boldsymbol{v}}, \exists h_{y}:V(G)\rightarrow V(G^{\prime}), \forall x\in V(G), \bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)])=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}[h_{y}(x)/\text{idx}_{\tilde{\boldsymbol{v}}^{\prime}}(f(y))]). By the induction hypothesis, (1) and F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime})\subseteq F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}) imply (a) \forall f\in F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}),\bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G^{\prime},f(\overrightarrow{\boldsymbol{v}})); (2) together with \forall y\in\tilde{\boldsymbol{v}}, f\in F^{(t)}(G,G^{\prime},\tilde{\boldsymbol{v}}[x/\text{idx}_{\tilde{\boldsymbol{v}}}(y)],\tilde{\boldsymbol{v}}^{\prime}[h_{y}(x)/\text{idx}_{\tilde{\boldsymbol{v}}^{\prime}}(f(y))]) implies (b) \exists h, \forall x\in V(G), \bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}}[x/\text{idx}_{\overrightarrow{\boldsymbol{v}}}(y)])=\bm{wl}_{k}^{(t)}(G^{\prime},f(\overrightarrow{\boldsymbol{v}})[h(x)/\text{idx}_{f(\overrightarrow{\boldsymbol{v}})}(f(y))]), which can be rewritten as \forall y\in\tilde{\boldsymbol{v}}, \{\mskip-5.0mu\{\bm{wl}_{k}^{(t)}(G,\overrightarrow{\boldsymbol{v}}[x/\text{idx}_{\overrightarrow{\boldsymbol{v}}}(y)]\ |\ x\in V(G))\}\mskip-5.0mu\} = \{\mskip-5.0mu\{\bm{wl}_{k}^{(t)}(G^{\prime},f(\overrightarrow{\boldsymbol{v}})[x/\text{idx}_{f(\overrightarrow{\boldsymbol{v}})}(f(y))]\ |\ x\in V(G^{\prime}))\}\mskip-5.0mu\}. Now, applying Eq.(1) with (a) and (b), we can derive that \forall f\in F^{(t+1)}(G,G^{\prime},\tilde{\boldsymbol{v}},\tilde{\boldsymbol{v}}^{\prime}), \bm{wl}_{k}^{(t+1)}(G,\overrightarrow{\boldsymbol{v}})=\bm{wl}_{k}^{(t+1)}(G^{\prime},f(\overrightarrow{\boldsymbol{v}})). Hence the statement holds for any t.

Using Lemma 6 and the conclusion (\forall p\in\text{perm}[k], \bm{wl}_{k}^{(t+1)}(G,p(\overrightarrow{\boldsymbol{v}}))=\bm{wl}_{k}^{(t+1)}(G^{\prime},p(f(\overrightarrow{\boldsymbol{v}})))) proved in the first part of the proof of Theorem 1, we have proved \forall t,\bm{mwl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{mwl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime})\Longrightarrow\bm{wl}_{k}^{(t)}(G,\tilde{\boldsymbol{v}})=\bm{wl}_{k}^{(t)}(G^{\prime},\tilde{\boldsymbol{v}}^{\prime}). ∎

A.5 Algorithm of extracting (k,c)()(k,c)(\leq) sets and constructing the super-graph of (k,c)()(k,c)(\leq)-SetWL

Here we present the algorithm for constructing the supernodes ((t+1,c)(\leq) sets) from the (t,c)(\leq) sets, together with the bipartite graph between the (t,c)(\leq) sets and the (t+1,c)(\leq) sets. Note that the algorithm below is pseudocode and uses for-loops for ease of presentation; it can easily be parallelized (removing the for-loops) using matrix operations.

Algorithm 1 Constructing supernodes (t+1,c)()(t+1,c)(\leq) sets and the bipartite graph between tt and t+1t+1
Input: graph G, the list O_{t,\leq c} containing all supernodes with t nodes and at most c components, and the list N_{t,\leq c} with N_{t,\leq c}[i] being the number of components of G[O_{t,\leq c}[i]].
Output: O_{t+1,\leq c}, N_{t+1,\leq c}, and B_{t} containing the edges between O_{t,\leq c} and O_{t+1,\leq c}.
Ot+1,c,Nt+1,c,Bt[],[],[]O_{t+1,\leq c},N_{t+1,\leq c},B_{t}\leftarrow[],[],[]
for all  𝒔^,n\hat{\boldsymbol{s}},n in zip( Ot,c,Nt,cO_{t,\leq c},N_{t,\leq c}do
     𝒩1\mathcal{N}_{1}\leftarrow 11-st hop neighbors of G[𝒔^]G[\hat{\boldsymbol{s}}]
     𝒩>1\mathcal{N}_{>1}\leftarrow (V(G)𝒔^)𝒩1(V(G)\setminus\hat{\boldsymbol{s}})\setminus\mathcal{N}_{1}
     for all mm in 𝒩1\mathcal{N}_{1} do \triangleright Number of components doesn’t change.
         𝒒^𝒔^+{m}\hat{\boldsymbol{q}}\leftarrow\hat{\boldsymbol{s}}+\{m\}
         Append 𝒒^\hat{\boldsymbol{q}} into Ot+1,cO_{t+1,\leq c}, nn into Nt+1,cN_{t+1,\leq c}
         Append edge (𝒔^,𝒒^)(\hat{\boldsymbol{s}},\hat{\boldsymbol{q}}) into BtB_{t} \triangleright We change the edge to (𝒔^,𝒒^,m)(\hat{\boldsymbol{s}},\hat{\boldsymbol{q}},m) for Algorithm.2
     end for
     if n < c then \triangleright Creating an additional component.
         for all mm in 𝒩>1\mathcal{N}_{>1} do
              𝒒^𝒔^+{m}\hat{\boldsymbol{q}}\leftarrow\hat{\boldsymbol{s}}+\{m\}
              Append 𝒒^\hat{\boldsymbol{q}} into Ot+1,cO_{t+1,\leq c}, n+1n+1 into Nt+1,cN_{t+1,\leq c}
              Append edge (𝒔^,𝒒^)(\hat{\boldsymbol{s}},\hat{\boldsymbol{q}}) into BtB_{t} \triangleright We change the edge to (𝒔^,𝒒^,m)(\hat{\boldsymbol{s}},\hat{\boldsymbol{q}},m) for Algorithm.2
         end for
     end if
end for
Remove repeated elements inside Ot+1,cO_{t+1,\leq c}, with corresponding mask MM
Apply mask MM to Nt+1,cN_{t+1,\leq c}
return Ot+1,cO_{t+1,\leq c}, Nt+1,cN_{t+1,\leq c}, BtB_{t}

To get the full supergraph S_{k,c\text{-swl}}, we apply Algorithm 1 k-1 times sequentially; a sketch of one such application is given below.
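The following is a plain-Python sketch of one application of Algorithm 1, mirroring the pseudocode above. Names such as grow_supernodes and the dict-based supergraph encoding are our own assumptions; the released code replaces these loops with matrix operations.

def grow_supernodes(adj, level, c_max):
    # One application of Algorithm 1.
    # adj:   dict mapping each node of G to the set of its neighbors.
    # level: dict mapping each supernode with t nodes (a frozenset) to the
    #        number of connected components of its induced subgraph.
    # Returns the (t+1)-node supernodes with their component counts, and the
    # bipartite edges (s, q, m) between the two levels.
    next_level, edges = {}, []
    for s, n in level.items():
        hop1 = set().union(*(adj[v] for v in s)) - s  # N_1: 1-hop neighbors of G[s]
        far = set(adj) - s - hop1                     # N_{>1}: all remaining nodes
        candidates = [(m, n) for m in hop1]           # component count stays at n
        if n < c_max:                                 # room for one more component
            candidates += [(m, n + 1) for m in far]
        for m, n_q in candidates:
            q = s | {m}
            next_level.setdefault(q, n_q)             # dedup repeated supernodes
            edges.append((s, q, m))
    return next_level, edges

Starting from level = {frozenset({v}): 1 for v in adj} and applying grow_supernodes k-1 times yields all (k,c)(\leq) sets together with the bipartite edges between consecutive levels.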

A.6 Algorithm of connecting supernodes to their connected components

In Sec. 4.3 we proved that for a supernode whose induced subgraph has multiple components (call it a multi-component supernode), the color initialization can be greatly sped up if we know the supernode corresponding to each of its components. Hence we need to build the bipartite graph (call it the components graph) between single-component supernodes and multi-component supernodes. We present an algorithm that constructs these connections sequentially: given the connections between single-component supernodes and all multi-component supernodes with at most t nodes, it builds the connections between single-component supernodes and all multi-component supernodes with at most t+1 nodes. To build the full bipartite graph for the (k,c)(\leq) sets, we run the algorithm k-1 times sequentially.

Algorithm 2 Constructing the bipartite graph between all single-component supernodes and all multi-component supernodes with at most t+1 nodes
Input: \{B_{i}\}_{i=1}^{t}, \{O_{i,\leq c}\}_{i=1}^{t+1}, \{N_{i,\leq c}\}_{i=1}^{t+1} from Algorithm 1 (with the modified edge format (\hat{\boldsymbol{s}},\hat{\boldsymbol{q}},m) noted in its comments), and a dictionary D_{\leq t} whose keys are the multi-component supernodes with at most t nodes and whose values are the lists of their corresponding single-component supernodes.
Output: D_{\leq t+1}
Dt+1DtD_{\leq t+1}\leftarrow D_{\leq t}
for all 𝒒^\hat{\boldsymbol{q}} in Ot+1,cO_{t+1,\leq c} do
     Get all edges T=\{(\hat{\boldsymbol{s}}_{i},\hat{\boldsymbol{q}},m_{i})\} in B_{t} that contain \hat{\boldsymbol{q}}, and let j\leftarrow\arg\max_{i}N_{t,\leq c}[\hat{\boldsymbol{s}}_{i}]
     if  N_{t,\leq c}[\hat{\boldsymbol{s}}_{j}]==N_{t+1,\leq c}[\hat{\boldsymbol{q}}] then \triangleright Number of components does not increase
         Dt+1[𝒒^][]D_{\leq t+1}[\hat{\boldsymbol{q}}]\leftarrow[]
         for all 𝒑^\hat{\boldsymbol{p}} in Dt[𝒔^j]D_{\leq t}[\hat{\boldsymbol{s}}_{j}] do
              Assume k=|𝒑^|k=|\hat{\boldsymbol{p}}|, i.e. the number of nodes inside.
              Traverse BkB_{k} to find the supernode (𝒑^,mj)(\hat{\boldsymbol{p}},m_{j})
              if Found it and Nk+1,c[(𝒑^,mj)]==1N_{k+1,\leq c}[(\hat{\boldsymbol{p}},m_{j})]==1 then
                  Append (𝒑^,mj)(\hat{\boldsymbol{p}},m_{j}) to Dt+1[𝒒^]D_{\leq t+1}[\hat{\boldsymbol{q}}]
              else
                  Append 𝒑^\hat{\boldsymbol{p}} to Dt+1[𝒒^]D_{\leq t+1}[\hat{\boldsymbol{q}}]
              end if
         end for
     else\triangleright Nt,c[𝒔^j]=Nt+1,c[𝒒^]1N_{t,\leq c}[\hat{\boldsymbol{s}}_{j}]=N_{t+1,\leq c}[\hat{\boldsymbol{q}}]-1
         Dt+1[𝒒^]Dt[𝒔^j]D_{\leq t+1}[\hat{\boldsymbol{q}}]\leftarrow D_{\leq t}[\hat{\boldsymbol{s}}_{j}] + [mj][m_{j}]
     end if
end for
return Dt+1D_{\leq t+1}

For clarity of presentation we use for-loops inside Algorithm 2, while in practice we use matrix operations to eliminate them. A naive cross-check of its output is sketched below.
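As a sanity check for Algorithm 2, the components graph can also be computed naively by decomposing each multi-component supernode directly with a BFS. A self-contained sketch (with hypothetical names of our own) follows:

from collections import deque

def connected_components(adj, nodes):
    # BFS decomposition of the subgraph of G induced on `nodes`.
    nodes, comps = set(nodes), []
    while nodes:
        seed = nodes.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            for w in adj[queue.popleft()] & nodes:
                nodes.remove(w)
                comp.add(w)
                queue.append(w)
        comps.append(frozenset(comp))
    return comps

def components_graph(adj, supernodes):
    # Map every multi-component supernode to its single-component parts
    # (the dictionary D that Algorithm 2 builds incrementally).
    out = {}
    for s in supernodes:
        comps = connected_components(adj, s)
        if len(comps) > 1:
            out[s] = comps
    return out

This direct decomposition redoes a BFS per supernode, which is exactly the cost Algorithm 2 avoids by reusing the connections already built for smaller supernodes.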

A.7 Dataset Details

Table 6: Dataset statistics.
Dataset Task # Cls./Tasks # Graphs Avg. # Nodes Avg. # Edges
EXP [1] Distinguish 1-WL failed graphs 2 1200 44.4 110.2
SR25 [6] Distinguish 3-WL failed graphs 15 15 25 300
CountingSub. [15] Regress num. of substructures 4 1500 / 1000 / 2500 18.8 62.6
GraphProp. [17] Regress global graph properties 3 5120 / 640 / 1280 19.5 101.1
ZINC-12K [19] Regress molecular property 1 10000 / 1000 / 1000 23.1 49.8
QM9 [59] Regress molecular properties 19* 130831 18.0 37.3
*We use the version of the dataset from PyG [20], and it contains 19 tasks.

A.8 Experimental Setup

Due to limited time and resources, we heavily restrict the hyperparameter search and fix most hyperparameters to the same values across all models and baselines to ensure a fair comparison. This means the performance of (k,c)(\leq)-SetGNN reported in the paper is not its best achievable performance, since we did not tune hyperparameters extensively. Nevertheless, the performance still reflects the theory and designs proposed in the paper, and we leave pushing (k,c)(\leq)-SetGNN to state-of-the-art performance to future work. Concretely, we fix the batch size to 128, the hidden size to 128, the number of layers of the Base GNN to 4, and the number of layers of (k,c)(\leq)-SetGNN (the number of iterations of (k,c)(\leq)-SetWL) to 2 (see the ablation study in Sec. A.10). This hyperparameter configuration is used for all datasets. We ran the baseline GINE over many datasets, tuning its number of layers in [4,6] and keeping all other hyperparameters the same as for (k,c)(\leq)-SetGNN. For all other baselines, we took the performance reported in [68].

For all datasets except QM9, we follow the same configuration used in [68]. For QM9, we use the dataset from PyG and regress all 19 targets simultaneously. To balance the scale of the targets, we preprocess the dataset by standardizing every target to mean 0 and standard deviation 1. The dataset is randomly split into train/validation/test sets with ratio 80%/10%/10% (with a fixed random state so that all runs and models use the same split). Every graph in QM9 contains 3D positional coordinates for all nodes; for all models, we use them to augment the edge features with the absolute difference between the coordinates of the two endpoints of each edge. Notice that our goal is not to achieve SOTA performance on QM9 but mainly to verify our theory and the effectiveness of our designs. A sketch of this preprocessing is given below.
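The sketch below illustrates the target standardization, the fixed-seed 80/10/10 split, and the coordinate-based edge-feature augmentation using PyG's QM9 interface. Exact attribute access may differ across PyG versions, and this is an assumed reconstruction rather than the released pipeline.

import torch
from torch_geometric.datasets import QM9

dataset = QM9(root='data/QM9')

# Standardize all 19 targets to mean 0, std 1.
y = dataset.data.y                                  # shape [num_graphs, 19]
dataset.data.y = (y - y.mean(dim=0)) / y.std(dim=0)

# Fixed-seed random 80%/10%/10% split shared by all runs and models.
n = len(dataset)
perm = torch.randperm(n, generator=torch.Generator().manual_seed(0))
train_idx = perm[:int(0.8 * n)]
val_idx = perm[int(0.8 * n):int(0.9 * n)]
test_idx = perm[int(0.9 * n):]

def augment_edges(data):
    # Append |pos_u - pos_v| (3 dims) to every edge's feature vector.
    src, dst = data.edge_index
    diff = (data.pos[src] - data.pos[dst]).abs()
    data.edge_attr = torch.cat([data.edge_attr, diff], dim=-1)
    return data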

We use Batch Normalization and ReLU activation in all models, and the Adam optimizer with learning rate 0.001 in all experiments. We repeat every experiment three times (with different random initializations) to compute the mean and standard deviation. All experiments are conducted on V100 and RTX-A6000 GPUs.

A.9 Graph Property Dataset Results

Table 7: Train and Test performances of (k,c)()(k,c)(\leq)-SetGNN on regressing graph properties by varying kk and cc.
Regressing Graph Properties (log10\log_{10}(MSE))
Is Connected Diameter Radius
k c Train Test Train Test Train Test
2 1 -4.2266 ± 0.1222 -2.9577 ± 0.1295 -4.0347 ± 0.0468 -3.6322 ± 0.0458 -4.4690 ± 0.0348 -4.9436 ± 0.0277
3 1 -4.2360 ± 0.1854 -3.4631 ± 0.6392 -4.0228 ± 0.1256 -3.7885 ± 0.0589 -4.4762 ± 0.1176 -5.0245 ± 0.0881
4 1 -4.7776 ± 0.0386 -4.9941 ± 0.0913 -4.1396 ± 0.0442 -4.0122 ± 0.0071 -4.2837 ± 0.5880 -4.1528 ± 0.9383
2 2 -4.6623 ± 0.3170 -4.7848 ± 0.3150 -4.0802 ± 0.1654 -3.8962 ± 0.0124 -4.5362 ± 0.2012 -5.1603 ± 0.0610
3 2 -4.2601 ± 0.3192 -4.4547 ± 1.1715 -4.3235 ± 0.3050 -3.9905 ± 0.0799 -4.6766 ± 0.1797 -4.9836 ± 0.0658
4 2 -4.8489 ± 0.1354 -5.4667 ± 0.2125 -4.5033 ± 0.1610 -3.9495 ± 0.3202 -4.4130 ± 0.2686 -4.1432 ± 0.4405

A.10 Ablation Study over Number of Layers

Table 8: Ablation study of (k,c)()(k,c)(\leq)-SetGNN over number of bidirectional message passing layers (L) on ZINC dataset.
L 2 4 6
k c Train Test Train Test Train Test
2 1 0.1381 ± 0.0240 0.2345 ± 0.0131 0.1135 ± 0.0418 0.1921 ± 0.0133 0.0712 ± 0.0015 0.1729 ± 0.0134
3 1 0.1172 ± 0.0063 0.2252 ± 0.0030 0.0792 ± 0.0190 0.1657 ± 0.0035 0.0692 ± 0.0118 0.1679 ± 0.0061
4 1 0.0693 ± 0.0111 0.1636 ± 0.0052 0.0700 ± 0.0085 0.1566 ± 0.0101 0.0768 ± 0.0116 0.1572 ± 0.0051

The table shows that increasing the number of bidirectional message passing layers (t in Eq.(13) and Eq.(14)) generally improves test performance, consistent with the fact that adding layers never decreases expressivity.

A.11 Computational Footprint on QM9

Figure 5: (k,c)(\leq)-SetGNN's computational footprint scales with both k and c in terms of (a) memory usage and (b) runtime. Solid blue, orange and green lines track scaling as k increases, when running (k,c)(\leq)-SetGNN on the QM9 dataset with c=1, 2 and 3 respectively.

A.12 Discussion of k-Bipartite Message Passing

Figure 6: k-bipartite message passing recovers many well-known GNNs: message passing based GNNs [25], edge-enhanced GNNs [7], and line-graph GNNs [14, 16]. The line graph is marked with a red dashed frame.

Interestingly, the k-bipartite bidirectional message passing is very general, and variations of it recover many well-known GNNs. Consider using only the 1-sets and the 2-sets with a single connected component (the red frame inside Figure 6), and initializing their embeddings with the original node and edge representations; this then recovers the following (see the sketch after this list):

  • Message passing based GNNs [25]: use the sequential message passing defined in Eq.(13) and Eq.(14), performing the forward step first and then the backward step for all 1-sets.

  • Line graph based GNNs [14, 16]: use the sequential message passing defined in Eq.(13) and Eq.(14), performing the backward step first and then the forward step for all 2-sets.

  • Relational Graph Networks [7] or edge-enhanced GNNs [16]: perform bidirectional sequential message passing on all 1-sets and 2-sets.
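A minimal sketch of the shared building block, assuming dense 0/1 incidence matrices between the 1-sets (nodes) and the 2-sets (edges); BipartiteStep and its arguments are our own illustrative names, not the released API.

import torch
import torch.nn as nn

class BipartiteStep(nn.Module):
    # One direction of message passing between 1-sets (nodes) and 2-sets
    # (edges), the lowest bipartite level of the supergraph.
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h_dst, h_src, incidence):
        # incidence: dense 0/1 matrix of shape [num_dst, num_src]; it is the
        # node-edge incidence matrix or its transpose, depending on direction.
        return self.mlp(torch.cat([h_dst, incidence @ h_src], dim=-1))

# Running a forward step (1-sets -> 2-sets) and then a backward step
# (2-sets -> 1-sets) updates node states as in an MPNN; the reverse order,
# starting and ending at the 2-sets, updates edge states as in a line-graph GNN.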