Graph Neural Networks with Local Graph Parameters
Abstract
Various recent proposals increase the distinguishing power of Graph Neural Networks (GNNs) by propagating features between k-tuples of vertices. The distinguishing power of these "higher-order" GNNs is known to be bounded by the k-dimensional Weisfeiler-Leman (k-WL) test, yet their memory requirements limit their applicability. Other proposals infuse GNNs with local higher-order graph structural information from the start, hereby inheriting the desirable memory requirement from GNNs at the cost of a one-time, possibly non-linear, preprocessing step. We propose local graph parameter enabled GNNs as a framework for studying the latter kind of approaches and precisely characterize their distinguishing power, in terms of a variant of the 1-WL test, and in terms of the graph structural properties that they can take into account. Local graph parameters can be added to any GNN architecture and are cheap to compute. In terms of expressive power, our proposal lies in the middle of GNNs and their higher-order counterparts. Further, we propose several techniques to aid in choosing the right local graph parameters. Our results connect with deep results in finite model theory and finite variable logics. Our experimental evaluation shows that adding local graph parameters often has a positive effect for a variety of GNNs, datasets and graph learning tasks.
1 Introduction
Context.
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009), and their important class of Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017), are among the most popular methods for graph learning tasks. Such GNNs use an iterative message passing scheme, based on the adjacency structure of the underlying graph, to compute vertex (and graph) embeddings in some real Euclidean space. The expressive (or discriminative) power of MPNNs is, however, rather limited (Xu et al., 2019; Morris et al., 2019). Indeed, MPNNs will always identically embed two vertices (graphs) when these vertices (graphs) cannot be distinguished by the one-dimensional Weisfeiler-Leman (1-WL) algorithm. Two graphs G and H and vertices v and w that cannot be distinguished by 1-WL (and thus by any MPNN) are shown in Fig. 1. The expressive power of 1-WL is well-understood (Cai et al., 1992; Arvind et al., 2020; Dell et al., 2018): basically, MPNNs can only use tree-based structural information in the graphs to distinguish vertices. As a consequence, no MPNN can detect that vertex v in Fig. 1 is part of a 3-clique, whereas w is not. Similarly, no MPNN can detect that v is part of a 4-cycle, whereas w is not. Further limitations of MPNNs in terms of graph properties can be found, e.g., in Fürer (2017); Arvind et al. (2020); Chen et al. (2020); Tahmasebi & Jegelka (2020).
To remedy the weak expressive power of MPNNs, so-called higher-order GNNs were proposed (Morris et al., 2019; Maron et al., 2019b; Morris et al., 2020), whose expressive power is measured in terms of the k-dimensional Weisfeiler-Leman procedures (k-WL) (Maron et al., 2019a; Chen et al., 2019a; Azizian & Lelarge, 2021; Geerts, 2020; Sato, 2020; Damke et al., 2020). In a nutshell, k-WL operates on k-tuples of vertices and allows one to distinguish vertices (graphs) based on structural information related to graphs of treewidth k (Dvorak, 2010; Dell et al., 2018). By definition, these tests form a hierarchy of increasing distinguishing power. As an example, 2-WL can detect that vertex v in Fig. 1 belongs to a 3-clique or a 4-cycle, since both have treewidth two. While more expressive than MPNNs, the GNNs based on k-WL require operations on all k-tuples in each iteration, where n is the number of vertices, hereby hampering their applicability.
A more practical approach is to extend the expressive power of MPNNs whilst preserving their cost in each iteration. Various such extensions (Kipf & Welling, 2017; Chen et al., 2019a; Li et al., 2019; Ishiguro et al., 2020; Bouritsas et al., 2020; Geerts et al., 2020) achieve this by infusing MPNNs with local graph structural information from the start. That is, the iterative message passing scheme of MPNNs is run on vertex labels that contain quantitative information about local graph structures. It is easy to see that such architectures can go beyond the 1-WL test: for example, adding triangle counts to the initial vertex labels suffices to distinguish the vertices v and w and the graphs G and H in Fig. 1. Moreover, the cost is a single preprocessing step to count local graph parameters, thus maintaining the per-iteration cost of the MPNNs. While there are some partial results showing that local graph parameters increase expressive power (Bouritsas et al., 2020; Li et al., 2019), their precise expressive power and relationship to higher-order GNNs was unknown, and there is little guidance on which local parameters help and which do not. The main contribution of this paper is a precise characterization of the expressive power of MPNNs with local graph parameters and of their relationship to the hierarchy of higher-order GNNs.
Our contributions.
In order to nicely formalize local graph parameters, we propose to extend vertex labels with homomorphism counts of small graph patterns. (We recall that homomorphisms are edge-preserving mappings between vertex sets.) More precisely, given graphs P and G, and vertices r in P and v in G, we add the number of homomorphisms from P to G that map r to v, denoted by hom(P^r, G^v), to the initial features of v. Such counts are both meaningful and efficiently computable. Indeed, homomorphism counts are known to measure the similarity of vertices and graphs (Lovász, 1967; Grohe, 2020a), and serve as a basis for the efficient computation of a number of other important graph parameters, e.g., subgraph and induced subgraph counts (Curticapean et al., 2017; Zhang et al., 2020). Furthermore, homomorphism counts underlie characterisations of the expressive power of MPNNs and higher-order GNNs. As an example, two vertices v and w in graphs G and H, respectively, are indistinguishable by 1-WL, and hence by MPNNs, precisely when hom(T^r, G^v) = hom(T^r, H^w) for every rooted tree T^r (Dvorak, 2010; Dell et al., 2018).
Concretely, we propose F-MPNNs, where F is a set of (rooted) graph patterns, obtained by (i) first allowing a pre-processing step that labels each vertex v of a graph G with the vector of homomorphism counts (hom(P^r, G^v))_{P^r ∈ F}, and (ii) then running an MPNN on this labelling. As such, we can turn any MPNN into an F-MPNN by simply augmenting the initial vertex embedding. Furthermore, several recently proposed extensions of MPNNs fit in this approach, including MPNNs extended with information about vertex degrees (Kipf & Welling, 2017), walk counts (Chen et al., 2019a), tree-based counts (Ishiguro et al., 2020) and subgraph counts (Bouritsas et al., 2020). Hence, F-MPNNs can also be regarded as a unifying theoretical formalism.
Our main results can be summarised as follows:
1. We precisely characterise the expressive power of F-MPNNs by means of an extension of 1-WL, denoted by F-WL. To do so, we use F-pattern trees, which are obtained from standard trees by joining an arbitrary number of copies of the patterns in F to each of their vertices. Our result states that vertices v and w in graphs G and H, respectively, are indistinguishable by F-WL, and hence by F-MPNNs, precisely when hom(T^r, G^v) = hom(T^r, H^w) for every rooted F-pattern tree T^r. This characterisation gracefully extends the characterisation for standard MPNNs mentioned earlier, which is recovered by setting F = ∅. Furthermore, F-MPNNs provide insights into the expressive power of existing extensions, most notably the Graph Structure Networks of Bouritsas et al. (2020).
2. We compare F-MPNNs to higher-order GNNs, which are characterized in terms of the k-WL test. On the one hand, while F-MPNNs strictly increase the expressive power of the 1-WL test, for any finite set F of patterns there is a k such that k-WL can distinguish graphs which F-MPNNs cannot. On the other hand, for each k there are patterns P such that {P}-MPNNs can distinguish graphs which k-WL cannot.
3. We deal with the technically challenging problem of pattern selection, comparing F-MPNNs based on the patterns included in F. We prove two partial results: one establishing when a pattern P in F is redundant, based on whether or not P is the join of other patterns in F, and another indicating when a pattern P does add expressive power, based on the treewidth of P compared to the treewidths of the other patterns in F.
4. Our theoretical results are complemented by an experimental study in which we show that, for various GNN architectures, datasets and graph learning tasks, all part of the recent benchmark by Dwivedi et al. (2020), augmenting initial features with homomorphism counts of graph patterns often has a positive effect, and computing these counts incurs little to no overhead.
As such, we believe that F-MPNNs not only provide an elegant theoretical framework for understanding local graph parameter enabled GNNs, but are also a valuable alternative to higher-order GNNs as a way to increase the expressive power of MPNNs. In addition, and as will be explained in Section 2, F-MPNNs provide a unifying framework for understanding the expressive power of several other existing extensions of MPNNs. Proofs of our results and further details on the relationship to existing approaches and experiments can be found in the appendix.
Related Work.
Works related to the distinguishing power of the 1-WL test and its higher-order variants are cited throughout the paper. Beyond distinguishability, GNNs have been analyzed in terms of universality and generalization properties (Maron et al., 2019c; Keriven & Peyré, 2019; Chen et al., 2019b; Garg et al., 2020), local distributed algorithms (Sato et al., 2019; Loukas, 2020), randomness in features (Sato et al., 2020; Abboud et al., 2020) and local context matrix features (Vignac et al., 2020). Other extensions of GNNs are surveyed, e.g., in Wu et al. (2021); Zhou et al. (2018) and Chami et al. (2021). Related are the Graph Homomorphism Convolutions of NT & Maehara (2020), which apply standard classifiers directly to representations of vertices by homomorphism counts. Finally, our approach is reminiscent of graph representations by means of graphlet kernels (Shervashidze et al., 2009), but at the level of vertices.
2 Local Graph Parameter Enabled GNNs
In this section we introduce MPNNs with local graph parameters. We start with preliminary concepts.
Graphs.
We consider undirected vertex-labelled graphs G = (V, E, χ), with V the set of vertices, E the set of edges, and χ a mapping assigning a label to each vertex in V. The set of neighbours of a vertex v is denoted by N_G(v). A rooted graph is a graph in which one of its vertices is declared to be its root. We denote a rooted graph by P^r, where r is the root, and depict rooted graphs as graphs in which the root is a blackened vertex. Given graphs G and H, a homomorphism h from G to H is a mapping h: V_G → V_H such that (i) h(v)h(w) ∈ E_H for every edge vw ∈ E_G, and (ii) χ_G(v) = χ_H(h(v)) for every v ∈ V_G. For rooted graphs P^r and G^v, a homomorphism must additionally map r to v. We denote by hom(G, H) the number of homomorphisms from G to H, and similarly for rooted graphs. For simplicity of exposition we focus on vertex-labelled undirected graphs, but all our results extend to edge-labelled directed graphs.
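As a concrete (if naive) illustration of these definitions, the following brute-force sketch counts homomorphisms between small unlabelled graphs stored as adjacency dicts; the function names and encoding are ours, and the enumeration is exponential in the pattern size, so it is meant only for tiny patterns.

```python
from itertools import product

def hom_count(P, G, root=None, image=None):
    """Count homomorphisms from pattern P to graph G (adjacency dicts).

    If root/image are given, only maps sending P's root to `image` are
    counted, i.e. the rooted count hom(P^root, G^image) used as a feature.
    """
    pv = sorted(P)
    count = 0
    for assignment in product(sorted(G), repeat=len(pv)):
        m = dict(zip(pv, assignment))
        if root is not None and m[root] != image:
            continue
        # every edge of P must be mapped onto an edge of G
        if all(m[b] in G[m[a]] for a in P for b in P[a]):
            count += 1
    return count

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}               # the 3-clique K_3
K4 = {i: {j for j in range(4) if j != i} for i in range(4)}  # the 4-clique
print(hom_count(triangle, K4, root=0, image=0))  # 6
print(hom_count(triangle, K4))                   # 24
```

Note that homomorphisms need not be injective in general, which is why hom counts of cliques come with the ordering factor (here 3! = 6 per triangle image).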
Message passing neural networks.
The basic architecture for MPNNs (Gilmer et al., 2017), and the one used in recent studies on GNN expressiveness (Morris et al., 2019; Xu et al., 2019; Barceló et al., 2020), consists of a sequence of rounds that update the feature vector of every vertex in the graph by combining its current feature vector with the result of an aggregation over the feature vectors of its neighbours. Formally, for a graph G, let x_v^(t) denote the feature vector computed for vertex v by an MPNN in round t. The initial feature vector x_v^(0) is a one-hot encoding of its label χ(v). This feature vector is iteratively updated in a number of rounds. In particular, in round t > 0,

x_v^(t) := Upd( x_v^(t-1), Agg({{ x_u^(t-1) : u ∈ N_G(v) }}) ),

where Agg and Upd are an aggregating and an update function, respectively. Thus, the feature vectors of all neighbours of v are combined by the aggregating function into a single vector, and this vector is then used together with x_v^(t-1) to produce x_v^(t) by applying the update function.
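The update above can be sketched in a few lines of NumPy; the sum-aggregation, the ReLU, and the two weight matrices W_self and W_neigh are illustrative assumptions (one common instantiation), not the paper's specific architecture.

```python
import numpy as np

def mpnn_round(adj, X, W_self, W_neigh):
    """One round: x_v <- ReLU(W_self^T x_v + W_neigh^T (sum of neighbours))."""
    # sum-aggregation over each vertex's neighbour feature vectors
    agg = np.stack([X[list(nbrs)].sum(axis=0) for nbrs in adj])
    return np.maximum(0.0, X @ W_self + agg @ W_neigh)

# a path on three vertices with one-hot initial features
adj = [[1], [0, 2], [1]]
X0 = np.eye(3)
X1 = mpnn_round(adj, X0, np.eye(3), np.eye(3))
```

With identity weights, each vertex's new feature is simply its own one-hot vector plus the sum of its neighbours' vectors.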
MPNNs with local graph parameters.
The GNNs studied in this paper leverage the power of MPNNs by enhancing the initial features of vertices with local graph parameters that are beyond their classification power. To illustrate the idea, consider the graphs G and H in Fig. 1. As mentioned, these graphs cannot be distinguished by the 1-WL test, and therefore cannot be distinguished by the broad class of MPNNs (see e.g. (Xu et al., 2019; Morris et al., 2019)). If we allow a pre-processing stage, however, in which the initial labelling of every vertex v is extended with the number of (homomorphic images of) 3-cliques in which v participates (indicated by the numbers between brackets in Fig. 1), then clearly vertices v and w (and the graphs G and H) can be distinguished based on this extra structural information. In fact, the initial labelling alone already suffices for this purpose.
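The effect is easy to reproduce on a classic 1-WL-indistinguishable pair, two disjoint triangles versus a 6-cycle (our own example, not necessarily the pair of Fig. 1): rooted triangle counts separate vertices that 1-WL cannot.

```python
def rooted_triangle_count(adj, v):
    # homomorphisms from a rooted K_3 sending the root to v: ordered pairs
    # (a, b) of neighbours of v that are themselves adjacent
    return sum(1 for a in adj[v] for b in adj[v] if b in adj[a])

two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
                 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

# Every vertex in both graphs is 2-regular, so 1-WL labels all twelve
# vertices identically, yet the rooted triangle counts differ:
print(rooted_triangle_count(two_triangles, 0))  # 2
print(rooted_triangle_count(hexagon, 0))        # 0
```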
Let F be a set of (rooted) graphs, which we refer to as patterns, say F = {P_1^{r_1}, …, P_ℓ^{r_ℓ}}. Then F-enabled MPNNs, or just F-MPNNs, are defined in the same way as MPNNs, with the crucial difference that now the initial feature vector of a vertex is a one-hot encoding of the label of the vertex together with all the homomorphism counts of patterns in F. Formally, an F-MPNN labels each vertex v in graph G with feature vectors x_v^(t), inductively defined as follows: the initial feature vector is

x_v^(0) := ( onehot(χ(v)), hom(P_1^{r_1}, G^v), …, hom(P_ℓ^{r_ℓ}, G^v) ),

and for rounds t > 0 the feature vectors are updated just as for standard MPNNs.

We note that standard MPNNs are F-MPNNs with F = ∅. As for MPNNs, we can equip F-MPNNs with a Readout function that aggregates all final feature vectors into a single feature vector in order to classify or distinguish graphs.
We emphasise that any MPNN architecture can be turned into an F-MPNN by a simple homomorphism-counting preprocessing step. As such, we propose a generic plug-in for a large class of GNN architectures. Better still, homomorphism counts of small graph patterns can be efficiently computed even on large datasets (Zhang et al., 2020), and they form the basis for counting (induced) subgraphs and other notions of subgraphs (Curticapean et al., 2017). Despite the simplicity of F-MPNNs, we will show that they can substantially increase the power of MPNNs by varying F, at only a one-time preprocessing cost.
F-MPNNs as a unifying framework.
An important aspect of F-MPNNs is that they allow a principled analysis of the power of existing extensions of MPNNs. For example, taking F to consist of a single rooted edge suffices to capture degree-aware MPNNs (Geerts et al., 2020), such as the Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017), which use the degrees of vertices; taking F to consist of rooted paths of bounded length suffices to model the walk counts used in Chen et al. (2019a); and taking F to be the set of labelled trees of depth one precisely corresponds to using the 1-WL labelling obtained after one round, as in Ishiguro et al. (2020). Furthermore, {C_k}-MPNNs, where C_k denotes the cycle of length k, correspond to the extension proposed in Section 4 of Li et al. (2019).
In addition, F-MPNNs can also capture the Graph Structure Networks (GSNs) of Bouritsas et al. (2020), which use subgraph isomorphism counts of graph patterns. We recall that an isomorphism from G to H is a bijective homomorphism from G to H whose inverse is also a homomorphism from H to G; when G and H are rooted graphs, isomorphisms must preserve the roots as well. Now, in a GSN, the feature vector of each vertex v is augmented with the number of subgraphs containing v that are isomorphic to a rooted pattern P^r, for rooted patterns P^r in a set F of patterns (the use of orbits in Bouritsas et al. (2020) is ignored here but explained in the appendix), and this is followed by the execution of an MPNN, just as for our F-MPNNs. It now remains to observe that subgraph isomorphism counts can be computed in terms of homomorphism counts, and vice versa (Curticapean et al., 2017). That is, GSNs can be viewed as F-MPNNs, and thus our results for F-MPNNs carry over to GSNs. We adopt homomorphism counts instead of subgraph isomorphism counts because homomorphism counts underlie existing characterizations of the expressive power of MPNNs, and because homomorphism counts are in general more efficient to compute. Also, Curticapean et al. (2017) indicate that all common graph counts are interchangeable in terms of expressive power.
3 Expressive Power of F-MPNNs
Recall that the standard 1-WL test (Weisfeiler & Lehman, 1968; Grohe, 2017) iteratively constructs a labelling of the vertices in a graph as follows. In round t, for each vertex v the algorithm first collects the label of v and the multiset of labels of all of its neighbours after round t−1, and then it hashes this aggregated multiset of labels into a new label for v. The initial label of v is its vertex label χ(v). As shown in Xu et al. (2019) and Morris et al. (2019), the 1-WL test provides a bound on the classification power of MPNNs: if two vertices or two graphs are indistinguishable by the 1-WL test, then they will not be distinguished by any MPNN.
In turn, the expressive power of the 1-WL test, and thus of MPNNs, can be characterised in terms of homomorphism counts of trees (Dvorak, 2010; Dell et al., 2018). This result can be seen as a characterisation of the expressiveness of the 1-WL test in terms of a particular infinite-dimensional graph kernel: the one defined by the number of homomorphisms from every tree T into the underlying graph G.
In this section we show that both characterisations extend in an elegant way to the setting of F-MPNNs, confirming that F-MPNNs are not just a useful, but also a well-behaved, generalisation of standard MPNNs.
3.1 Characterisation in terms of F-WL
We bound the expressive power of F-MPNNs in terms of what we call the F-WL test. Formally, the F-WL test extends 1-WL in the same way as F-MPNNs extend standard MPNNs: by including homomorphism counts of patterns in F in the initial labelling. That is, let F = {P_1^{r_1}, …, P_ℓ^{r_ℓ}}. The F-WL test is a vertex labelling algorithm that iteratively computes a label lab^(t)(v) for each vertex v of a graph G, defined as follows:

lab^(0)(v) := ( χ(v), hom(P_1^{r_1}, G^v), …, hom(P_ℓ^{r_ℓ}, G^v) ), and, for t > 0,
lab^(t)(v) := Hash( lab^(t-1)(v), {{ lab^(t-1)(u) : u ∈ N_G(v) }} ).

The F-WL test stops in round t when no new pairs of vertices are identified by means of lab^(t), that is, when for any two vertices v and w of G, lab^(t-1)(v) = lab^(t-1)(w) implies lab^(t)(v) = lab^(t)(w). Notice that ∅-WL is just the 1-WL test.
We can use the F-WL test to compare vertices of the same graph or of different graphs. We say that the F-WL test cannot distinguish vertices v and w if their final labels are the same, and that the F-WL test cannot distinguish graphs G and H if the multiset of labels computed for G is the same as that for H.
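Both tests are instances of colour refinement and can be sketched in a few lines; in this sketch (our own encoding) the init argument plays the role of the F-WL initial labelling, so init=None gives plain 1-WL, while passing each vertex's pattern homomorphism counts gives F-WL. To compare two different graphs rigorously, run the refinement on their disjoint union.

```python
def wl_colors(adj, init=None, rounds=None):
    """Colour refinement: init=None is 1-WL; init = F-pattern counts is F-WL."""
    colors = dict(init) if init is not None else {v: 0 for v in adj}
    for _ in range(rounds if rounds is not None else len(adj)):
        # refine: pair each colour with the sorted multiset of neighbour colours
        new = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        # canonicalise refined labels back to small integers ("hashing")
        canon = {lab: i for i, lab in enumerate(sorted(set(new.values())))}
        new = {v: canon[new[v]] for v in adj}
        if len(set(new.values())) == len(set(colors.values())):
            return colors  # refinement only splits classes: partition is stable
        colors = new
    return colors

two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
                 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
```

On these two 2-regular graphs, plain 1-WL assigns every vertex the same colour, whereas initialising with triangle counts separates them immediately.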
Similarly as for MPNNs and the 1-WL test (Xu et al., 2019; Morris et al., 2019), we obtain that the F-WL test provides an upper bound for the expressive power of F-MPNNs.
Proposition 1.
If two vertices of a graph cannot be distinguished by the F-WL test, then they cannot be distinguished by any F-MPNN either. Moreover, if two graphs cannot be distinguished by the F-WL test, then they cannot be distinguished by any F-MPNN either.
We can also construct F-MPNNs that mimic the F-WL test: simply adding local parameters from a set F of patterns to the GIN architecture of Xu et al. (2019) results in an F-MPNN that classifies vertices and graphs exactly as the F-WL test does.
3.2 Characterisation in terms of F-pattern trees
At the core of several results about the 1-WL test lies a characterisation linking the test with homomorphism counts of (rooted) trees (Dvorak, 2010; Dell et al., 2018). In view of the connection to MPNNs, it tells us that MPNNs only use quantitative tree-based structural information from the underlying graphs. We next extend this characterisation to F-WL by using homomorphism counts of so-called F-pattern trees. In view of the connection with F-MPNNs (Proposition 1), this reveals that F-MPNNs can use quantitative information about richer graph structures than MPNNs.
To define F-pattern trees we need the graph join operator ⋆. Given two rooted graphs P^r and Q^s, the join P^r ⋆ Q^s is obtained by taking the disjoint union of P and Q, followed by identifying s with r; the root of the join is r. Further, if G is a graph and Q^s is a rooted graph, then joining a vertex v in G with Q^s results in the disjoint union of G and Q, where s is identified with v.
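The join is straightforward to implement on adjacency-dict graphs; the following sketch (our own encoding, with vertices tagged by their graph of origin) shows, for instance, that joining two rooted triangles yields a five-vertex graph whose root has degree four.

```python
def join(P, rP, Q, rQ):
    """Join of rooted graphs P^rP and Q^rQ: disjoint union, roots identified."""
    g = {('P', v): {('P', u) for u in P[v]} for v in P}
    relabel = lambda v: ('P', rP) if v == rQ else ('Q', v)
    for v in Q:  # add Q's edges, merging Q's root into P's root
        g.setdefault(relabel(v), set()).update(relabel(u) for u in Q[v])
    return g, ('P', rP)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
glued, root = join(triangle, 0, triangle, 0)  # two triangles glued at roots
print(len(glued), len(glued[root]))           # 5 4
```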
Let F be a set of patterns. An F-pattern tree T is obtained from a standard rooted tree, called the backbone of T, by joining every vertex of the backbone with any number of copies of patterns from F. We define the depth of an F-pattern tree as the depth of its backbone. [Figure omitted: examples of F-pattern trees, in which the grey vertices form the backbones.] Standard trees are also F-pattern trees (simply join no patterns at all).
We next use F-pattern trees to characterise the expressive power of F-WL and thus, by Proposition 1, of F-MPNNs.
Theorem 1.
For any finite collection F of patterns, vertices v and w in graphs G and H, respectively, are indistinguishable by the F-WL test if and only if hom(T^r, G^v) = hom(T^r, H^w) for every rooted F-pattern tree T^r. Similarly, G and H are indistinguishable by the F-WL test if and only if hom(T, G) = hom(T, H) for every (unrooted) F-pattern tree T.
The proof of this theorem, which can be found in the appendix, requires extending techniques from Grohe (2020a, b) that were used to characterise the expressiveness of 1-WL in terms of homomorphism counts of trees.
In fact, we can make the above theorem more precise. When F-WL is run for t rounds, only F-pattern trees of depth at most t are required. This tells us that increasing the number of rounds of F-WL results in more complicated structural information being taken into account. For example, consider the two graphs G and H and vertices v and w shown in Fig. 2, and let F = {K_3}. By definition, F-WL cannot distinguish v from w based on the initial labelling. If run for one round, Theorem 1 implies that F-WL cannot distinguish v from w if and only if hom(T^r, G^v) = hom(T^r, H^w) for every F-pattern tree T^r of depth at most one. It is readily verified that these counts differ for some such pattern tree, and thus F-WL distinguishes v from w after one round. Similarly, G and H can be distinguished by F-WL after one round. We observe that v and w are indistinguishable by 1-WL. Hence, {K_3}-MPNNs are more expressive than MPNNs.
Importantly, Theorem 1 discloses the boundaries of F-MPNNs. To illustrate this for some specific instances of F-MPNNs mentioned earlier, the expressive power of degree-based MPNNs (Kipf & Welling, 2017; Geerts et al., 2020) is captured by edge-pattern trees, and walk-count MPNNs (Chen et al., 2019a) are captured by path-pattern trees. These pattern trees are just trees, since joining paths to trees only results in bigger trees. Thus, Theorem 1 tells us that all these extensions are still bounded by 1-WL (albeit needing fewer rounds). In contrast, and beyond 1-WL, {C_k}-pattern trees capture cycle counts (Li et al., 2019), and for GSNs (Bouritsas et al., 2020), which use subgraph isomorphism counts of patterns in F, the expressive power is captured by F'-pattern trees, where F' consists of all surjective homomorphic images of patterns in F (Curticapean et al., 2017).
4 A Comparison with the k-WL test
We propose F-MPNNs as an alternative and efficient way to extend the expressive power of MPNNs (and thus of the 1-WL test), compared to the computationally intensive higher-order GNNs based on the k-WL test (Morris et al., 2019, 2020; Maron et al., 2019a). In this section we situate F-WL in the k-WL hierarchy. The definition of k-WL is deferred to the appendix.
We have seen that {K_3}-WL can distinguish graphs that 1-WL cannot: it suffices to consider F = {K_3} for the 3-clique K_3. In order to generalise this observation we need some notation. Let F_1 and F_2 be two sets of patterns and consider an F_1-MPNN M_1 and an F_2-MPNN M_2. We say that M_1 is upper bounded in expressive power by M_2 if, for any graph G, whenever M_2 cannot distinguish vertices v and w (just as for the F-WL test, an MPNN cannot distinguish two vertices if the label it computes for both of them is the same), then neither can M_1. A similar notion is in place for pairs of graphs: if M_2 cannot distinguish graphs G and H, then neither can M_1.
More generally, let 𝒞_1 be a class of F_1-MPNNs and 𝒞_2 a class of F_2-MPNNs. We say that the class 𝒞_1 is upper bounded in expressive power by 𝒞_2 if every MPNN in 𝒞_1 is upper bounded in expressive power by some MPNN in 𝒞_2 (which may depend on the chosen MPNN in 𝒞_1). When 𝒞_1 is upper bounded by 𝒞_2 and vice versa, 𝒞_1 and 𝒞_2 are said to have the same expressive power. A class 𝒞_2 is more expressive than a class 𝒞_1 when 𝒞_1 is upper bounded in expressive power by 𝒞_2, but there exist graphs that can be distinguished by an MPNN in 𝒞_2 and not by any MPNN in 𝒞_1.
Finally, we use the notion of treewidth of a graph, which measures how tree-like a graph is. For example, trees have treewidth one, cycles have treewidth two, and the k-clique K_k has treewidth k−1 (for k ≥ 2). We define this standard notion in the appendix and only note that we define the treewidth of a rooted pattern P^r as the treewidth of its unrooted version P.
Our first result is a consequence of the characterisation of k-WL in terms of homomorphism counts of graphs of treewidth at most k (Dvorak, 2010; Dell et al., 2018).
Proposition 2.
For each finite set F of patterns, the expressive power of F-WL is bounded by k-WL, where k is the largest treewidth of a pattern in F.
For example, since the treewidth of K_3 is two, {K_3}-WL is bounded by 2-WL. Similarly, {C_k}-WL is bounded in expressive power by 2-WL.
Our second result tells us how to increase the expressive power of F-WL beyond a given level of the hierarchy. A pattern P is a core if every homomorphism from P to itself is injective. For example, every clique and every cycle of odd length is a core.
Theorem 2.
Let F be a finite set of patterns. If F contains a pattern P which is a core and has treewidth k, then there exist graphs that can be distinguished by F-WL but not by (k−1)-WL.
In other words, for such F, F-WL is not bounded by (k−1)-WL. For example, since K_3 is a core of treewidth two, {K_3}-WL is not bounded in expressive power by 1-WL. More generally, {K_{k+1}}-WL is not bounded by (k−1)-WL. The proof of Theorem 2 is based on extending deep techniques developed in finite model theory that have been used to understand the expressive power of finite variable logics (Atserias et al., 2007; Bova & Chen, 2019). This result is stronger than the one underlying the strictness of the k-WL hierarchy (Otto, 2017), which states that k-WL is strictly more expressive than (k−1)-WL. Indeed, Otto (2017) only shows the existence of a pattern P of treewidth k such that {P}-WL is not bounded by (k−1)-WL. In Theorem 2 we provide an explicit recipe for finding such a pattern P: it can be taken to be any core of treewidth k.
In summary, we have shown that for each k there is a set F of patterns such that (i) F-WL can distinguish graphs which cannot be distinguished by (k−1)-WL, yet (ii) F-WL cannot distinguish more graphs than k-WL. This begs the question of whether there is a finite set F such that F-WL is equivalent in expressive power to k-WL. We answer this negatively.
Proposition 3.
For any k ≥ 2, there does not exist a finite set F of patterns such that F-WL is equivalent in expressive power to k-WL.
In view of the connection between F-MPNNs and GSNs mentioned earlier, we thus show that no GSN can match the power of k-WL, which was a question left open in Bouritsas et al. (2020). We remark that if we allow F to consist of all (infinitely many) patterns of treewidth at most k, then F-WL is equivalent in expressive power to k-WL (Dvorak, 2010; Dell et al., 2018).
5 When Do Patterns Extend Expressiveness?
Patterns are not learned, but must be passed as input to the F-MPNN together with the graph structure. Thus, knowing which patterns work well, and which do not, is of key importance for the power of the resulting F-MPNNs. This is a difficult question to answer in general, since which patterns work well is clearly application-dependent. From a theoretical point of view, however, we can still look into interesting questions related to the problem of which patterns to choose. One such question, and the one studied in this section, is when a pattern adds expressive power over the ones already selected. More formally, we study the following problem: given a finite set F of patterns, when does adding a new pattern P to F extend the expressive power of the F-WL test?
To answer this question in the positive, we need to find two graphs G and H, show that they are indistinguishable by the F-WL test, and show that they can be distinguished by the (F ∪ {P})-WL test. As an example of this technique, we show that longer cycles always add expressive power. We use C_k to represent the cycle of length k.
Proposition 4.
For any k ≥ 3, {C_3, …, C_{k+1}}-WL is more expressive than {C_3, …, C_k}-WL.
We also observe that, by Proposition 2, {C_3, …, C_k}-WL is bounded by 2-WL for any k, because cycles have treewidth two. It is often the case, however, that finding such graphs G and H, and proving that they are indistinguishable, can be rather challenging. Instead, in this section we provide two techniques that can be used to partially answer the question posed above by looking only at properties of the sets of patterns. Our first result establishes when a pattern does not add expressive power to a given set of patterns; the second, when it does.
5.1 Detecting when patterns are superfluous
Our first result is a simple recipe for choosing local features: instead of choosing complex patterns that are the joins of smaller patterns, one should opt for the smaller patterns.
Proposition 5.
Let P be a pattern that is the join of two smaller patterns P_1 and P_2. Then for any set F of patterns, we have that (F ∪ {P})-WL is upper bounded by (F ∪ {P_1, P_2})-WL.
Stated differently, this means that adding to F any pattern which is the join of two patterns already in F does not add expressive power.
Thus, instead of using, for example, the "bowtie" pattern (the join of two rooted triangles), one should prefer to use the triangle itself. This result is in line with other advantages of smaller patterns: their homomorphism counts are easier to compute, and, since they are less specific, they should tend to produce less over-fitting.
5.2 Detecting when patterns add expressiveness
Joining patterns into new patterns does not give extra expressive power, but what about patterns which are not joins? We next provide a useful recipe for detecting when a pattern does add expressive power. We recall that the core of a graph is its unique (up to isomorphism) induced subgraph that is a core.
Theorem 3.
Let F be a finite set of patterns and let P be a pattern whose core has treewidth k. Then (F ∪ {P})-WL is more expressive than F-WL if every pattern P' in F satisfies one of the following conditions: (i) P' has treewidth smaller than k; or (ii) P' does not map homomorphically to P.
As an example, {K_k, K_{k+1}}-WL is more expressive than {K_k}-WL for any k ≥ 2, because of the first condition. Similarly, {K_k, C_ℓ}-WL is more expressive than {K_k}-WL for odd cycles C_ℓ of length at least five. Indeed, such cycles are cores, and no clique K_k with k ≥ 3 maps homomorphically to C_ℓ.
6 Experiments
We next showcase that GNN architectures benefit when homomorphism counts of patterns are added as additional vertex features. For patterns where homomorphism and subgraph isomorphism counts differ (e.g., cycles) we compare with GSNs (Bouritsas et al., 2020). We use the benchmark for GNNs by Dwivedi et al. (2020), as it offers a broad choice of models, datasets and graph learning tasks.
Selected GNNs.
We select the best-performing architectures from Dwivedi et al. (2020): Graph Attention Networks (GATs) (Velickovic et al., 2018), Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), Gaussian Mixture Model networks (MoNet) (Monti et al., 2017) and GatedGCN (Bresson & Laurent, 2017). We leave out various linear architectures such as GIN (Xu et al., 2019), as they were shown to perform poorly on the benchmark.
Learning tasks and datasets.
As in Dwivedi et al. (2020) we consider (i) graph regression and the ZINC dataset (Irwin et al., 2012a; Dwivedi et al., 2020); (ii) vertex classification and the PATTERN and CLUSTER datasets (Dwivedi et al., 2020); and (iii) link prediction and the COLLAB dataset (Hu et al., 2020). We omit graph classification: for this task, the graph datasets from Dwivedi et al. (2020) originate from image data and hence vertex neighborhoods carry little information.
Patterns.
We extend the initial features of vertices with homomorphism counts of cycles C_k up to a fixed length, when molecular data (ZINC) is concerned, and with homomorphism counts of k-cliques K_k for small k, when social or collaboration data (PATTERN, CLUSTER, COLLAB) is concerned. We use the z-score of the logarithms of the homomorphism counts to make them standard-normally distributed and comparable to other features. Section 5 tells us that all these patterns increase expressive power (Theorem 3 and Proposition 4) and are "minimal" in the sense that they are not joins of smaller patterns (Proposition 5). Similar pattern choices were used in Bouritsas et al. (2020). We use DISC (Zhang et al., 2020), a tool specifically built to obtain homomorphism counts on large datasets (we thank the authors for providing us with an executable). Each model is trained and tested independently for different combinations of patterns.
Higher-order GNNs.
We do not compare to higher-order GNNs since this was already done by Dwivedi et al. (2020). They included Ring-GNNs and 3WL-GNNs in their experiments, and these were outperformed by the "linear" architectures we selected. Although the increased expressive power of higher-order GNNs may be beneficial for learning, scalability and learning issues (e.g., loss divergence) hamper their applicability (Dwivedi et al., 2020). Our approach thus outperforms higher-order GNNs with respect to this benchmark.
Methodology.
Graphs were split into training and test sets as instructed by Dwivedi et al. (2020), and all reported numbers correspond to the test set. The reported performance is the average over four runs with different random seeds for the respective combinations of pattern set F, model and dataset. Training times were comparable to the baseline of training models without any augmented features. (Code to reproduce our experiments is available at https://github.com/LGP-GNN-2021/LGP-GNN.) All models for ZINC, PATTERN and COLLAB were trained on a GeForce GTX 1080 Ti GPU; for CLUSTER, a Tesla V100-SXM3-32GB GPU was used.
Next we summarize our results for each learning task separately.
Graph regression.
The first task of the benchmark is the prediction of the solubility of molecules in the ZINC dataset (Irwin et al., 2012a; Dwivedi et al., 2020), a dataset of about graphs of small size, each consisting of one particular molecule. The results in Table 1 show that each of our models indeed improves by adding homomorphism counts of cycles, and the best result is obtained by considering all cycles. were applied to the ZINC dataset as well (Bouritsas et al., 2020). In Table 1 we also report results using subgraph isomorphism counts (as in ): homomorphism counts generally provide better results than subgraph isomorphism counts. Our best result (framed in Table 1) is competitive with the value reported in Bouritsas et al. (2020). Looking at the full results, we see that some cycles are much more important than others. Table 2 shows which cycles have the greatest impact on the worst-performing baseline, . Remarkably, adding homomorphism counts makes the model competitive with the best performers of the benchmark.
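For cycles, per-vertex homomorphism counts are in principle easy to obtain: a homomorphism from a cycle of length k rooted at vertex u is exactly a closed walk of length k starting and ending at u, so the count is the u-th diagonal entry of A^k. A minimal sketch (our own illustration, not the DISC tool used in the experiments):

```python
import numpy as np

def rooted_cycle_hom_counts(A, max_len):
    """Homomorphism counts of rooted cycles C_3, ..., C_max_len at each vertex.

    A homomorphism from a k-cycle rooted at u is a closed walk of
    length k at u, so the count is (A^k)_{uu}.
    Returns an (n, max_len - 2) integer array, one column per cycle length.
    """
    n = A.shape[0]
    counts = np.zeros((n, max_len - 2), dtype=np.int64)
    Ak = A.copy()
    for k in range(2, max_len + 1):
        Ak = Ak @ A                  # now Ak = A^k
        if k >= 3:
            counts[:, k - 3] = np.diag(Ak)
    return counts
```

Note that homomorphisms need not be injective, so these counts include degenerate walks; this is precisely the difference with subgraph isomorphism counts discussed above.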
Model | MAE (base) | MAE (hom) | MAE (iso)
---|---|---|---
 | 0.47±0.02 | 0.22±0.01 | 0.24±0.01
 | 0.35±0.01 | 0.20±0.01 | 0.22±0.01
 | 0.44±0.01 | 0.24±0.01 | 0.24±0.01
 | 0.25±0.01 | 0.19±0.01 | 0.16±0.01
 | 0.34±0.05 | 0.1353±0.01 | 0.1357±0.01
Set | MAE
---|---
None | 0.47±0.02
 | 0.45±0.01
 | 0.34±0.02
 | 0.31±0.01
 | 0.28±0.01
 | 0.23±0.01
 | 0.22±0.01
Vertex classification.
The next task in the benchmark corresponds to vertex classification. Here we analyze two datasets, PATTERN and CLUSTER (Dwivedi et al., 2020), both containing over artificially generated graphs resembling social networks or communities. The task is to predict whether a vertex belongs to a particular cluster or pattern, and all results are measured using the accuracy of the classifier. Here too, our results show that homomorphism counts, this time of cliques, tend to improve the accuracy of our models. Indeed, for the PATTERN dataset we see an improvement in all models but (Table 3), and three models are improved in the CLUSTER dataset (reported in the appendix). Once again, the best performer in this task is a model that uses homomorphism counts. We remark that for cliques, homomorphism counts coincide with subgraph isomorphism counts (up to a constant factor), so our extensions behave like .
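The remark that clique homomorphism counts coincide with subgraph isomorphism counts up to a constant factor (the k! automorphisms of the k-clique, since every homomorphism from a clique is injective) can be checked by brute force on small graphs; the helper names below are our own:

```python
from itertools import permutations, combinations

def clique_hom_count(adj, k):
    """Homomorphisms of the k-clique into the graph: ordered k-tuples of
    pairwise-adjacent vertices (all homs from a clique are injective)."""
    n = len(adj)
    return sum(
        all(adj[t[i]][t[j]] for i in range(k) for j in range(k) if i != j)
        for t in permutations(range(n), k)
    )

def clique_sub_count(adj, k):
    """Number of k-cliques occurring as (unordered) subgraphs."""
    return sum(
        all(adj[u][v] for u, v in combinations(c, 2))
        for c in combinations(range(len(adj)), k)
    )
```

On any graph, `clique_hom_count(adj, k)` equals `factorial(k)` times `clique_sub_count(adj, k)`, which is why hom-count and subgraph-count features are interchangeable for cliques after normalization.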
Model + best | Accuracy baseline | Accuracy best
---|---|---
 | 78.83±0.60 | 85.50±0.23
 | 71.42±1.38 | 82.49±0.48
 | 70.78±0.19 | 85.85±0.15
 | 85.90±0.03 | 86.63±0.03
 | 86.15±0.08 | 86.15±0.08
Link prediction.
In our final task we consider a single graph, COLLAB (Hu et al., 2020), with over vertices, containing information about the collaborators in an academic network, and the task at hand is to predict future collaboration. The metric used in the benchmark is the Hits@50 evaluator (Hu et al., 2020). Here, positive collaborations are ranked among randomly sampled negative collaborations, and the metric is the ratio of positive edges that are ranked at place 50 or above. Once again, homomorphism counts of cliques improve the performance of all models, see Table 4. An interesting observation is that this time the best set of features (cliques) does depend on the model, although the best model uses all cliques again.
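The metric just described can be sketched as follows; this mirrors the behaviour of the OGB Hits@K evaluator but is our own simplified reimplementation, not the official one:

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k=50):
    """Fraction of positive edges ranked above the k-th best negative.

    A positive edge counts as a hit if its predicted score exceeds the
    k-th highest score among the sampled negative edges.
    """
    neg_scores = np.asarray(neg_scores)
    if len(neg_scores) < k:
        return 1.0                           # every positive trivially ranks in the top k
    threshold = np.sort(neg_scores)[-k]      # k-th highest negative score
    return float(np.mean(np.asarray(pos_scores) > threshold))
```

Because the negatives are randomly sampled, the metric is a noisy estimate, which is consistent with the seed-averaged numbers reported in Table 4.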
Model + best | Hits@50 baseline | Hits@50 best
---|---|---
 | 50.32±0.55 | 52.87±0.87
 | 51.35±1.30 | 54.60±1.01
 | 50.33±0.68 | 51.39±1.23
 | 49.81±1.56 | 51.76±1.38
 | 51.00±2.54 | 51.57±0.68
Remarks.
The best performers in each task use homomorphism counts, in accordance with our theoretical results, showing that such counts do extend the power of . Homomorphism counts are also cheap to compute: for COLLAB, the largest graph in our experiments, the homomorphism counts of all patterns we used, for all vertices, could be computed by DISC (Zhang et al., 2020) in less than 3 minutes. One important remark is that selecting the best set of features remains a challenging endeavor. Our theoretical results help us streamline this search, but for now it is still an exploratory task. In our experiments we first looked at adding each pattern individually, and then tried combinations of those that showed the best improvements. This feature selection strategy incurs considerable cost, both computational and environmental, and needs further investigation.
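The two-stage exploratory strategy just described, scoring each pattern individually and then combining the best performers, can be sketched as follows; `train_and_eval` is a hypothetical user-supplied callback (train a model with the given feature set and return a validation score), not part of any real API:

```python
def select_patterns(patterns, train_and_eval, top=3):
    """Two-stage exploratory search over pattern features.

    Stage 1: score each pattern on its own.
    Stage 2: try the combination of the `top` best-scoring patterns,
    keeping it only if it beats the best single pattern.
    `train_and_eval(feature_set)` returns a score (higher is better).
    """
    solo = {p: train_and_eval([p]) for p in patterns}
    best = sorted(solo, key=solo.get, reverse=True)[:top]
    if train_and_eval(best) >= max(solo.values()):
        return best
    return [max(solo, key=solo.get)]
```

Each callback invocation is a full training run, which is exactly why this search incurs the computational and environmental cost noted above.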
7 Conclusion
We propose local graph parameter enabled as an efficient way to increase the expressive power of . The take-away message is that enriching features with homomorphism counts of small patterns is a promising add-on to any architecture, with little to no overhead. Regarding future work, the problem of which parameters to choose deserves further study. In particular, we plan to provide a complete characterisation of when adding a new pattern to adds expressive power to the -test.
References
- Abbe (2018) Abbe, E. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018. URL http://jmlr.org/papers/v18/16-480.html.
- Abboud et al. (2020) Abboud, R., Ceylan, İ. İ., Grohe, M., and Lukasiewicz, T. The surprising power of graph neural networks with random node initialization. arXiv, 2020. URL https://arxiv.org/abs/2010.01179.
- Arvind et al. (2020) Arvind, V., Fuhlbrück, F., Köbler, J., and Verbitsky, O. On Weisfeiler-Leman invariance: Subgraph counts and related graph properties. Journal of Computer and System Sciences, 113:42 – 59, 2020. URL https://doi.org/10.1016/j.jcss.2020.04.003.
- Atserias et al. (2007) Atserias, A., Bulatov, A. A., and Dalmau, V. On the power of -consistency. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming, volume 4596 of Lecture Notes in Computer Science, pp. 279–290, 2007. URL https://doi.org/10.1007/978-3-540-73420-8_26.
- Azizian & Lelarge (2021) Azizian, W. and Lelarge, M. Expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=lxHgXYN4bwl.
- Barceló et al. (2020) Barceló, P., Kostylev, E. V., Monet, M., Pérez, J., Reutter, J., and Silva, J. P. The logical expressiveness of graph neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1lZ7AEKvB.
- Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003. URL https://ieeexplore.ieee.org/document/6789755.
- Bouritsas et al. (2020) Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. arXiv, 2020. URL https://arxiv.org/abs/2006.09252.
- Bova & Chen (2019) Bova, S. and Chen, H. How many variables are needed to express an existential positive query? Theory Comput. Syst., 63(7):1573–1594, 2019. URL https://doi.org/10.1007/s00224-018-9884-z.
- Bresson & Laurent (2017) Bresson, X. and Laurent, T. Residual gated graph convnets. arXiv, 2017. URL https://arxiv.org/abs/1711.07553.
- Cai et al. (1992) Cai, J., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389–410, 1992. URL https://doi.org/10.1007/BF01305232.
- Chami et al. (2021) Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., and Murphy, K. Machine learning on graphs: A model and comprehensive taxonomy. arXiv, 2021. URL https://arxiv.org/abs/2005.03675.
- Chen et al. (2019a) Chen, T., Bian, S., and Sun, Y. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv, 2019a. URL https://arxiv.org/abs/1905.04579.
- Chen et al. (2019b) Chen, Z., Villar, S., Chen, L., and Bruna, J. On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems, volume 32, pp. 15894–15902, 2019b. URL https://proceedings.neurips.cc/paper/2019/file/71ee911dd06428a96c143a0b135041a4-Paper.pdf.
- Chen et al. (2020) Chen, Z., Chen, L., Villar, S., and Bruna, J. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://arxiv.org/abs/2002.04025.
- Curticapean et al. (2017) Curticapean, R., Dell, H., and Marx, D. Homomorphisms are a good basis for counting small subgraphs. In Proceedings of the 49th Symposium on Theory of Computing, pp. 210––223, 2017. URL http://dx.doi.org/10.1145/3055399.3055502.
- Damke et al. (2020) Damke, C., Melnikov, V., and Hüllermeier, E. A novel higher-order Weisfeiler-Lehman graph convolution. arXiv, 2020. URL https://arxiv.org/abs/2007.00346.
- Dell et al. (2018) Dell, H., Grohe, M., and Rattan, G. Lovász meets Weisfeiler and Leman. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, volume 107, pp. 40:1–40:14, 2018. URL https://doi.org/10.4230/LIPIcs.ICALP.2018.40.
- Dvorak (2010) Dvorak, Z. On recognizing graphs by numbers of homomorphisms. Journal of Graph Theory, 64(4):330–342, 2010. URL https://doi.org/10.1002/jgt.20461.
- Dwivedi et al. (2020) Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv, 2020. URL https://arxiv.org/abs/2003.00982.
- Fürer (2017) Fürer, M. On the combinatorial power of the Weisfeiler-Lehman algorithm. In Proceedings of the 10th International Conference on Algorithms and Complexity, volume 10236 of Lecture Notes in Computer Science, pp. 260–271, 2017. URL https://doi.org/10.1007/978-3-319-57586-5_22.
- Garg et al. (2020) Garg, V. K., Jegelka, S., and Jaakkola, T. S. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 3419–3430, 2020. URL http://proceedings.mlr.press/v119/garg20c.html.
- Geerts (2020) Geerts, F. The expressive power of th-order invariant graph networks. arXiv, 2020. URL https://arxiv.org/abs/2007.12035.
- Geerts et al. (2020) Geerts, F., Mazowiecki, F., and Pérez, G. A. Let’s agree to degree: Comparing graph convolutional networks in the message-passing framework. arXiv, 2020. URL https://arxiv.org/abs/2004.02593.
- Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1263–1272, 2017. URL http://proceedings.mlr.press/v70/gilmer17a/gilmer17a.pdf.
- Grohe (2017) Grohe, M. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Lecture Notes in Logic. Cambridge University Press, 2017. URL https://doi.org/10.1017/9781139028868.
- Grohe (2020a) Grohe, M. word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Proceedings of the 39th Symposium on Principles of Database Systems, pp. 1–16, 2020a. URL https://doi.org/10.1145/3375395.3387641.
- Grohe (2020b) Grohe, M. Counting bounded tree depth homomorphisms. In Proceedings of the 35th Symposium on Logic in Computer Science, pp. 507–520, 2020b. URL https://doi.org/10.1145/3373718.3394739.
- Hamilton et al. (2017) Hamilton, W. L., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, volume 30, pp. 1024–1034, 2017.
- Hella (1996) Hella, L. Logical hierarchies in PTIME. Inf. Comput., 129(1):1–19, 1996. URL https://doi.org/10.1006/inco.1996.0070.
- Hu et al. (2020) Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://arxiv.org/abs/2005.00687.
- Irwin et al. (2012a) Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012a. URL https://doi.org/10.1021/ci3001277.
- Ishiguro et al. (2020) Ishiguro, K., Oono, K., and Hayashi, K. Weisfeiler-Lehman embedding for molecular graph neural networks. arXiv, 2020. URL https://arxiv.org/abs/2006.06909.
- Keriven & Peyré (2019) Keriven, N. and Peyré, G. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems, volume 32, pp. 7092–7101, 2019. URL https://proceedings.neurips.cc/paper/2019/file/ea9268cb43f55d1d12380fb6ea5bf572-Paper.pdf.
- Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJU4ayYgl.
- Lacoste et al. (2019) Lacoste, A., Luccioni, A., Schmidt, V., and Dandres, T. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
- Li et al. (2019) Li, M. L., Dong, M., Zhou, J., and Rush, A. M. A hierarchy of graph neural networks based on learnable local features. arXiv, 2019. URL https://arxiv.org/abs/1911.05256.
- Loukas (2020) Loukas, A. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l2bp4YwS.
- Lovász (1967) Lovász, L. Operations with structures. Acta Mathematica, 18:321–328, 1967. URL https://doi.org/10.1007/BF02280291.
- Maron et al. (2019a) Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems, volume 32, pp. 2153–2164, 2019a. URL http://papers.nips.cc/paper/8488-provably-powerful-graph-networks.
- Maron et al. (2019b) Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=Syx72jC9tm.
- Maron et al. (2019c) Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 4363–4371, 2019c. URL http://proceedings.mlr.press/v97/maron19a.html.
- Merkwirth & Lengauer (2005) Merkwirth, C. and Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. J. Chem. Inf. Model., 45(5):1159–1168, 2005. URL https://pubs.acs.org/doi/pdf/10.1021/ci049613b.
- Monti et al. (2017) Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5115–5124, 2017. URL https://openaccess.thecvf.com/content_cvpr_2017/papers/Monti_Geometric_Deep_Learning_CVPR_2017_paper.pdf.
- Morris et al. (2019) Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 4602–4609, 2019. URL https://doi.org/10.1609/aaai.v33i01.33014602.
- Morris et al. (2020) Morris, C., Rattan, G., and Mutzel, P. Weisfeiler and leman go sparse: Towards scalable higher-order graph embeddings. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://arxiv.org/abs/1904.01543.
- NT & Maehara (2020) NT, H. and Maehara, T. Graph homomorphism convolution. arXiv, 2020. URL https://arxiv.org/abs/2005.01214.
- Otto (2017) Otto, M. Bounded Variable Logics and Counting: A Study in Finite Models, volume 9 of Lecture Notes in Logic. Cambridge University Press, 2017. URL https://doi.org/10.1017/9781316716878.
- Sato (2020) Sato, R. A survey on the expressive power of graph neural networks. arXiv, 2020. URL https://arxiv.org/abs/2003.04078.
- Sato et al. (2019) Sato, R., Yamada, M., and Kashima, H. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems, volume 32, pp. 4083–4092, 2019. URL https://papers.nips.cc/paper/2019/file/635440afdfc39fe37995fed127d7df4f-Paper.pdf.
- Sato et al. (2020) Sato, R., Yamada, M., and Kashima, H. Random features strengthen graph neural networks. arXiv, 2020. URL https://arxiv.org/abs/2002.03155.
- Scarselli et al. (2009) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009. URL https://doi.org/10.1109/TNN.2008.2005605.
- Shervashidze et al. (2009) Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. Efficient graphlet kernels for large graph comparison. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5, pp. 488–495, 2009. URL http://proceedings.mlr.press/v5/shervashidze09a.html.
- Tahmasebi & Jegelka (2020) Tahmasebi, B. and Jegelka, S. Counting substructures with higher-order graph neural networks: Possibility and impossibility results. arXiv, 2020. URL https://arxiv.org/abs/2012.03174.
- Velickovic et al. (2018) Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
- Vignac et al. (2020) Vignac, C., Loukas, A., and Frossard, P. Building powerful and equivariant graph neural networks with structural message-passing. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/a32d7eeaae19821fd9ce317f3ce952a7-Abstract.html.
- Weisfeiler & Lehman (1968) Weisfeiler, B. J. and Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9):12–16, 1968. URL https://www.iti.zcu.cz/wl2018/pdf/wl_paper_translation.pdf.
- Wu et al. (2021) Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021. URL https://doi.org/10.1109/TNNLS.2020.2978386.
- Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
- Zhang et al. (2020) Zhang, H., Yu, J. X., Zhang, Y., Zhao, K., and Cheng, H. Distributed subgraph counting: A general approach. Proc. VLDB Endow., 13(11):2493–2507, 2020. URL http://www.vldb.org/pvldb/vol13/p2493-zhang.pdf.
- Zhou et al. (2018) Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph neural networks: A review of methods and applications. ArXiv, 2018. URL http://arxiv.org/abs/1812.08434.
Appendix
Appendix A Proofs of Section 3
We use the following notions. Let and be graphs, , , and . The vertices and are said to be indistinguishable by in round , denoted by , iff . Similarly, and are said to be indistinguishable by in round , denoted by , iff . Along the same lines, and are indistinguishable by an - , denoted by , iff . Similarly, and are said to be indistinguishable by in round , denoted by , iff .
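The round-based indistinguishability relations above are defined via colour refinement. For concreteness, a minimal sketch of the one-dimensional (vertex-level) refinement, with our own function name; two vertices are indistinguishable in round t exactly when they receive the same colour after t iterations:

```python
def wl_refine(adj, labels, rounds):
    """One-dimensional Weisfeiler-Leman colour refinement: each round,
    hash a vertex's colour together with the multiset of its
    neighbours' colours, then compress colours to small integers."""
    colors = list(labels)
    for _ in range(rounds):
        new = [
            (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in range(len(adj))
        ]
        # canonical compression keeps colours comparable across rounds
        table = {c: i for i, c in enumerate(sorted(set(new)))}
        colors = [table[c] for c in new]
    return colors
```

On a path with three vertices and uniform initial labels, one round already separates the middle vertex from the endpoints, while the two endpoints stay indistinguishable forever.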
A.1 Proof of Proposition 1
We show that the class of - is upper bounded in expressive power by . The proof is analogous to the proof in Morris et al. (2019) showing that are bounded by .
We show a stronger result by upper bounding - by -test, layer by layer. More precisely, we show that for every - , graphs and , vertices , , and ,
-
(1)
; and
-
(2)
.
Clearly, these imply that - are bounded in expressive power by , both when vertex and graph distinguishability are concerned.
Proof of implication (1). We show this implication by induction on the number of rounds.
Base case. We first assume . In other words, and thus, and for every we have . By definition, is a one-hot encoding of combined with for , for every , graph and vertex . Since these agree with the labelling and homomorphism counts for vertex in graph , we also have that , as desired.
Inductive step. We next assume . By the definition of this is equivalent to and . By the induction hypothesis, this implies and there exists a bijection such that for every , and every - . In other words, and for every . By the definition of - this implies that is equal to and hence also, after applying , . That is, , as desired.
Proof of implication (2). The implication now easily follows. Indeed, is equivalent to . In other words, there exists a bijection such that for every . We have just shown that this implies for every and for every - . Hence, , or , as desired. ∎
A.2 Proof of Theorem 1
We show that for any finite collection of patterns, graphs and , vertices and , and :
(1) |
for every -pattern tree of depth at most . Similarly,
(2) |
for every (unrooted) -pattern tree of depth at most .
For a given set of patterns and , we denote by the graph pattern of the form , that is, we join copies of , copies of and so on.
Proof of equivalence (1). The proof is by induction on the number of rounds .
We first consider the implication for every -pattern tree of depth at most .
Base case. Let us first consider the base case, that is, . In other words, we consider -pattern trees consisting of a single root adorned with a pattern for some . We note that due to the properties of the graph join operator:
(3) |
Since, , we know that for some and for all . This implies that the product in (3) is equal to
as desired.
Inductive step. Suppose next that we know that the implication holds for . We assume now and consider an -pattern tree of depth at most . Assume that in the backbone of , the root has children , and denote by the -pattern trees in rooted at . Furthermore, we denote by the -pattern tree obtained from by attaching to ; has root . Let be the pattern in associated with . The following equalities are readily verified:
(4) |
Recall now that we assume and thus, in particular, . Hence, by induction, for every -pattern tree of depth . In particular, this holds for and hence
Furthermore, implies that there exists a bijection such that for every . By induction, for every there thus exists a unique such that for every -pattern tree of depth at most . In particular, for every there exists a such that
for each of the sub-trees in . Hence, (4) is equal to
which in turn is equal to , as desired.
We next consider the other direction, that is, we show that when holds for every -pattern tree of depth at most , then holds. This is again verified by induction on . This direction is more complicated and is similar to techniques used in Grohe (2020b). In our induction hypothesis we further include that a finite number of -pattern trees suffices to infer for graphs and and vertices and .
Base case. Let us consider the base case first. We need to show that and for every , since this implies .
We first observe that for every -pattern tree of depth , implies that and must be assigned the same label, say , by and , respectively.
Indeed, if we take to consist of a single root labeled with (and thus is associated with the pattern ), then will be one if and zero otherwise. This implies that .
Next, we show that for every . It suffices to consider the -pattern tree consisting of a root joined with a single copy of .
We observe that we only need a finite number of -pattern trees to infer . Indeed, suppose that and assign labels , then we need single vertex trees with no patterns attached and root labeled with one of these labels. In addition, we need one -pattern tree for each pattern and each label . That is, we need -pattern trees of depth .
Inductive step. We now assume that the implication holds for and consider . That is, we assume that if holds for every -pattern tree of depth at most , then holds. Furthermore, we assume that only a finite number of -pattern trees of depth at most suffice to infer .
So, for , let us assume that holds for every -pattern tree of depth at most . We need to show and that we can again assume that a finite number of -pattern trees of depth at most suffice to infer .
By definition of , we can, equivalently, show that and that there exists a bijection such that for every . That holds, is by induction, since for every -pattern tree of depth at most and thus also for every -pattern tree of depth at most . We may thus focus on showing the existence of the bijection .
Let , and . We know, by induction and the proof of the previous implication, that if and only if for each . Denote by the equivalence class on induced by . Furthermore, define and let and for and , for each . If we can show that for each , then this implies the existence of the desired bijection.
Let be the -pattern tree of depth at most obtained by attaching to a new root vertex labeled with . We may assume that and both have label , since their homomorphism counts agree on the single-root trees with labels from . The root vertex is not joined with any (or alternatively it is joined with ). It will be convenient to denote the root of by instead of . Then for each :
where and denote arbitrary vertices in and , respectively. Let us denote by and observe that this is equal to . Hence, we know that for each :
Let us call a set compatible if all roots in , for , have the same label. Consider a vector and define its support as . We say that is compatible if its support is. For such a compatible we now define to be the -pattern tree with root labeled with , with one child which is joined with (and inheriting the label from) the following -pattern tree of depth :
In other words, we simply join together powers of the ’s that have roots with the same label. Then for every compatible :
where, as before, and denote arbitrary vertices in and , respectively. Hence, for any compatible :
We now continue in the same way as in the proof of Lemma 4.2 in Grohe (2020b). We repeat the argument here for completeness. Define for each . We assume, for the sake of contradiction, that there exists a such that . We choose such a for which is inclusion-wise maximal.
We first rule out that . Indeed, suppose that . This implies that . Now observe that and are mutually distinct for all , . Indeed, if they were equal then this would imply that . Hence, for any . We note that for all by the maximality of . Hence, , contradicting our assumption. Hence, .
Consider . For each , consider the truncated vector . We note that , for , all have positive entries and are mutually distinct. Lemma 4.1 in Grohe (2020b) implies that we can find a vector (with non-zero entries) such that the numbers for are mutually distinct as well. We next consider with if and otherwise. Then by definition of , also for are mutually distinct.
We next note that for every , and if we define to be the -matrix such that then this will be an invertible matrix (Vandermonde). We use this invertibility to show that .
Let and . If we inspect the th entry of , then this is equal to
We want to reduce the above expression to
To see that this holds, we verify that when then . Indeed, take an such that . Then, contains the factor with . Hence, .
Now, by the maximality of , for all with we have and thus
Since , we thus also have that
Since this holds for all , we have and by the invertibility of , . In particular, since , contradicting our assumption.
As a consequence, for all and thus we have our desired bijection.
It remains to verify that we only need a finite number of -pattern trees to conclude that for all . In fact, the above proof indicates that we just need to check the test for each root label , and the identities for the finite number of pattern trees used to define the matrix .
Proof of equivalence (2). This equivalence is shown just like the proof of Theorem 4.4 in Grohe (2020a), with the additional techniques from Lemma 4.2 in Grohe (2020b).
We first show that implies for unrooted -pattern trees of depth at most .
Assume that for . For and , define if and only if for all -pattern trees of depth at most . Let be the -equivalence classes and for each , let and . Suppose that . This implies that for every .
Let be an unrooted -pattern tree of depth at most , let be any vertex on the backbone of , and let be the rooted -pattern tree obtained from by declaring as its root. By definition, for , any , are all the same number, only dependent on . Hence,
where and are arbitrary vertices in and , respectively, and where we used that and . Since this holds for any unrooted -pattern tree of depth at most , we have shown the desired implication.
We next check the other direction. That is, we assume that holds for any unrooted -pattern tree of depth at most and verify that .
For to hold for , and , we earlier showed that this corresponds to checking whether for a finite number of rooted -pattern trees . By definition of the ’s, for is well-defined (independent of the choice of ). For the rooted ’s we denote by its unrooted version. As before,
We next show that for . In fact, this is shown in precisely the same way as in our previous characterisation and based on Lemma 4.2 in Grohe (2020b). That is, we again consider trees obtained by joining copies of the ’s, to obtain, for compatible ,
It now suffices to repeat the same argument as before (details omitted). ∎
Appendix B Proofs of Section 4
B.1 Additional details of standard concepts
Core and treewidth.
A graph is a core if all homomorphisms from to itself are injective. The treewidth of a graph is a measure of how much resembles a tree. This is defined in terms of the tree decompositions of , which are pairs , for a tree and a mapping that associates each vertex of with a set , satisfying the following:
-
•
The union of , for , is equal to ;
-
•
The set is connected, for all ; and
-
•
For each there is with .
The width of is . The treewidth of is the minimum width of its tree decompositions. For instance, trees have treewidth one, cycles have treewidth two, and the -clique has treewidth (for ).
If is a pattern, then its treewidth is defined as the treewidth of the graph . Similarly, is a core if is.
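The three conditions of a tree decomposition can be verified mechanically. The following sketch (our own helper, with bags given as a dict from tree nodes to vertex sets) checks vertex coverage, edge coverage, and connectivity of each vertex's bags, and computes the width:

```python
def is_tree_decomposition(n_vertices, edges, tree_edges, bags):
    """Check the tree-decomposition conditions for a graph on
    {0, ..., n_vertices-1} with the given edge list; `tree_edges` are
    the edges of the decomposition tree, `bags[t]` its vertex sets."""
    nodes = list(bags)
    # condition: every graph vertex occurs in some bag
    if set().union(*bags.values()) != set(range(n_vertices)):
        return False
    # condition: every graph edge is contained in some bag
    if any(not any({u, v} <= b for b in bags.values()) for u, v in edges):
        return False
    # condition: for each vertex, the tree nodes containing it are connected
    for v in range(n_vertices):
        hit = {t for t in nodes if v in bags[t]}
        induced = [(a, b) for a, b in tree_edges if a in hit and b in hit]
        # a sub-forest of a tree on |hit| nodes is connected iff it has
        # exactly |hit| - 1 edges
        if hit and len(induced) != len(hit) - 1:
            return False
    return True

def width(bags):
    return max(len(b) for b in bags.values()) - 1
```

For example, a path on three vertices admits the two-bag decomposition {0,1}, {1,2} of width one, matching the claim that trees have treewidth one.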
.
A partial isomorphism from a graph to a graph is a set such that all satisfy the equivalences , , and . We may view as a bijective mapping from a subset to a subset of that is an isomorphism from the induced subgraph to the induced subgraph . The isomorphism type of a -tuple is a label in some alphabet such that if and only if is a partial isomorphism from to .
Let and . The -dimensional Weisfeiler-Leman algorithm () computes a sequence of labellings from . We denote by the label assigned to the -tuple in round . The initial labelling assigns to each -tuple its isomorphism type . Then, for round ,
where is the multiset
As observed in Dell et al. (2018), if holds, then we can omit the entry from the tuples in , because all the information it contains is also contained in the entries of these tuples. Also, in the sense that if and only if for all . The algorithm is run until the labelling stabilises, i.e., for all , if and only if . We say that distinguishes two graphs and if the multisets of labels of all -tuples of vertices in and , respectively, do not coincide. Similar notions are in place for distinguishing -tuples, and for distinguishing graphs (or vertices) based on labels computed by a given number of rounds.
We remark that the algorithm given here is sometimes referred to as the “folklore” version of the -dimensional Weisfeiler-Leman algorithm. It is known that indistinguishability of graphs by is equivalent to indistinguishability by sentences in the -variable fragment of first-order logic with counting (Cai et al., 1992), and to for every graph of treewidth (Dvorak, 2010; Dell et al., 2018).
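For concreteness, here is a minimal sketch of the folklore version for dimension two, with colours on ordered vertex pairs; this is our own illustration of the update rule above, not an optimized implementation:

```python
def folklore_2wl(adj, rounds):
    """Folklore 2-WL: colours live on ordered vertex pairs; each round a
    pair (u, v) is refined by the multiset, over all vertices w, of the
    pair of colours it currently shares with w."""
    n = len(adj)
    V = range(n)
    # initial colour = isomorphism type of the pair (u, v)
    col = {(u, v): (u == v, bool(adj[u][v])) for u in V for v in V}
    for _ in range(rounds):
        new = {
            (u, v): (col[u, v],
                     tuple(sorted((col[u, w], col[w, v]) for w in V)))
            for u in V for v in V
        }
        # compress colours to small integers after each round
        table = {c: i for i, c in enumerate(sorted(set(new.values())))}
        col = {p: table[c] for p, c in new.items()}
    return col
```

Run on the disjoint union of a 6-cycle and two triangles (which 1-WL cannot tell apart), the diagonal colours separate the two components, reflecting the fact that 2-WL can count triangles.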
B.2 Proof of Proposition 2
We show that for each finite set of patterns, the expressive power of is bounded by , where is the largest treewidth of a pattern in .
We first recall the following characterisation of -equivalence (Dvorak, 2010; Dell et al., 2018). For any two graphs and ,
for every graph of treewidth at most . On the other hand, we know from Theorem 1 that
for every -pattern tree . Hence, we may conclude that
if we can show that any -pattern tree has treewidth at most .
Suppose that is the maximal treewidth of a pattern in . To conclude the proof, we verify that the treewidth of any -pattern tree is bounded by .
Lemma 1.
If is the maximal treewidth of a pattern in , then the treewidth of any -pattern tree is bounded by .
Proof.
The proof is by induction on the number of patterns joined at any leaf of . Clearly, if no patterns are joined, then is simply a tree and its treewidth is . Otherwise, consider a -pattern tree whose treewidth is at most , and a pattern of treewidth that is to be joined at vertex of . By the induction hypothesis, there is a decomposition for witnessing its bounded treewidth, that is,
1. The union of all , for , is equal to ;
2. The set is connected, for all ;
3. For each there is with ; and
4. The size of each set is at most .
Likewise, by assumption, for pattern we have such a tree decomposition, say .
Now consider any vertex of the decomposition of such that contains vertex in to which is to be joined at its root. We can create a joint tree decomposition for the join of and (at node ) by merging and with an edge from vertex in to the root of (recall is a tree by definition). It is readily verified that this decomposition maintains all necessary properties. Indeed, condition 1 is clearly satisfied since and combined cover all vertices of the join of with . Furthermore, since the only node shared by and is the join node, and we merge and by putting an edge from node in to the root of , connectivity of is guaranteed and condition 2 is satisfied. Moreover, since the operation of joining and does not create any extra edges, condition 3 is immediately verified, and so is condition 4, because we do not create any new vertices, neither in nor in , and we already know that and are bounded by . ∎
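The four conditions used in this proof are mechanical to verify on concrete instances. Below is a small checker of ours (an illustrative sketch, not part of the paper's artefacts) that tests whether candidate bags and a tree over them form a valid tree decomposition, and reports its width.

```python
def is_tree_decomposition(graph_edges, bags, tree_edges):
    """Check the tree-decomposition conditions for `bags` (node -> set of
    graph vertices) connected by `tree_edges` (edges over bag nodes)."""
    covered = set().union(*bags.values())
    # Condition 1: every graph vertex occurs in some bag.
    if {u for e in graph_edges for u in e} - covered:
        return False
    # Condition 3: every graph edge is contained in some bag.
    if any(not any({u, v} <= bag for bag in bags.values())
           for u, v in graph_edges):
        return False
    # Condition 2: for each vertex, the bags containing it form a subtree.
    for x in covered:
        nodes = {t for t, bag in bags.items() if x in bag}
        start = next(iter(nodes))
        reached, frontier = {start}, [start]
        while frontier:
            t = frontier.pop()
            for a, b in tree_edges:
                for s, o in ((a, b), (b, a)):
                    if s == t and o in nodes and o not in reached:
                        reached.add(o)
                        frontier.append(o)
        if reached != nodes:
            return False
    return True

def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For the 4-cycle with bags {0, 1, 2} and {0, 2, 3} joined by a single tree edge, the checker accepts and reports width 2, in line with cycles having treewidth two.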
B.3 Proof of Theorem 2
We show that if contains a pattern which is a core and has treewidth , then is not bounded by . In other words, we construct two graphs and that can be distinguished by but not by . It suffices to find such graphs that can be distinguished by but not by . The proof relies on the characterisation of indistinguishability in terms of the -variable fragment of first order logic with counting and of -pebble bijective games in particular (Cai et al., 1992; Hella, 1996). More precisely, if and only if no sentence in can distinguish from . In other words, for any sentence in , if and only if . We denote indistinguishability by by . We heavily rely on the constructions used in Atserias et al. (2007) and Bova & Chen (2019). In fact, we show that the graphs and constructed in those works suffice for our purpose, by extending their strategy for the -pebble game to -pebble bijective games.
Construction of the graphs and .
Let be a pattern in which is a core and has treewidth . For a vertex , we gather all its edges in . Let be one of the vertices in .
For , as vertex set we take vertices of the form with and . We require that
For , as vertex set we take vertices of the form with and . We require that , for all . We observe that and have the same number of vertices.
The edge sets and of and , respectively, are defined as follows: and are adjacent if and only if and furthermore,
It is known that (here it is used that is a core), and that and are indistinguishable by means of sentences in the -variable fragment of first order logic (Atserias et al., 2007; Bova & Chen, 2019). To show our theorem, we thus need to verify that as well. Indeed, if this holds, then yet . Theorem 1 implies that for to hold, , which we know not to be true. Hence, , as desired.
Showing -indistinguishability of and .
We next show that the graphs and are indistinguishable by sentences in . This will be shown by verifying that the Duplicator has a winning strategy for the -pebble bijective game on and (Hella, 1996).
The -pebble bijective game. We recall that the -pebble bijective game is played between two players, the Spoiler and the Duplicator, each placing at most pebbles on the vertices of and , respectively. The game is played in a number of rounds. The pebbles placed after round are typically represented by a partial function . When is defined, say, , this means that the Spoiler places the th pebble on vertex and the Duplicator places the th pebble on . Initially, no pebbles are placed on and and hence is undefined everywhere.
Then in round , the game proceeds as follows:
1. The Spoiler selects a pebble in . All other already placed pebbles are kept on the same vertices. We define for all , .
2. The Duplicator responds by choosing a bijection . This bijection should be consistent with the pebbles in the restriction of to . That is, for every , , if then .
3. Next, the Spoiler selects an element .
4. The Duplicator defines . Hence, after this round, the th pebble is placed on by the Spoiler and on by the Duplicator.
Let be the elements in for which is defined. For denote by the pair of vertices on which the th pebble is placed. The Duplicator wins round if the mapping is a partial isomorphism between and . More precisely, it should hold that for all edges if and only if . In this case, the game continues to the next round. Infinite games are won by the Duplicator. A winning strategy consists of defining a bijection in Step 2 of each round, allowing the game to continue, regardless of on which vertex the Spoiler places a pebble in Step 3.
Winning strategy. We will now provide a winning strategy for the -pebble bijective game on our constructed graphs and . We recall that and have the same number of vertices, so a bijection between and exists. We show how the Duplicator can select a “good” bijection in Step 2 of the game, by induction on the number of rounds.
To state our induction hypothesis, we first recall some notions and properties from Atserias et al. (2007) and Bova & Chen (2019).
Let be a walk in and let be an edge in . Then, denotes the number of occurrences of the edge in the walk. More precisely, if is a walk in of length , then
Furthermore, for a subset , we define
where is an arbitrary bramble of of order . A bramble is a set of connected subsets of such that for any two elements and in , either , or there exists a vertex and such that . The order of a bramble is the minimum size of a hitting set for . It is known that has treewidth if and only if it has a bramble of order . In what follows, we let be any such bramble.
Lemma 2 (Lemma 14 in Bova & Chen (2019)).
For any , let be vertices in . Let be a walk in from to . For all , let be defined by
for all . Then, the mapping , for all , is a partial isomorphism from to .∎
We use this lemma to show that the bijection (to be defined shortly) selected by the Duplicator induces a partial isomorphism between and on the pebbled vertices.
We can now state our induction hypothesis: In each round , there exists a bijection which is
(a) consistent with the pebbles in the restriction of to (recall, pebble is selected by the Spoiler in Step 1); and
(b) if for , then there exists a walk in , from to , such that
where for every . In other words, on the vertices in pebbled by , the bijection is, by the previous Lemma, a partial isomorphism from to .
If this holds, then the strategy for the Duplicator is selecting that bijection in each round.
Verification of the induction hypothesis. We assume that the special vertex in has at least two neighbours. Such a vertex exists since otherwise consists of a single edge while we assume to be of treewidth at least two.
Base case. For the base case (), we define two walks: and , where and are neighbours of . We define with if , and with .
The mapping is a bijection from to . We note that it suffices to show that is injective since and contain the same number of vertices. Since whenever , we can focus on comparing and with . This implies that for at least one edge . Clearly, this implies that . In fact, this holds for any walk and thus in particular for and . We further observe that is consistent simply because no pebbles have been placed yet. For the same reason we can take the walk to be either or .
Inductive case. Assume that the induction hypothesis holds for round and consider round . Let for . By induction, there exists a bijection such that and furthermore, for every , for some walk from to .
Assume that the Spoiler selects in Step 1 in round . We define the Duplicator’s bijection for round , as follows. Recall that is the vertex in which the walk ends.
• For all such that , we define where for each :
• For all , we will extend with a walk so that it ends in a vertex different from . Suppose that such that . We want to find an such that . We can then take to be a vertex in and since and are both connected, and either have a vertex in common or an edge between them, we can let be a walk from to entirely in and . Now, such an exists since otherwise would be a hitting set for of size at most . We know, however, that any hitting set must be of size because of the treewidth assumption for . We now define the bijection as where for each :
This concludes the definition of . We need to verify a couple of things: (i) is a bijection; (ii) is consistent with all pebbles in except for the “unpebbled” one ; and (iii) it induces a partial isomorphism on the pebbled vertices.
(i) is a bijection. Since and are of the same size, it suffices to show that is an injection. Clearly, whenever . We can thus focus on and with . Then, and differ in at least one edge and for this edge:
for any walk . In particular, this holds for both walks used in the definition of : , used when , and used when . Hence, is indeed a bijection.
(ii) is consistent. For each with , let . Now, by induction, ended in a vertex distinct from any of these ’s and thus none of these ’s is equal to . This implies that with . But this is precisely how placed its pebbles, by induction. Hence, and thus is consistent.
(iii) induces a partial isomorphism. After the Spoiler picked an element , we now know that for all . We recall that is defined in two possible ways, using two distinct walks: , for vertices in not involving , or, otherwise, using the walk , for vertices in involving .
Hence, when all ’s for are distinct from , then with and we can simply take the new walk to be . Then, Lemma 2 implies that the mapping , for is a partial isomorphism from to , as desired.
Otherwise, we know that for but . That is, the Spoiler places the th pebble on a vertex of the form in . We now have that is defined in two ways for the pebbled elements using the two distinct walks. We next show that can be used for both types of pebbled elements in , those of the form with and . For the last type this is obvious since we defined in terms of . For the former type, we note that and for . If we take an edge , then because lies entirely in and . As a consequence, for with , for all :
Then, Lemma 2 implies that the mapping , for is a partial isomorphism from to , because we can use the same walk for all pebbled vertices.
B.4 Proof of Proposition 3
We show that no finite set of patterns suffices for to be equivalent to , for , in terms of expressive power. The proof is by contradiction. That is, suppose that there exists a set such that for any two graphs and . In particular, and thus also , since the -test is upper bounded by any -test for . We argue that no finite set exists satisfying .
Let denote the maximum number of vertices of any pattern in . (Strictly speaking, we can use the diameter of any pattern in instead, but it is easier to convey the proof simply by taking the number of vertices.) Furthermore, consider graphs and , where is the disjoint union of copies of the cycle , and is the union of copies of the cycle . Note that and have the same number of vertices.
We observe that any homomorphism from a pattern in to or , for vertices and , maps to either a copy of (for ) or a copy of (for ). Furthermore, any such homomorphism maps into a subgraph of or consisting of at most vertices. There is, however, a unique (up to isomorphism) such subgraph on vertices in and : it is a path of length . This implies that for any and . Since the argument holds for any pattern in , all vertices in and will have the same homomorphism counts for patterns in . Furthermore, since both and are regular graphs (each vertex has degree two), this implies that cannot distinguish between and . This is formalised in the following lemma. We recall that a -regular graph is a graph in which every vertex has degree .
Lemma 3.
For any set of patterns and any two -regular (unlabelled) graphs and such that for , , and holds, .
Proof.
The lemma is readily verified by induction on the number of rounds of . We show a stronger result: for any , , and , from which follows. By our Theorem 1, it suffices to show that for -pattern trees of depth at most . Let . For the base case, let be a join pattern for some . Then,
since for any . Then, for the inductive case, assume that for any -pattern tree of depth at most , , and , and consider an -pattern tree of depth . Let be the -pattern trees of depth at most rooted at the children of in the backbone of . As before, let be the pattern joined at in . Then,
where we used that and both consist of vertices (regularity), that by the induction hypothesis all vertices have the same homomorphism counts for -pattern trees of depth at most , and where and are taken to be arbitrary vertices in and , respectively. ∎
Hence, since and are -regular and satisfy the conditions of the lemma, we may indeed infer that . We note, however, that . Indeed, from Dvorak (2010) and Dell et al. (2018) we know that implies that for any graph of treewidth at most two. In particular, implies that for all cycles . We now conclude by observing that by construction. We have thus found two graphs that cannot be distinguished by , but that can be distinguished by , contradicting our assumption that .
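The kind of separation used in this proof can be checked numerically on a toy instance: a 6-cycle and two disjoint triangles are both 2-regular (hence equivalent under colour refinement), yet their triangle homomorphism counts differ. A brute-force counter, feasible only for small patterns (an illustrative sketch of ours, not the paper's code):

```python
from itertools import product

def hom(pattern_edges, pattern_size, adj):
    """Brute-force hom count: maps from pattern vertices 0..pattern_size-1
    to graph vertices sending every pattern edge to a graph edge."""
    return sum(
        1
        for f in product(list(adj), repeat=pattern_size)
        if all(f[v] in adj[f[u]] for u, v in pattern_edges)
    )

triangle = [(0, 1), (1, 2), (2, 0)]
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
                 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
c6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
```

Here `hom(triangle, 3, two_triangles)` is 12 while `hom(triangle, 3, c6)` is 0 (a 6-cycle is bipartite), so homomorphism counts separate two graphs that 1-WL cannot.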
Appendix C Proofs of Section 5
C.1 Proof of Proposition 4
We show that for any , is more expressive than . More precisely, we construct two graphs and such that and cannot be distinguished by , but they can be distinguished by .
The proof is analogous to the proof of Proposition 3. Indeed, it suffices to let consist of disjoint copies of and to consist of disjoint copies of . Then, as observed in the proof of Proposition 3, and will be indistinguishable by simply because each pattern has at most vertices. Yet, by construction, and thus and are distinguishable (already by the initial labelling) by .
C.2 Proof of Proposition 5
Let be a pattern that is the join of two smaller patterns. We show that for any set of patterns, we have that is upper bounded by . That is, for every two graphs and , implies . By definition, is equivalent to . In other words, with every we can associate a unique such that . We show, by induction on , that this implies that . This suffices to conclude that and thus .
Base case. We show that implies that with every we can associate a unique satisfying . Indeed, as already observed, implies that with every we can associate a unique such that . This in turn implies that , which implies that and and for every . As a consequence, from properties of the graph join operators, since :
and thus also .
Inductive case. We assume that implies , and want to show that it also implies . We again use the fact that we can associate with every a unique vertex such that . In particular, this implies that and . From the definition of the -test, it must also be the case that the multisets and are equal, i.e., we can find a one-to-one correspondence between neighbours of in and neighbours of in that have the same label. From the induction hypothesis we then have that and also that the multisets and are equal, which implies, by the definition of the -test, that , as was to be shown.
C.3 Proof of Theorem 3
We show that , where is a pattern whose core has treewidth , is more expressive than if every pattern satisfies one of the following conditions: (i) has treewidth ; or (ii) does not map homomorphically to .
Let denote the (rooted) core of , in which the root of is any vertex which is the image of the root of in a homomorphism from to . By assumption, has treewidth .
Clearly, is upper bounded by . Thus, all we need for the proof is to find two graphs that are indistinguishable by but are in fact distinguished by .
Those two graphs are, in fact, the graphs and constructed for (of treewidth ) in the proof of Theorem 2. From that proof, we know that:
(a) and ; and
(b) .
We note that (a) immediately implies that and can be distinguished by . In fact, they are distinguished in already by the initial labelling in round . We next show that and are indistinguishable by .
Let us first present a small structural result that helps us deal with patterns in satisfying the second condition of the Theorem.
Lemma 4.
If a rooted pattern does not map homomorphically to , then
Proof.
We use the following property of graphs and , which can be directly observed from their construction (and was already noted in Atserias et al. (2007) and Bova & Chen (2019)). Define and by setting as their root any vertex , for the root of . Then there is a homomorphism from to , and there is a homomorphism from to .
Now, any homomorphism from to can be extended to a homomorphism from to : we compose with the homomorphism mentioned above from to , which by definition again maps homomorphically to . Since by definition we have that does not map to , cannot exist. The proof for is analogous. ∎
Now, let be the set of patterns obtained by removing from all patterns which do not map homomorphically to . By Lemma 4, we have that and are distinguished by the -test if and only if they are distinguished by .
But all patterns in must have treewidth less than , and by (b) and are indistinguishable by . Proposition 2 then implies that and are indistinguishable by , as desired.
Appendix D Connections to related work
We here provide more details on how - connect to from the literature that also augment the initial labelling.
Vertex degrees.
We first consider so-called degree-aware (Geerts et al., 2020) in which the message functions of the may depend on the vertex degrees. The Graph Convolution Networks (s) (Kipf & Welling, 2017) are an example of such . Degree-aware are known to be equivalent, in terms of expressive power, to standard in which the initial labelling is extended with vertex degrees (Geerts et al., 2020). Translated to our setting, we can simply let since is equal to the degree of vertex in . When considering graphs without an initial vertex labelling (or a uniform labelling which assigns every vertex the same label), our characterisation (Theorem 1) implies that if and only if for every -pattern tree of depth at most . This in turn is equivalent to for every (standard) tree of depth at most . Indeed, -pattern trees of depth at most are simply trees of depth . Combining this with the characterisation of by Dvorak (2010) and Dell et al. (2018), we thus have for unlabelled graphs that if and only if . So, by considering - one gains one round of computation compared to standard . To lift this to labelled graphs, instead of one has to include labelled versions of the single edge pattern, in order to count the number of neighbours of each specific label for every vertex. This is done, e.g., by Ishiguro et al. (2020), who use the labelling obtained after the first round to augment the initial vertex labelling. This indeed corresponds to adding as a feature for every labelled tree of depth one. This results in if and only if for labelled graphs.
Walk counts.
The Graph Feature Networks by Chen et al. (2019a) can be regarded as a generalisation of the previous approach. Instead of simply adding vertex degrees, the numbers of walks of certain lengths emanating from each vertex are added. Translated to our setting, this corresponds to considering -, where denotes a rooted path of length . For unlabelled graphs, our characterisation (Theorem 1) implies that is upper bounded by , simply because every -pattern tree of depth is a standard tree of depth at most .
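Concretely, the homomorphism count of a rooted path of length ℓ at a vertex v equals the number of ℓ-step walks starting at v, i.e. the v-th entry of Aℓ·1. A minimal sketch of ours (not the Graph Feature Networks code):

```python
def walk_counts(adj, length):
    """Number of walks of the given length starting at each vertex; this equals
    the homomorphism count of a rooted path of that length at each vertex."""
    counts = {v: 1 for v in adj}  # a single walk of length 0 per vertex
    for _ in range(length):
        # One step of A * counts: sum the counts over each vertex's neighbours.
        counts = {v: sum(counts[u] for u in adj[v]) for v in adj}
    return counts
```

In a 2-regular graph every vertex has 2^ℓ walks of length ℓ, so these features alone cannot separate, e.g., disjoint unions of cycles of different lengths, consistent with the upper bound stated above.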
Cycles.
Li et al. (2019) extend by varying the notion of neighbourhood over which is aggregated. One particular instance corresponds to an aggregation of features, weighted by the number of cycles of a certain length in each vertex (see discussion at the end of Section 4 in Li et al. (2019)). Translated to our setting, this corresponds to considering - where denotes the cycle of length . As mentioned in the main body of the paper, these extend and result in architectures bounded by (Proposition 4). This is in line with Theorem 3 from Li et al. (2019) stating that their framework strictly extends and thus .
Isomorphism counts.
Another, albeit similar, approach to add structural information to the initial labelling is taken in the Graph Substructure Networks paper by Bouritsas et al. (2020). The idea there is to extend the initial features with information about how often a vertex appears in a subgraph of which is isomorphic to . More precisely, Bouritsas et al. (2020) consider a connected unlabelled graph as pattern and partition its vertex set orbit-wise. That is, where denotes the number of orbits of . Here, whenever there is an automorphism in mapping to . Next, they consider all distinct subgraphs in which are isomorphic to , denoted by for . We write when using a specific isomorphism . Then for each orbit partition and vertex , they define:
That is, the number of subgraphs in that can be isomorphically mapped to are counted, provided that this can be done by an isomorphism which maps vertex in (and thus ) to one of the vertices in the th orbit partition of the pattern. A similar notion is proposed for edges, which we will not consider here. Similar to our extended features, the initial features of each vertex are then augmented with for some set of patterns. Standard are executed on these augmented initial features. We refer to Bouritsas et al. (2020) for more details.
We can view the above approach as an instance of our framework. Indeed, given a pattern in , for each orbit partition, we replace by a different rooted version , where is a vertex in . Which vertex in the orbit under consideration is selected as root is not important (because they are equivalent by definition of orbit). We then see that the standard notion of subgraph isomorphism counting directly translates to the quantity used in Bouritsas et al. (2020):
It thus remains to express in terms of homomorphism counts. This, however, follows from Curticapean et al. (2017) in which it is shown that can be computed by a linear combination of where ranges over all graphs on which can be mapped by means of a surjective homomorphism. For a given , the finite set of such patterns is called the spasm of in Curticapean et al. (2017) and can be easily computed.
In summary, given the set of patterns in Bouritsas et al. (2020), we first replace every by its rooted versions , for , and then expand the resulting set of rooted patterns, by the spasms of each of these patterns. Let us denote by the final set of rooted patterns. It now follows that for provides sufficient information to extract and thus also for every and orbit part . As a consequence, the from Bouritsas et al. (2020) are bounded by - and thus . Conversely, given an - one can, again using results by Curticapean et al. (2017), define a set of patterns, such that the subgraph isomorphism counts of patterns in can be used to compute the homomorphism counts of patterns in . Hence, - are upper bounded by the considered in Bouritsas et al. (2020) using patterns in . This is in agreement with Curticapean et al. (2017) in which it is shown that homomorphism counts, subgraph isomorphism counts and other notions of pattern counts are all interchangeable. Nevertheless, by using homomorphism counts one can gracefully extend known results about and , as we have shown in the paper, and add little overhead.
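In the simplest case the spasm expansion is trivial: a triangle has no pair of non-adjacent vertices to merge, so its spasm is the singleton containing the triangle itself and sub(K3, G) = hom(K3, G)/|Aut(K3)| = hom(K3, G)/6. A brute-force check of this identity (our illustrative sketch, not code from any of the cited works):

```python
from itertools import product

def hom_triangle(adj):
    """Brute-force hom(K3, G): ordered vertex triples forming a triangle."""
    return sum(
        1
        for a, b, c in product(adj, repeat=3)
        if b in adj[a] and c in adj[b] and a in adj[c]
    )

def triangle_count(adj):
    # The spasm of K3 is {K3}, so the subgraph isomorphism count is
    # hom(K3, G) / |Aut(K3)| = hom(K3, G) / 6.
    return hom_triangle(adj) // 6

k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
```

K4 contains four triangles and hom(K3, K4) = 24 = 4 · 6, confirming the identity on this instance; for patterns with non-adjacent vertices the full linear combination over the spasm from Curticapean et al. (2017) is needed.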
Appendix E Additional experimental information
E.1 Experimental setup
One of the crucial questions when studying the effect of adding structural information to the initial vertex labels is whether these additional labels enhance the performance of graph neural networks. In order to reduce the effect of specific implementation details of and choice of hyper-parameters, we start from the implementations and choices made in the benchmark by Dwivedi et al. (2020) (the original implementations can be found at https://github.com/graphdeeplearning/benchmarking-gnns) and only change the initial vertex labels, while leaving the themselves unchanged. This ensures that we only measure the effect of augmenting initial features with homomorphism counts. We will use the from the benchmark, without extended features, as our baselines. For the same reasons, we use datasets proposed in the benchmark for their ability to statistically separate the performance of . All other parameters are taken as in Dwivedi et al. (2020) and we refer to that paper for more details.
Selected
Dwivedi et al. (2020) divide the benchmarked into two classes: the and the “theoretically designed” -. The first class is found to perform stronger and train faster. Hence, we chose to include the following five models from the benchmark:
• Graph Attention Network () as described in Velickovic et al. (2018);
• Graph Convolutional Network () as described in Kipf & Welling (2017);
• as described in Hamilton et al. (2017);
• Mixed Model Convolutional Networks () as described in Monti et al. (2017);
• as described in Bresson & Laurent (2017).
For we used the version in which positional encoding (Belkin & Niyogi, 2003) is added to the vertex features, as it is empirically shown to be the strongest performing version of this model for the selected datasets (Dwivedi et al., 2020). We denote this version by , referring to the presence of edge features and this positional encoding. Details, background and a mathematical formalisation of the message passing layers of these models can be found in the supplementary material of Dwivedi et al. (2020).
As explained in the experimental section of the main paper, we enhance the vertex features with the log-normalised counts of the chosen patterns in every vertex of every graph of every dataset. The first layers of some models of Dwivedi et al. (2020) are adapted to accommodate this variation in input size. All other layers were left identical to their original implementation as provided by Dwivedi et al. (2020).
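The preprocessing step amounts to a one-time concatenation of features. A minimal sketch under assumptions of ours (dict-based features and log(1 + count) as the log-normalisation; the benchmark code differs in names and data structures):

```python
import math

def augment_features(features, hom_counts):
    """Concatenate log-normalised pattern counts onto each vertex's features.
    `features`: vertex -> list of floats; `hom_counts`: vertex -> list of
    counts, one per chosen pattern."""
    return {v: feats + [math.log(1.0 + c) for c in hom_counts[v]]
            for v, feats in features.items()}
```

The augmented vectors are then fed to the unmodified message passing layers, so any performance difference is attributable to the added counts.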
Hardware, compute and resources
All models for ZINC, PATTERN and COLLAB were trained on a GeForce GTX 1080 Ti GPU; for CLUSTER a Tesla V100-SXM3-32GB GPU was used. Tables 7, 10, 13 and 16 report the training times for all combinations of models and additional feature sets. A rough estimate of the CO2 emissions based on the total computing times of reported experiments ( hours GeForce GTX 1080, hours Tesla V100-SXM3-32GB), the computing times of not-included experiments ( hours GeForce GTX 1080, hours Tesla V100-SXM3-32GB), the GPU types (GeForce GTX 1080, Tesla V100-SXM3-32GB) and the geographical location of our cluster results in a carbon emission of kg CO2 equivalent. This estimation was conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
E.2 Graph learning tasks
We here report the full results of our experimental evaluation for graph regression (Section E.2.1), link prediction (Section E.2.2) and vertex classification (Section E.2.3) as considered in Dwivedi et al. (2020). More precisely, a full listing of the patterns and combinations used and the obtained results for the test sets can be found in Tables 5, 8, 11 and 14. Average training time (in hours) and the number of epochs are reported in Tables 7, 10, 13 and 16. Finally, the total number of model parameters are reported in Tables 6, 9, 12 and 15. All averages and standard deviations are over 4 runs with different random seeds. The main take-aways from these results are included in the main paper.
E.2.1 Graph regression with the ZINC dataset
Just as in Dwivedi et al. (2020), we use a subset (K) of the ZINC molecular graphs (K) dataset (Irwin et al., 2012b) to regress a molecular property known as the constrained solubility. For each molecular graph, the vertex features are the types of heavy atoms and the edge features are the types of bonds between them. The following settings are taken from Dwivedi et al. (2020):
Splitting. ZINC has train, validation and test graphs.
Training. (Here and in the next tasks we are using the parameters used in the code accompanying Dwivedi et al. (2020); in the paper, slightly different parameters are used.) For the learning rate strategy, an initial learning rate is set to , the reduce factor is , the stopping learning rate is , the patience value is 25, and the maximal training time is set to hours.
Performance Measure. The performance measure is the mean absolute error (MAE) between the predicted and the ground truth constrained solubility for each molecular graph.
Number of layers. 16 MPNN layers are used for every model.
Pattern set | | | | |
---|---|---|---|---|---
None | 0.47±0.02 | 0.35±0.01 | 0.25±0.01 | 0.44±0.01 | 0.34±0.05
 | 0.45±0.01 | 0.36±0.01 | 0.25±0.00 | 0.44±0.00 | 0.30±0.01
 | 0.34±0.02 | 0.29±0.02 | 0.26±0.01 | 0.30±0.01 | 0.27±0.06
 | 0.44±0.02 | 0.34±0.02 | 0.23±0.01 | 0.42±0.01 | 0.27±0.03
 | 0.31±0.00 | 0.27±0.02 | 0.25±0.01 | 0.30±0.01 | 0.26±0.09
 | 0.33±0.01 | 0.27±0.01 | 0.24±0.02 | 0.32±0.01 | 0.23±0.03
 | 0.28±0.01 | 0.26±0.01 | 0.23±0.01 | 0.28±0.01 | 0.20±0.03
 | 0.24±0.00 | 0.21±0.00 | 0.20±0.00 | 0.25±0.01 | 0.16±0.02
 | 0.23±0.00 | 0.21±0.00 | 0.20±0.01 | 0.26±0.02 | 0.18±0.02
(hom) | 0.22±0.01 | 0.20±0.00 | 0.19±0.00 | 0.2376±0.01 | 0.1352±0.01
(iso) | 0.24±0.01 | 0.22±0.01 | 0.16±0.01 | 0.2408±0.01 | 0.1357±0.01
Pattern set | |||||
---|---|---|---|---|---|
None | 358 273 | 360 742 | 388 963 | 401 148 | 408 135 |
358 417 | 360 887 | 389 071 | 401 238 | 408 205 | |
358 417 | 360 887 | 389 071 | 401 238 | 408 205 | |
358 417 | 360 887 | 389 071 | 401 238 | 408 205 | |
358 417 | 360 887 | 389 071 | 401 238 | 408 205 | |
358 561 | 361 032 | 389 179 | 401 328 | 408 275 | |
358 561 | 361 032 | 389 179 | 401 328 | 408 275 | |
358 705 | 361 177 | 389 287 | 401 418 | 408 345 | |
358 849 | 361 322 | 389 395 | 401 508 | 408 415 | |
(hom) | 359 425 | 361 902 | 389 827 | 401 868 | 408 695 |
(iso) | 359 425 | 361 902 | 389 827 | 401 868 | 408 695 |
Pattern set | Time | Epochs | Time | Epochs | Time | Epochs | Time | Epochs | Time | Epochs
---|---|---|---|---|---|---|---|---|---|---
None | 2.40 | 377 | 10.99 | 463 | 2.46 | 420 | 1.53 | 345 | 12.08 | 136
 | 2.88 | 444 | 12.03 | 363 | 2.03 | 500 | 0.91 | 298 | 12.07 | 148
 | 2.30 | 351 | 11.36 | 324 | 2.31 | 396 | 1.70 | 382 | 12.06 | 139
 | 2.42 | 375 | 12.03 | 333 | 1.70 | 444 | 1.06 | 370 | 12.06 | 202
 | 2.40 | 369 | 9.98 | 421 | 2.58 | 446 | 1.25 | 288 | 12.08 | 136
 | 2.98 | 461 | 12.03 | 332 | 2.56 | 458 | 1.41 | 321 | 12.09 | 132
 | 2.76 | 422 | 12.04 | 319 | 2.67 | 464 | 1.53 | 356 | 12.06 | 137
 | 2.45 | 381 | 10.13 | 419 | 1.67 | 463 | 1.04 | 382 | 12.04 | 229
 | 2.65 | 408 | 10.38 | 420 | 2.09 | 503 | 1.26 | 364 | 12.08 | 135
(hom) | 2.65 | 428 | 12.03 | 350 | 2.76 | 478 | 1.48 | 363 | 12.06 | 175
(iso) | 2.78 | 497 | 11.72 | 419 | 2.63 | 547 | 1.58 | 440 | 11.62 | 148
E.2.2 Link Prediction with the Collab dataset
Another dataset used in Dwivedi et al. (2020) is COLLAB, a link prediction dataset proposed by the Open Graph Benchmark (OGB) (Hu et al., 2020) corresponding to a collaboration network between approximately K scientists, indexed by Microsoft Academic Graph. Vertices represent scientists and edges denote collaborations between them. For vertex features, OGB provides -dimensional vectors, obtained by averaging the word embeddings of a scientist’s papers. The year and number of co-authored papers in a given year are concatenated to form edge features. The graph can also be viewed as a dynamic multi-graph, since two vertices may have multiple temporal edges between them if they collaborate over multiple years. The following settings are taken from Dwivedi et al. (2020):
Splitting. We use the real-life training, validation and test edge splits provided by OGB. Specifically,
they use collaborations until 2017 as training edges, those in 2018 as validation edges, and those in
2019 as test edges.
Training. All GNNs use the same learning rate strategy: an initial learning rate is set to ,
the reduce factor is , the patience value is 10, and the stopping learning rate is .
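The strategy above is a standard reduce-on-plateau schedule with a stopping threshold on the learning rate. A minimal sketch of its bookkeeping (the rate, factor, and threshold values below are illustrative, since only the patience of 10 is given in the text):

```python
def lr_schedule(val_losses, init_lr, factor, patience, stop_lr):
    """Replay the plateau-based schedule: multiply the learning rate by
    `factor` after `patience` epochs without improvement of the
    validation loss, and stop once the rate drops below `stop_lr`."""
    lr, best, bad_epochs = init_lr, float("inf"), 0
    history = []  # learning rate actually used at each epoch
    for loss in val_losses:
        if lr < stop_lr:  # stopping learning rate reached
            break
        history.append(lr)
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # plateau: reduce the rate
                lr, bad_epochs = lr * factor, 0
    return history

# Illustrative run: the loss plateaus after epoch 3, so the rate halves
# once the patience of 2 epochs is exhausted.
rates = lr_schedule([3, 2, 1, 1, 1, 1, 1],
                    init_lr=1e-3, factor=0.5, patience=2, stop_lr=1e-5)
```

In practice this corresponds to PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` combined with an early-stopping check on the optimizer's current rate.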
Performance Measure. We use the evaluator provided by OGB (Hu et al., 2020), which aims to measure a model’s
ability to predict future collaboration relationships given past collaborations. Specifically, they rank
each true collaboration among a set of randomly sampled negative collaborations, and count
the ratio of positive edges that are ranked at the -th place or above (Hits@K). The value of is used, as this gives the best
value for statistically separating the performance of the models.
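The Hits@K metric described above can be sketched as follows; the scores and the value of k in the example are illustrative (the dataset's actual K is not reproduced here):

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges whose predicted score ranks at place k
    or above among the negative-edge scores (higher score = better)."""
    neg_sorted = sorted(neg_scores, reverse=True)
    # A positive edge is a "hit" when it beats the k-th highest negative score.
    threshold = neg_sorted[k - 1] if k <= len(neg_sorted) else float("-inf")
    return sum(1 for s in pos_scores if s > threshold) / len(pos_scores)

# Toy example: three of the four positives outrank the 2nd-best negative.
print(hits_at_k([0.9, 0.8, 0.7, 0.1], [0.6, 0.5, 0.4, 0.3], k=2))  # 0.75
```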
Number of layers 3 MPNN layers are used for every model.
Pattern set | |||||
---|---|---|---|---|---|
None | 50,32 ± 0,55 | 51,36 ± 1,30 | 49,81 ± 1,56 | 50,33 ± 0,68 | 51,00 ± 2,54 |
52,87 ± 0,87 | 53,57 ± 0,89 | 50,18 ± 1,38 | 51,10 ± 0,38 | 51,57 ± 0,68 | |
51,33 ± 1,42 | 52,84 ± 1,32 | 51,76 ± 1,38 | 51,13 ± 1,60 | 49,43 ± 1,85 | |
52,41 ± 0,89 | 54,60 ± 1,01 | 50,94 ± 1,30 | 51,39 ± 1,23 | 50,31 ± 1,59 | |
52,68 ± 1,82 | 53,49 ± 1,35 | 50,88 ± 1,73 | 50,97 ± 0,68 | 51,36 ± 0,92 | |
51,81 ± 1,17 | 54,32 ± 1,02 | 49,94 ± 0,23 | 51,01 ± 1,00 | 51,11 ± 1,06
Pattern set | |||||
---|---|---|---|---|---|
None | 25 992 | 40 479 | 39 751 | 26 487 | 27 440 |
26 049 | 40 553 | 39 804 | 26 525 | 27 475 | |
26 049 | 40 553 | 39 804 | 26 525 | 27 475 | |
26 049 | 40 553 | 39 804 | 26 525 | 27 475 | |
26 106 | 40 627 | 39 857 | 26 563 | 27 510 | |
26 163 | 40 701 | 39 910 | 26 601 | 27 545 |
Model:

Pattern set | Time | #Epochs | Time | #Epochs | Time | #Epochs | Time | #Epochs | Time | #Epochs
---|---|---|---|---|---|---|---|---|---|---
None | 0,81 | 167 | 0,85 | 141 | 1,62 | 190 | 12,05 | 115,67 | 2,22 | 167
| 0,67 | 165 | 0,90 | 153 | 1,70 | 184 | 12,10 | 67,00 | 2,48 | 186
| 1,06 | 188 | 0,95 | 160 | 2,16 | 188 | 12,04 | 113,50 | 1,26 | 188
| 0,50 | 167 | 1,13 | 165 | 1,04 | 193 | 12,05 | 124,00 | 1,82 | 174
| 1,20 | 189 | 0,86 | 128 | 2,15 | 189 | 12,05 | 113,25 | 1,51 | 183
| 0,44 | 149 | 0,90 | 134 | 0,98 | 186 | 12,05 | 124,00 | 1,84 | 177
E.2.3 Vertex classification with PATTERN and CLUSTER
Finally, also used in Dwivedi et al. (2020) are the PATTERN and CLUSTER graph datasets, generated with the Stochastic Block Model (SBM) (Abbe, 2018), which is widely used to model communities in social networks by modulating the intra- and extra-community connections, thereby controlling the difficulty of the task. An SBM is a random graph model that assigns a community to each vertex and connects vertices as follows: two vertices are connected with probability if they belong to the same community, and with probability if they belong to different communities (the value of acts as the noise level).
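A minimal sketch of SBM sampling as defined above (the community sizes and probabilities in the example are illustrative, not the datasets' actual parameters):

```python
import random

def sbm(community_sizes, p, q, seed=0):
    """Sample a stochastic block model graph: vertices in the same
    community are joined with probability p, vertices in different
    communities with probability q (q acts as the noise level)."""
    rng = random.Random(seed)
    # community[v] = index of the community that vertex v belongs to
    community = [c for c, size in enumerate(community_sizes)
                 for _ in range(size)]
    n = len(community)
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < (p if community[u] == community[v] else q)]
    return community, edges

# Two communities of 5 vertices: dense inside (p=0.9), sparse across (q=0.1).
community, edges = sbm([5, 5], p=0.9, q=0.1)
```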
For the PATTERN dataset, the goal of the vertex classification problem is the detection of a certain pattern embedded in a larger graph . The graphs in consist of communities with sizes randomly selected between . The parameters of the SBM for each community are , , and the vertex features in are generated using a uniform random distribution with a vocabulary of size , i.e., . Randomly, patterns composed of vertices with intra-probability and extra-probability are generated (i.e., of the vertices in are connected to ). The vertex features for are also generated randomly using values in . The graphs consist of - vertices. The output vertex labels have value if the vertex belongs to and value if it belongs to .
For the CLUSTER dataset, the goal of the vertex classification task is to detect which cluster a vertex belongs to. Here, six SBM clusters are generated with sizes randomly selected between and probabilities and . The graphs consist of - vertices. Each vertex can take an initial feature value in the range . If the value is , then the vertex belongs to class . If the value is , then the class of the vertex is unknown and needs to be inferred. Only one labelled vertex is randomly assigned to each community, and most vertex features are set to . The output vertex labels are defined as the community/cluster class labels.
The following are taken from Dwivedi et al. (2020):
Splitting. The PATTERN dataset has train, validation and test graphs. The CLUSTER
dataset has train, validation and test graphs. We save the generated splits and use the
same sets in all models for fair comparison.
Training. For all , the initial learning rate is set to , the reduce
factor is , the patience value is , and the stopping learning rate is
.
Performance Measure. The performance measure is the average vertex-level accuracy weighted with
respect to the class sizes.
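Assuming this measure is the macro average of per-class accuracies (each class contributes equally regardless of how many vertices it has), a minimal sketch:

```python
def weighted_accuracy(y_true, y_pred):
    """Vertex-level accuracy averaged over classes, so every class
    contributes equally regardless of its size."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        members = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in members if y_pred[i] == c)
        per_class.append(correct / len(members))
    return sum(per_class) / len(per_class)

# Class 0 has accuracy 2/3 and class 1 has accuracy 1, so the score is 5/6.
print(weighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))
```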
Number of layers 16 MPNN layers are used for every model.
Pattern set | |||||
---|---|---|---|---|---|
None | 70,86 ± 0,06 | 70,64 ± 0,39 | 71,15 ± 0,33 | 72,25 ± 0,52 | 74,28 ± 0,15 |
71,60 ± 0,15 | 64,88 ± 4,16 | 72,21 ± 0,19 | 72,97 ± 0,23 | 74,14 ± 0,12 | |
71,40 ± 0,24 | 60,64 ± 2,93 | 72,14 ± 0,19 | 72,57 ± 0,19 | 74,16 ± 0,24 | |
71,26 ± 0,39 | 66,60 ± 1,47 | 72,34 ± 0,09 | 72,60 ± 0,24 | 74,23 ± 0,07 | |
71,80 ± 0,28 | 50,94 ± 22,98 | 72,32 ± 0,27 | 73,03 ± 0,25 | 74,17 ± 0,13 | |
71,63 ± 0,26 | 63,03 ± 3,72 | 72,32 ± 0,36 | 72,65 ± 0,13 | 74,03 ± 0,19
Pattern set | |||||
---|---|---|---|---|---|
None | 395 396 | 362 849 | 399 373 | 386 835 | 406 755 |
395 396 | 362 849 | 399 373 | 386 835 | 406 755 | |
395 548 | 362 995 | 399 463 | 386 943 | 406 825 | |
395 700 | 363 141 | 399 553 | 387 051 | 406 895 | |
395 700 | 363 141 | 399 553 | 387 051 | 406 895 | |
396 004 | 363 433 | 399 733 | 387 267 | 407 035 |
Model: GAT

Pattern set | Time | #Epochs | Time | #Epochs | Time | #Epochs | Time | #Epochs | Time | #Epochs
---|---|---|---|---|---|---|---|---|---|---
None | 1,62 | 109 | 2,83 | 117 | 1,54 | 125 | 0,95 | 101 | 10,40 | 92
| 1,52 | 107 | 2,67 | 85 | 1,72 | 145 | 1,08 | 102 | 11,01 | 89
| 1,18 | 107 | 1,94 | 80 | 1,62 | 149 | 0,90 | 102 | 10,23 | 90
| 1,23 | 106 | 2,30 | 84 | 1,68 | 143 | 0,92 | 99 | 10,68 | 91
| 1,53 | 102 | 1,97 | 82 | 1,89 | 153 | 0,94 | 99 | 10,80 | 90
| 1,62 | 105 | 1,96 | 82 | 1,95 | 157 | 0,97 | 100 | 10,25 | 91
Pattern set | |||||
---|---|---|---|---|---|
None | 78,83 ± 0,60 | 71,42 ± 1,38 | 85,90 ± 0,03 | 70,78 ± 0,19 | 86,15 ± 0,08 |
84,34 ± 0,09 | 61,54 ± 2,20 | 86,59 ± 0,02 | 84,75 ± 0,11 | 85,02 ± 0,20 | |
84,43 ± 0,40 | 63,40 ± 1,55 | 86,60 ± 0,02 | 84,51 ± 0,06 | 85,40 ± 0,28 | |
83,47 ± 0,11 | 64,18 ± 3,88 | 86,57 ± 0,02 | 83,73 ± 0,10 | 85,63 ± 0,22 | |
85,44 ± 0,24 | 81,29 ± 2,82 | 86,58 ± 0,02 | 85,85 ± 0,13 | 85,80 ± 0,20 | |
85,50 ± 0,23 | 82,49 ± 0,48 | 86,63 ± 0,03 | 85,88 ± 0,15 | 85,56 ± 0,33
Pattern set | |||||
---|---|---|---|---|---|
None | 394 632 | 362 117 | 398 921 | 386 291 | 406 403 |
394 784 | 362 263 | 399 011 | 386 399 | 406 473 | |
394 784 | 362 263 | 399 011 | 386 399 | 406 473 | |
394 784 | 362 263 | 399 011 | 386 399 | 406 473 | |
394 936 | 362 409 | 399 101 | 386 507 | 406 543 | |
395 088 | 362 555 | 399 191 | 386 615 | 406 613 |
Model: GAT

Pattern set | Time | Epochs | Time | Epochs | Time | Epochs | Time | Epochs | Time | Epochs
---|---|---|---|---|---|---|---|---|---|---
None | 1,96 | 87 | 3,41 | 102 | 1,68 | 116 | 0,77 | 103 | 10,32 | 101
| 0,97 | 97 | 2,58 | 80 | 1,42 | 107 | 0,69 | 105 | 9,12 | 95
| 0,90 | 90 | 2,68 | 80 | 1,46 | 106 | 0,67 | 95 | 9,47 | 94
| 0,89 | 95 | 2,36 | 80 | 1,26 | 100 | 0,58 | 98 | 9,14 | 99
| 2,11 | 91 | 3,62 | 98 | 1,68 | 108 | 0,86 | 97 | 9,50 | 87
| 1,02 | 91 | 3,26 | 94 | 1,48 | 109 | 0,76 | 102 | 8,84 | 88