Graph embedding using multi-layer adjacent point merging model
Abstract
For graph classification tasks, many traditional kernel methods focus on measuring the similarity between graphs. These methods have achieved great success in resolving graph isomorphism problems. However, in some classification problems, the graph class depends not only on the topological similarity of the whole graph, but also on constituent subgraph patterns. To this end, we propose a novel graph embedding method using a multi-layer adjacent point merging model. This embedding method allows us to extract different subgraph patterns from the training data. We then present a flexible loss function for feature selection, which enhances the robustness of our method across different classification problems. Finally, numerical evaluations demonstrate that our proposed method outperforms many state-of-the-art methods.
Published in
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 [1].
1 Introduction
Graph-structured data are used widely in various fields, such as chemoinformatics, bioinformatics, social networks, and computer vision [2, 3, 4, 5]. For graph-structured data classification tasks, much effort has been devoted to defining a measure between graphs; some of these measures are well known as graph kernels. Graph kernels have been used widely for several decades and are still developing rapidly. They are kernel functions that compute the similarity between two graphs, and they have shown effective performance in graph classification tasks with machine learning algorithms. However, these graph kernels do not focus specifically on the classification problem itself, but rather on resolving the graph and subgraph isomorphism problem, which aims to evaluate how similar two graphs are. In some graph classification problems, the graph class may depend not only on the topological similarity of the whole graph, but also on subgraph patterns, which are equally important. Furthermore, the key subgraph features that have a strong impact on classification performance differ from one classification problem to another. Nevertheless, many existing methods seem to neglect these points. To this end, we propose a multi-layer adjacent point pattern embedding, which can extract and select effective subgraph patterns from graphs automatically for different classification problems. Our contributions can be summarized as follows:
- We present a multi-layer adjacent point merging model, which can extract multi-granularity representations of subgraphs, i.e., simple-to-complex subgraphs.
- We propose a flexible loss function for feature selection, which makes our proposed method robust to the different key features required in each classification problem.
Hereinafter, we represent scalars as lower-case letters (e.g., $a$) and vectors as bold lower-case letters (e.g., $\mathbf{a}$). We write $\mathbf{a} \in \mathbb{R}^{d}$ to denote a $d$-dimensional vector. A graph $G = (V, E)$ is a pair consisting of a set of vertices (or nodes) $V$ and a set of edges $E$. $G$ is an undirected graph if it includes only edges with no direction. The numbers of vertices and edges are, respectively, $|V|$ and $|E|$. If two vertices, say $v_i, v_j \in V$, are connected by an edge $e$, then this edge is denoted as $e_{ij}$. These two vertices are said to be adjacent or neighbors. We consider only undirected graphs with no self-loops.
2 Related Work
Graph classification has been studied for several decades. To compute the similarity between graphs in various data mining tasks, the random walk kernel [6] was developed for graph classification. This method is based on counting matching random walks, labeled or not, between two graphs. However, the random walk kernel has a computational cost of $O(n^{6})$, where $n$ is the number of vertices, for comparing a pair of graphs in the graph product space, which is non-negligible, especially for large-scale graphs. To resolve this difficulty, the subsequent Weisfeiler–Lehman graph kernel [7] brought great success. It adapts the original Weisfeiler–Lehman test into a multi-iteration scheme in which neighbor patterns are aggregated.
In recent years, following the effective performance of optimal transport theory [8, 9] in the machine learning domain [10], graph kernel methods have also been improved greatly when combined with optimal transport [11]. Recent research [12] presents a Wasserstein-based Weisfeiler–Lehman graph kernel (WWL), which maps the node embeddings of Weisfeiler–Lehman patterns to a feature space and computes kernel values using the Wasserstein distance between the two point clouds in that feature space; it obtains better results than the original Weisfeiler–Lehman kernel. GOT [13] uses optimal transport differently, computing the Wasserstein distance between two normal distributions derived from graph Laplacian matrices instead of generating walks or comparing vertex neighbors. Another attractive work [14] points out that neither the Wasserstein distance nor the Gromov–Wasserstein distance can accommodate both the graph structure and the feature information. To resolve this difficulty, it proposes the Fused Gromov–Wasserstein (FGW) distance, which considers both the structural characteristics and the feature information of two graphs.
3 Multi-layer Adjacent Point Merging
3.1 Adjacent Point Merging
3.1.1 Adjacent Point Pattern (APP)
To extract subgraph features from a graph, we start from a simple subgraph pattern called the adjacent point pattern (APP). An APP is composed of two parts: (1) two vertices that are directly connected with each other; (2) the connecting edge between these vertices. As shown in Figure 2, $v_i$ and $v_j$ denote two adjacent points, and $e_{ij}$ denotes their connecting edge. When both the vertices and the edge are assigned discrete labels, there can be many different APPs corresponding to different combinations of the labels of $v_i$, $v_j$, and $e_{ij}$. Therefore, we exploit a perfect hash method [15] to assign each APP a distinguishable label. We define the APP as follows:
Definition 1 (Adjacent Point Pattern).
Given two vertices $v_i, v_j \in V$ and the connecting edge $e_{ij}$ between $v_i$ and $v_j$, let $\ell_V : V \to \Sigma_V$ denote a function that maps a vertex object $v$ to its categorical node label assigned from a finite label alphabet $\Sigma_V$, where $V$ denotes a certain vertex set to which $v$ belongs. Furthermore, let $\ell_E : E \to \Sigma_E$ be the edge label mapping function, where $E$ denotes a certain edge set. The adjacent point pattern is defined as $f(v_i, e_{ij}, v_j) = \big(\{\ell_V(v_i), \ell_V(v_j)\}, \ell_E(e_{ij})\big)$, where $f$ is the function that maps the adjacent vertices and the connecting edge to the product of the vertex label set and the edge label. In the special case in which the edge has no label, the vertex-only adjacent point pattern is defined as $f(v_i, v_j) = \{\ell_V(v_i), \ell_V(v_j)\}$.
For label assignment, we use a perfect hash function $h$ such that, for two APPs $p_1$ and $p_2$, $h(p_1) = h(p_2)$ if and only if $p_1 = p_2$. We set $h\big(f(v_i, e_{ij}, v_j)\big)$ as the adjacent point pattern label of the pair $(v_i, v_j)$.
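As a concrete illustration, the following sketch computes such a label in Python; the data structures and the function name `app_label` are hypothetical, and a canonical tuple stands in for the perfect hash of [15], since distinct label combinations already yield distinct tuples.

```python
from typing import Hashable, Tuple

def app_label(v_label: Hashable, e_label: Hashable, u_label: Hashable) -> Tuple[str, str, str]:
    """Canonical label of an adjacent point pattern (APP).

    Sorting the two endpoint labels makes the pattern independent of the
    order in which the adjacent vertices are listed; distinct label
    combinations give distinct tuples, so the tuple acts as a collision-free
    label (a perfect hash would simply map it to an integer).
    """
    a, b = sorted([str(v_label), str(u_label)])
    return (a, str(e_label), b)

# Two orderings of the same adjacent pair produce the same APP label.
assert app_label("C", "single", "N") == app_label("N", "single", "C")
```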


3.1.2 Adjacent Point Merging (APM)
To generate APPs in a labeled graph, we introduce an operation called adjacent point merging (APM). This operation merges every pair of vertices in a graph that are directly connected through an edge.
Definition 2 (Adjacent Point Merging).
Given an undirected and connected graph $G = (V, E)$ with a set of vertices $V$ and a set of edges $E$ such that $|V| = n$ and $|E| = m$, both the vertices and the edges in $G$ are assigned categorical labels, and $\ell_V$ and $\ell_E$ are the label mapping functions of vertices and edges, respectively. Each $v_i \in V$ has a neighborhood $\mathcal{N}(v_i)$, where $\mathcal{N}(v_i)$ denotes the set of vertices adjacent to $v_i$, and $e_{ij}$ is the connecting edge between $v_i$ and $v_j \in \mathcal{N}(v_i)$. The operation of adjacent point merging then follows these steps: (1) For each $v_i \in V$ and each $v_j \in \mathcal{N}(v_i)$, we create a new vertex $u_{ij}$ that represents the pair of adjacent vertices $(v_i, v_j)$ in $G$; (2) generating $u_{ij}$ for all adjacent pairs, we obtain the new vertex set $V'$ by removing identical vertices, i.e., $u_{ij} = u_{ji}$; (3) the new adjacency relationship between new vertices is defined as follows: $u_{ij}$ and $u_{ik}$ are adjacent if their original pairs share the vertex $v_i$; we call $v_i$ a sharing point of $u_{ij}$ and $u_{ik}$, as shown in Figure 2. We then have a new edge set $E'$ in which an edge connects $u_{ij}$ and $u_{ik}$ directly; (4) we define the new label mapping functions of vertices and edges as $\ell'_V$ and $\ell'_E$, respectively, given by
$$\ell'_V(u_{ij}) = h\big(f(v_i, e_{ij}, v_j)\big), \qquad (1)$$
$$\ell'_E(u_{ij}, u_{ik}) = \ell_V(v_i), \qquad (2)$$
where $h$ is the perfect hash function of Definition 1 and $v_i$ is the sharing point of $u_{ij}$ and $u_{ik}$.
(5) Using these new components, we create a new graph $G' = (V', E')$ with the new label mapping functions $\ell'_V$ and $\ell'_E$.
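A minimal sketch of one APM step is given below, assuming a dictionary-based graph representation (vertex labels keyed by vertex identifier, edge labels keyed by `frozenset` endpoint pairs); the function name `apm_step` is hypothetical, the tuple-based APP label from the previous sketch replaces the perfect hash, and the rule that the new edge inherits the label of the sharing point follows the reconstruction of Equation (2).

```python
from itertools import combinations

def apm_step(v_labels, e_labels):
    """One adjacent point merging (APM) step.

    v_labels maps each vertex to its label; e_labels maps each edge,
    represented as frozenset({u, v}), to its label.  Every original edge
    becomes a merged vertex labeled by its APP (steps (1)-(2)); two merged
    vertices are adjacent iff their original edges share an endpoint, the
    "sharing point", whose label becomes the new edge label (step (3) and
    the assumed form of Eq. (2)).
    """
    new_v_labels = {}
    for edge, e_lab in e_labels.items():
        u, v = tuple(edge)
        a, b = sorted([str(v_labels[u]), str(v_labels[v])])
        new_v_labels[edge] = (a, str(e_lab), b)              # APP label, cf. Eq. (1)

    new_e_labels = {}
    for e1, e2 in combinations(new_v_labels, 2):
        shared = e1 & e2                                      # sharing point, if any
        if shared:
            (s,) = shared
            new_e_labels[frozenset({e1, e2})] = v_labels[s]   # assumed form of Eq. (2)
    return new_v_labels, new_e_labels
```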
3.2 Multi-layer Adjacent Point Merging
Through the APM operation elaborated in the previous section, we are able to extract and relabel a very simple subgraph pattern. Moreover, this operation can be applied iteratively. Therefore, we introduce a multi-layer structure of adjacent point merging. The benefit of this multi-layer structure is to provide multi-granularity representations of subgraphs, i.e., simple-to-complex subgraphs, of an entire graph $G$. More specifically, given an undirected and connected graph $G$ with categorical labels assigned to vertices and edges, one iteration of the multi-layer adjacent point merging is simply written as $G^{(t+1)} = \mathrm{APM}(G^{(t)})$, where $G^{(t)}$ is the transformed graph after $t$ applications of APM. In particular, $G^{(0)}$ is equal to the original graph $G$. As shown in Figure 3, a 2-layer adjacent point merging is able to extract a 3-vertex subgraph pattern.
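Iterating the step above yields the multi-layer transform. The short sketch below reuses the hypothetical `apm_step` from the previous block and runs it on a toy labeled path, mirroring the 3-vertex example of Figure 3; all names and data here are illustrative assumptions.

```python
def multi_layer_apm(v_labels, e_labels, n_layers):
    """Apply the APM step n_layers times; layer 0 is the original graph G."""
    for _ in range(n_layers):
        v_labels, e_labels = apm_step(v_labels, e_labels)  # sketch above
    return v_labels, e_labels

# Toy 2-layer example on a labeled path A -s- B -d- C: the second layer
# produces a single vertex whose label encodes the whole 3-vertex pattern.
v = {1: "A", 2: "B", 3: "C"}
e = {frozenset({1, 2}): "s", frozenset({2, 3}): "d"}
v2, e2 = multi_layer_apm(v, e, n_layers=2)
assert len(v2) == 1 and len(e2) == 0
```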

4 Adjacent Point Pattern Embedding
4.1 Embedding Using All Adjacent Point Pattern Labels
Based on the multi-layer APM structure, we propose the adjacent point pattern embedding (APPE) method to generate graph embeddings. In an $n$-layer APM, we can simply use the number of vertices that share the same label in the top-layer transformed graph $G^{(n)}$ as an element of our embedding. In this case, given a graph $G$ with $\ell_V$ and $\ell_E$ as the label mapping functions of vertices and edges, respectively, we compute the graph embedding through the following steps: (1) We use the $n$-layer APM to transform $G$; in the $t$-th layer, we obtain a transformed graph $G^{(t)}$ and new vertex and edge label mappings (Equations (1) and (2)). (2) We use the output of the $n$-th layer to compute the final embedding. For each discrete label $c \in \Sigma_V^{(n)}$, which represents a certain subgraph pattern in the original graph $G$, we count all the vertices with the same label $c$ in the graph $G^{(n)}$. Then we obtain a vertex set $V_c = \{v \in V^{(n)} : \ell_V^{(n)}(v) = c\}$, where $\ell_V^{(n)}$ denotes the vertex label mapping function of $G^{(n)}$. Using all the $V_c$, we compute our APPE of graph $G$ as:
$$\mathbf{x}(G) = \big[\,|V_{c_1}|, |V_{c_2}|, \ldots, |V_{c_d}|\,\big]^{\top}, \quad d = |\Sigma_V^{(n)}|. \qquad (3)$$
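A sketch of this counting step under the dictionary representation used above is shown below; `pattern_index`, which fixes the dimension assigned to each pattern label (typically collected from the training graphs), is an assumed helper structure.

```python
from collections import Counter
from typing import Dict, Hashable, List

def appe_embedding(top_v_labels: Dict[Hashable, Hashable],
                   pattern_index: Dict[Hashable, int]) -> List[int]:
    """Count, for every pattern label c, the vertices of the top-layer graph
    G^(n) carrying that label (the sets V_c in Eq. (3)); pattern_index maps
    each pattern label to its embedding dimension."""
    counts = Counter(top_v_labels.values())
    vec = [0] * len(pattern_index)
    for label, cnt in counts.items():
        if label in pattern_index:      # patterns unseen at training time are dropped
            vec[pattern_index[label]] = cnt
    return vec
```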
4.2 Feature Selection
As the number of layers increases, the size of the top-layer vertex label set $\Sigma_V^{(n)}$ may become very large, which makes the embedding excessively large as well. To prevent this, we adopt a feature selection strategy and use only a subset of the dimensions as our embedding. To enhance the robustness of our method, we propose a new loss function that evaluates how effective each feature is for the classification task.
For a $K$-class classification problem with an $n$-layer APM, we maintain $K$ weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_K \in \mathbb{R}^{d}$, where $d$ is the size of $\mathbf{x}(G)$ in Equation (3). Each weight vector $\mathbf{w}_k$ corresponds to the $k$-th graph class, and each dimension of $\mathbf{w}_k$ corresponds to a dimension of $\mathbf{x}(G)$. Given the training graph set $\mathcal{D}$ and the class label mapping function $y$, which maps a graph in $\mathcal{D}$ to an integer class label $k \in \{1, \ldots, K\}$, we perform the feature selection through the following steps:
(Step 1) We first update the weight vectors with the training embeddings: for each graph $G \in \mathcal{D}$ with class label $y(G) = k$, we set $\mathbf{w}_k \leftarrow \mathbf{w}_k + \mathbf{x}(G)$.
(Step 2) After updating all of the $\mathbf{w}_k$, we compute the loss of each dimension as:
$$\boldsymbol{l}_k = -\,\frac{\Big(\mathbf{w}_k - \frac{1}{K-1}\sum_{j \neq k} \mathbf{w}_j\Big)^{2}}{\sum_{j=1}^{K} \mathbf{w}_j}, \qquad (4)$$
where the $i$-th dimension of $\boldsymbol{l}_k$ represents the loss of the $i$-th dimension for the $k$-th graph class. It should be noted that all calculations in the formula are element-wise. In Equation (4), the first term of the numerator denotes the weight of each dimension for the $k$-th graph class, and the second term represents the averaged weight of each dimension over the graph classes other than $k$. Therefore, the numerator evaluates exactly how effective each dimension is for classifying the $k$-th graph class, and the denominator is used for normalization. We square the numerator to prevent dimensions with very small weights from obtaining a small loss value: comparing a dimension whose class weights are small but relatively well separated with a dimension whose class weights are larger in total, the former would obtain a smaller loss than the latter without squaring the numerator, which is unreasonable because the latter dimension carries more weight in total.
(Step 3) We finally select the top-$M$ dimensions that have the least loss in $\boldsymbol{l} = \sum_{k=1}^{K} \boldsymbol{l}_k$ as the final embedding: $\hat{\mathbf{x}}(G) = \mathbf{x}(G)\big[\mathrm{argsort}(\boldsymbol{l})[1{:}M]\big]$, where $\mathrm{argsort}(\cdot)$ sorts a vector in ascending order and returns the corresponding indices.
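Because Equation (4) is reconstructed above from the textual description, the following sketch should be read as one plausible realization of Steps 1–3 rather than the authors' exact procedure: weights are accumulated per class, each class is contrasted against the average of the others with a squared, normalized, negated numerator, and the $M$ lowest-loss dimensions are kept. The function name, array shapes, and the small constant added to the denominator are assumptions.

```python
import numpy as np

def select_features(X, y, n_classes, m):
    """Feature selection sketch over an embedding matrix X of shape
    (n_graphs, d) with integer class labels y; returns the indices of the
    m kept dimensions."""
    X = np.asarray(X, dtype=float)
    W = np.zeros((n_classes, X.shape[1]))
    for x, k in zip(X, y):                                    # Step 1: per-class weights
        W[k] += x

    total = W.sum(axis=0) + 1e-12                             # avoids division by zero
    losses = np.zeros_like(W)
    for k in range(n_classes):                                # Step 2: per-class loss, cf. Eq. (4)
        others = (W.sum(axis=0) - W[k]) / max(n_classes - 1, 1)
        losses[k] = -((W[k] - others) ** 2) / total

    return np.argsort(losses.sum(axis=0))[:m]                 # Step 3: m smallest losses
```

A training matrix would then be reduced as `X[:, select_features(X, y, K, M)]` before the classifier training described in Section 5.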
4.3 Computational Complexity
We present the analysis of computational complexity in this section. For the 1-layer APM, the number of vertices in the first-layer transformed graph $G^{(1)}$ is equal to the number of edges of $G$; therefore, the computational complexity for a single graph is $O(|E|)$. For the 2-layer APM, the computational complexity is $O\big(\sum_{v \in V} d(v)^{2}\big)$, where $\sum_{v \in V} d(v)^{2}$ is the sum of the squared degrees of the vertices; this follows from the formula for the number of edges of the line graph [16].
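For concreteness, the small sketch below (assuming a plain edge-list input) computes the number of merged vertices that the second layer has to create, using the line-graph edge count $\tfrac{1}{2}\sum_{v} d(v)^{2} - |E|$, which is of the order $O\big(\sum_{v} d(v)^{2}\big)$ stated above.

```python
from collections import Counter

def second_layer_vertices(edges):
    """Number of merged vertices produced by the second APM layer, i.e. the
    number of edges of the line graph: (1/2) * sum_v d(v)^2 - |E|."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(d * d for d in deg.values()) // 2 - len(edges)

# Path 1-2-3-4: degrees (1, 2, 2, 1), so (1 + 4 + 4 + 1) / 2 - 3 = 2.
assert second_layer_vertices([(1, 2), (2, 3), (3, 4)]) == 2
```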
Table 1: Average classification accuracy (%) with standard deviation; the best result on each dataset is in bold.

| METHOD | MUTAG | BZR | COX2 | PROTEINS | Mutagenicity |
|---|---|---|---|---|---|
| GK [17] | 86.19±6.68 | 79.77±2.48 | 78.16±1.17 | 72.05±4.09 | 59.92±1.77 |
| RWK [6] | 84.53±7.79 | 78.53±1.38 | 80.29±2.16 | - | 69.03±1.93 |
| WWL [12] | 79.70±6.91 | **88.39±3.88** | 79.65±4.50 | 74.03±5.01 | 80.86±2.25 |
| FGW [14] | 80.26±9.47 | 83.95±2.45 | 78.15±2.12 | 71.15±3.63 | 67.87±1.70 |
| AWE [18] | 83.56±3.50 | 82.45±4.66 | 79.00±3.88 | 67.38±2.75 | 73.76±1.96 |
| 1-layer APM (proposed) | 87.74±4.79 | 86.64±3.06 | 80.09±2.80 | **75.38±4.61** | 77.91±1.49 |
| 2-layer APM (proposed) | **88.80±5.57** | 85.16±3.21 | **82.01±2.16** | 74.12±4.84 | **81.50±1.66** |
Table 2: Accuracy (%) of the 2-layer APM with the proposed loss versus a mean-weight baseline for feature selection.

| METHOD | BZR | Mutagenicity |
|---|---|---|
| 2-layer APM | 86.65±4.34 | 75.92±1.94 |
| 2-layer APM (mean) | 85.90±3.00 | 74.59±1.13 |
5 Numerical Evaluation
For the numerical experiments, we evaluate the 1-layer and 2-layer APM. For feature selection, we set an upper bound $M$ on the embedding dimension. We compare our proposed methods with five state-of-the-art methods: (1) traditional methods, namely the Graphlet Kernel (GK) [17] and the Random Walk Kernel (RWK) [6]; for the implementation of all the $\mathcal{R}$-convolution kernels, we use the GraKeL Python library [19]; (2) recent methods, namely the Wasserstein Weisfeiler–Lehman Kernel (WWL) [12], the Fused Gromov–Wasserstein Kernel (FGW) [14], and Anonymous Walk Embeddings (AWE) [18]. The iteration parameter of WWL and the step size of AWE are fixed, and the shortest-path matrix is used in FGW. For classification, we train a multi-class SVM classifier using the one-vs.-one approach and apply repeated nested cross-validation with a 10-fold inner cross-validation. For parameter adjustment of the SVM, we apply a grid search over the SVM regularization parameter. We then calculate the average accuracy and standard deviation.
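The evaluation protocol can be sketched with scikit-learn as follows; the linear kernel, the grid of cost values, and the number of repetitions are assumptions standing in for the settings elided above, and `X`, `y` denote the APPE embeddings and class labels.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate(X, y, n_repeats=10, seed=0):
    """Repeated nested cross-validation: an outer 10-fold split scores an SVM
    whose cost parameter C is tuned by an inner 10-fold grid search (SVC
    resolves multi-class problems with a one-vs-one scheme internally)."""
    scores = []
    for r in range(n_repeats):
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + r)
        outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + r)
        clf = GridSearchCV(SVC(kernel="linear"),
                           {"C": [10.0 ** k for k in range(-3, 4)]}, cv=inner)
        scores.extend(cross_val_score(clf, X, y, cv=outer))
    return float(np.mean(scores)), float(np.std(scores))
```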
We use several widely used real-world benchmark datasets. Among them, MUTAG [20] and Mutagenicity [21] include graphs with discrete labels on both vertices and edges; BZR, COX2 [22], and PROTEINS [23] include graphs with discrete and continuous vertex attributes. All of the datasets above are available in the TUD dataset collection [24]. Because our proposed method only supports graphs with discrete labels on vertices and/or edges, we remove the continuous attributes from the original graphs so that only discrete labels remain. Table 1 shows the average classification accuracies on the different graph datasets, where the top accuracy for each dataset is shown in bold. Results marked with "-" indicate that the dataset is not applicable to the corresponding method. Overall, our proposed methods outperform many state-of-the-art methods on all datasets except BZR, where our method still gives the second-best result, next to WWL. Table 2 reports a condition-controlled experiment in which the loss of Equation (4) is replaced by a simple mean weight, which shows the effectiveness of our loss computation. It should be noted that if $M$ is too large, the final embedding includes excessive dimensions, making it hard to tell the difference between the two loss computations; therefore, $M$ is set to a smaller value in this experiment.
6 Conclusion
We have proposed a new method for computing graph embeddings using a multi-layer adjacent point merging model, which extracts subgraph patterns and utilizes a flexible loss function to select effective ones. The numerical experiments revealed that our proposed methods perform better than existing methods. As future work, we expect a more lightweight way of extracting subgraphs in higher-layer models, because the complexity of the current model increases rapidly as the number of layers increases.
References
- [1] J. Huang and H. Kasai. Graph embedding using multi-layer adjacent point merging model. In ICASSP, 2021.
- [2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. The Journal of Machine Learning Research, 11:1201–1242, 2010.
- [3] N. M. Kriege, F. D. Johansson, and C. Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.
- [4] R. Hashimoto and H. Kasai. Sequential semi-orthogonal multi-level nmf with negative residual reduction for network embedding. In ICASSP, 2020.
- [5] M. Horie and H. Kasai. Consistency-aware and inconsistency-aware graph-based multi-view clustering. In EUSIPCO, 2020.
- [6] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning theory and kernel machines, pages 129–143. Springer, 2003.
- [7] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler–Lehman graph kernels. The Journal of Machine Learning Research, 12:2539–2561, 2011.
- [8] C. Villani. Optimal transport: Old and new. Springer, New York, 2008.
- [9] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
- [10] H. Kasai. Multi-view Wasserstein discriminant analysis with entropic regularized Wasserstein distance. In ICASSP, 2020.
- [11] J. Huang, Z. Fang, and H. Kasai. LCS graph kernel based on Wasserstein distance in longest common subsequence metric space. arXiv preprint arXiv:2012.03612, 2020.
- [12] M. Togninalli, E. Ghisu, F. Llinares-López, B. Rieck, and K. Borgwardt. Wasserstein Weisfeiler–Lehman graph kernels. In NeurIPS, 2019.
- [13] H. P. Maretic, M. El Gheche, G. Chierchia, and P. Frossard. GOT: an optimal transport framework for graph comparison. In NeurIPS, 2019.
- [14] V. Titouan, N. Courty, R. Tavenard, and R. Flamary. Optimal transport for structured data with application on graphs. In ICML, 2019.
- [15] R. J. Cichelli. Minimal perfect hash functions made simple. Communications of the ACM, 23(1):17–19, 1980.
- [16] D. H. Younger. Graph Theory (Frank Harary). SIAM Review, 14(2):350, 1972.
- [17] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, 2009.
- [18] S. Ivanov and E. Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.
- [19] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgianis. GraKeL: A Graph Kernel Library in Python. Journal of Machine Learning Research, 21:1–5, 2020.
- [20] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure–activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
- [21] K. Riesen and H. Bunke. IAM graph database repository for graph based pattern recognition and machine learning. In SPR and SSPR, 2008.
- [22] J. J. Sutherland, L. A. O'Brien, and D. F. Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure–activity relationships. Journal of Chemical Information and Computer Sciences, 43(6):1906–1915, 2003.
- [23] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
- [24] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.