
Graph embedding using multi-layer adjacent point merging model

Jianming Huang Graduate School of Fundamental Science and Engineering, WASEDA University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan (e-mail: [email protected])    Hiroyuki Kasai Department of Computer Science and Communication Engineering, WASEDA University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan (e-mail: [email protected])
Abstract

For graph classification tasks, many traditional kernel methods focus on measuring the similarity between graphs. These methods have achieved great success in resolving graph isomorphism problems. However, in some classification problems, the graph class depends not only on the topological similarity of the whole graph, but also on constituent subgraph patterns. To this end, we propose a novel graph embedding method using a multi-layer adjacent point merging model. This embedding method allows us to extract different subgraph patterns from training data. We then present a flexible loss function for feature selection, which enhances the robustness of our method for different classification problems. Finally, numerical evaluations demonstrate that our proposed method outperforms many state-of-the-art methods.

Published in
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 [1].

1 Introduction

Graph-structured data are used widely in various fields, such as chemoinformatics, bioinformatics, social networks, and computer vision [2, 3, 4, 5]. For graph classification tasks, much effort has been devoted to defining measures between graphs, many of which are well known as graph kernels. Graph kernels have been used widely for several decades and are still developing rapidly. They are kernel functions that compute the similarity between two graphs, and they have shown effective performance in graph classification tasks with machine learning algorithms. However, these graph kernels do not focus specifically on the classification problem itself; rather, they focus on resolving the graph and subgraph isomorphism problem, which evaluates how similar two graphs are. In some graph classification problems, the graph class may depend not only on the topological similarity of the whole graph, but also on subgraph patterns, which are equally important. Furthermore, the key subgraph features that strongly affect classification performance differ in each individual classification problem. Nevertheless, many existing methods seem to neglect these points. To this end, we propose a multi-layer adjacent point pattern embedding, which can automatically extract and select effective subgraph patterns from graphs for different classification problems. Our contributions can be summarized as described below:

  • We present a multi-layer adjacent point merging model, which can extract multi-granularity representations of subgraphs, i.e., simple-to-complex subgraphs.

  • We propose a flexible loss function for feature selection, which makes our proposed method robust to different key features required in each classification problem.

Hereinafter, we represent scalars as lower-case letters $(a, b, \ldots)$, and vectors as bold lower-case letters $(\bm{a}, \bm{b}, \ldots)$. We write $\mathbb{R}^{n}$ to denote the $n$-dimensional vector space. A graph is a pair $G=(\mathcal{V},\mathcal{E})$ consisting of a set of $n$ vertices (or nodes) $\mathcal{V}=\{v_1, v_2, \ldots, v_n\}$ and a set of $m$ edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. $G$ is an undirected graph if it includes only edges with no direction. The numbers of vertices and edges are, respectively, $|\mathcal{V}|$ and $|\mathcal{E}|$. If two vertices, say $v_i, v_j \in \mathcal{V}$, are connected by an edge $e$, then this edge is denoted as $e_{ij}$, and these two vertices are said to be adjacent or neighbors. We consider only undirected graphs with no self-loops.

2 Related Work

Graph classification has been studied for several decades. To compute the similarity between graphs in various data mining tasks, the random walk kernel [6] was developed for graph classification. This method is based on counting matching random walks, labeled or not, in two graphs. However, the random walk kernel suffers from a computational cost of $O(n^6)$ for comparing a pair of graphs in the graph product space, which is non-negligible, especially for large-scale graphs. To alleviate this difficulty, the subsequent Weisfeiler–Lehman graph kernel [7] has brought great success. It improves on the original Weisfeiler–Lehman test using multiple iterations in which neighbor patterns are aggregated.

In recent years, following the effective performance of optimal transport theory [8, 9] in the machine learning domain [10], graph kernel methods have also been improved greatly by combining them with optimal transport [11]. Recent research [12] presents a Wasserstein-based Weisfeiler–Lehman graph kernel (WWL), which maps the node embeddings of Weisfeiler–Lehman patterns to a feature space and computes kernel values using the Wasserstein distance between two point clouds in that space; it obtains better results than the original Weisfeiler–Lehman kernel. GOT [13] uses optimal transport differently: instead of generating walks or comparing vertex neighborhoods, it computes the Wasserstein distance between two normal distributions derived from graph Laplacian matrices. Another attractive work [14] points out that neither the Wasserstein distance nor the Gromov–Wasserstein distance can accommodate both the graph structure and the feature information. To resolve this difficulty, it proposes the Fused Gromov–Wasserstein (FGW) distance, which considers both the structural characteristics and the feature information of two graphs.

3 Multi-layer Adjacent Point Merging

3.1 Adjacent Point Merging

3.1.1 Adjacent Point Pattern (APP)

To extract subgraph features from a graph, we start from a simple subgraph pattern called an adjacent point pattern (APP). An APP is composed of two parts: (1) two vertices that are directly connected with each other; (2) the edge connecting these vertices. As shown in Figure 1, $v_i$ and $v_j$ denote two adjacent points, and $e_{ij}$ denotes their connecting edge. When both vertices and edges are assigned discrete labels, there can be different APPs for different combinations of the labels of $v_i$, $v_j$, and $e_{ij}$. Therefore, we exploit a perfect hash method [15] to assign each APP a distinguishable label. We define the APP as follows:

Definition 1 (Adjacent Point Pattern).

Given two vertices $v_i, v_j$ and the connecting edge $e_{ij}$ between $v_i$ and $v_j$, let $l: \mathcal{V} \to \Sigma$ denote a function that maps a vertex object $v$ to its categorical node label assigned from a finite label alphabet $\Sigma$, where $\mathcal{V}$ denotes the vertex set to which $v$ belongs. Furthermore, let $w: \mathcal{E} \to \Sigma$ be the edge label mapping function, where $\mathcal{E}$ denotes a certain edge set. The adjacent point pattern is defined as $F_{\rm APP}(v_i, e_{ij}, v_j) = \left(\{l(v_i), l(v_j)\}, w(e_{ij})\right)$, where $F_{\rm APP}: \mathcal{V} \times \mathcal{E} \times \mathcal{V} \to \mathbb{H}$ is the function that maps the adjacent vertices and the connecting edge to a Hilbert space $\mathbb{H}$ formed by the product of the vertex label set and the edge label. In the special case in which edge $e_{ij}$ has no label, the vertex-only adjacent point pattern is defined as $F_{\rm APP}(v_i, v_j) = \left(\{l(v_i), l(v_j)\}\right)$.

For label assignment, we use a perfect hash function $F_{\rm hash}: \mathbb{H} \to \Sigma$ such that, for two APPs $\bm{x}_1$ and $\bm{x}_2$, $F_{\rm hash}(\bm{x}_1) = F_{\rm hash}(\bm{x}_2)$ if and only if $\bm{x}_1 = \bm{x}_2$. We set $F_{\rm hash}(\bm{x}_i)$ as the adjacent point pattern label of $\bm{x}_i$.
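To make the labeling concrete, the following is a minimal Python sketch under the assumption of categorical labels; a plain dictionary stands in for the perfect hash function $F_{\rm hash}$ of [15], and the function name app_label is purely illustrative.

_pattern_table = {}  # maps each distinct APP to its adjacent point pattern label

def app_label(l_vi, l_vj, w_eij=None):
    # The vertex pair is unordered, so sort the two vertex labels;
    # pass w_eij=None for the vertex-only pattern F_APP(v_i, v_j).
    pattern = (tuple(sorted((l_vi, l_vj))), w_eij)
    if pattern not in _pattern_table:
        _pattern_table[pattern] = len(_pattern_table)  # assign a fresh label
    return _pattern_table[pattern]

# Example: app_label('C', 'O', 1) == app_label('O', 'C', 1) holds,
# whereas app_label('C', 'O', 2) receives a different label.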

Figure 1: The structure of an APP, which consists of two directly connected vertices $v_i, v_j$ and their connecting edge $e_{ij}$. An APP does not depend on the order of $v_i, v_j$, so $(v_i, e_{ij}, v_j)$ and $(v_j, e_{ij}, v_i)$ are equivalent. An APP is finally merged into a vertex and relabeled in the APM.
Figure 2: The sharing point $v_s(v_{ij}, v_{kl})$ of two APPs.

3.1.2 Adjacent Point Merging (APM)

To generate APPs in a labeled graph, we introduce the operation of adjacent point merging (APM). This operation merges every pair of vertices in a graph that are directly connected by an edge.

Definition 2 (Adjacent Point Merging).

Consider an undirected and connected graph $G=(\mathcal{V}, \mathcal{E})$ with a set of vertices $\mathcal{V}=\{v_i\}_{i=1}^N$ and a set of edges $\mathcal{E}=\{e_{ij}\}$ such that $|\mathcal{V}|=N$ and $|\mathcal{E}|=M$. Both vertices and edges of $G$ are assigned categorical labels, and $l: \mathcal{V} \to \Sigma$ and $w: \mathcal{E} \to \Sigma$ are the label mapping functions of vertices and edges, respectively. Each $v_j \in \mathcal{V}$ has $\mathcal{A}_j = \left\{(v_i, e_{ij}, v_j) : v_i, v_j \in \mathcal{V}, v_i \in \mathcal{N}(v_j), e_{ij} \in \mathcal{E}\right\}$, where $\mathcal{N}(v_j)$ denotes the neighborhood of $v_j$ and $e_{ij}$ is the connecting edge between $v_i$ and $v_j$. The operation of adjacent point merging then follows these steps: (1) For each $\mathcal{A}_j$, we create a new vertex set $\mathcal{V}'_j = \{v'_{ij}\}$, in which a single vertex $v'_{ij}$ represents a pair of adjacent vertices $(v_i, e_{ij}, v_j)$ in $\mathcal{A}_j$. (2) Generating $\mathcal{V}'_j$ for all $j \in [N]$, we obtain $\mathcal{V}'$ by merging them and removing identical vertices, i.e., $\mathcal{V}' = \bigcup_{j=1}^N \mathcal{V}'_j$. (3) The adjacency relationship between new vertices $v'_{ij} \in \mathcal{V}'$ is defined as follows: $v'_{ij} \in \mathcal{N}(v'_{kl})$ holds if $\exists v \in \{v_i, v_j\} \cap \{v_k, v_l\}$; we call such a $v$ the sharing point of $v_{ij}$ and $v_{kl}$ and write it $v_s(v_{ij}, v_{kl})$, as shown in Figure 2. We then have a new edge set $\mathcal{E}' = \{e'_{ij,kl}\}$ in which edge $e'_{ij,kl}$ directly connects vertices $v'_{ij}$ and $v'_{kl}$. (4) We define new label mapping functions of vertices and edges as $l': \mathcal{V}' \to \Sigma$ and $w': \mathcal{E}' \to \Sigma$, respectively, given by

$l'(v'_{ij}) = F_{\rm hash}(F_{\rm APP}(v_i, e_{ij}, v_j))$,   (1)
$w'(e'_{ij,kl}) = l(v_s(v_{ij}, v_{kl}))$.   (2)

(5) Using these new components, we create a new graph $G_A(G) = (\mathcal{V}', \mathcal{E}')$ with the new label mapping functions $l'$ and $w'$.
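The following is a minimal sketch of one APM operation under the same assumptions as the sketch above (it reuses the illustrative app_label function); the graph representation as label dictionaries is an assumption made for illustration.

def adjacent_point_merging(vertex_labels, edge_labels):
    # Steps (1)-(2): every edge (v_i, v_j) becomes one merged vertex v'_{ij},
    # labeled by the hash of its APP; duplicates cannot arise because each
    # unordered vertex pair appears only once in edge_labels.
    new_vertex_labels = {}
    for (vi, vj), w_eij in edge_labels.items():
        key = frozenset((vi, vj))
        new_vertex_labels[key] = app_label(vertex_labels[vi], vertex_labels[vj], w_eij)

    # Steps (3)-(4): two merged vertices are adjacent if they share an original
    # vertex v_s; the new edge is labeled by the original label l(v_s).
    new_edge_labels = {}
    merged = list(new_vertex_labels)
    for a in range(len(merged)):
        for b in range(a + 1, len(merged)):
            shared = merged[a] & merged[b]
            if shared:
                v_s = next(iter(shared))
                new_edge_labels[(merged[a], merged[b])] = vertex_labels[v_s]

    # Step (5): the transformed graph G_A(G) = (V', E') with labels l', w'.
    return new_vertex_labels, new_edge_labels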

3.2 Multi-layer Adjacent Point Merging

Through the APM operation elaborated in the previous section, we are able to extract and relabel a very simple subgraph pattern. Moreover, this operation can be applied iteratively. Therefore, we introduce a multi-layer structure of adjacent point merging. The benefit of this multi-layer structure is that it provides multi-granularity representations of subgraphs, i.e., simple-to-complex subgraphs, of an entire graph $G$. More specifically, given an undirected and connected graph $G=(\mathcal{V}, \mathcal{E})$ with categorical labels assigned to vertices and edges, one iteration of the multi-layer adjacent point merging is simply written as $G_{i+1} = G_A(G_i)$, where $G_i$ is the transformed graph after $i$ applications of APM. In particular, $G_0$ is equal to the original graph $G$. Figure 3 shows an example of 2-layer adjacent point merging, which is able to extract a 3-vertex subgraph pattern.
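Under the same illustrative assumptions, the iteration $G_{i+1} = G_A(G_i)$ can be sketched as:

def multi_layer_apm(vertex_labels, edge_labels, num_layers):
    # graphs[0] is the original graph G_0; graphs[i] is the transformed graph G_i.
    graphs = [(vertex_labels, edge_labels)]
    for _ in range(num_layers):
        vertex_labels, edge_labels = adjacent_point_merging(vertex_labels, edge_labels)
        graphs.append((vertex_labels, edge_labels))
    return graphs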

Figure 3: The structure of a 2-layer adjacent point merging. The numbers in the nodes denote the vertex indices of the input graph (layer 0). The graph in each layer is the result of performing APM on the previous layer's graph. Any vertex in the $i$-th layer ($i \geq 1$) represents a class of APP in the input graph.

4 Adjacent Point Pattern Embedding

4.1 Embedding using All Adjacent Point Pattern Labels

Based on the multi-layer APM structure, we propose the adjacent point pattern embedding (APPE) method to generate graph embeddings. In a $K$-layer APM, we can simply use the number of vertices that share the same label in the top-layer transformed graph $G_K$ as an element of our embedding. In this case, given a graph $G=(\mathcal{V}, \mathcal{E})$ with $l(\cdot)$ and $w(\cdot)$ as the label mapping functions of vertices and edges, respectively, we compute the graph embedding through the following steps: (1) We use the $K$-layer APM to transform $G$. In the $i$-th layer, we obtain a transformed graph $G_i = (\mathcal{V}_i, \mathcal{E}_i)$ and new label sets $\Sigma_{\mathcal{V}}^i, \Sigma_{\mathcal{E}}^i$ of vertices and edges (Equation (1)). (2) We use the output of the $K$-th layer to compute the final embedding. For each discrete label $L_j \in \Sigma_{\mathcal{V}}^K$, which represents a certain subgraph pattern in the original graph $G$, we count all the vertices with the same label $L_j$ in graph $G_K$, obtaining a vertex set $\mathcal{L}_j = \{v : l'(v) = L_j, v \in \mathcal{V}_K\}$, where $l': \mathcal{V}_K \to \Sigma_{\mathcal{V}}^K$ denotes the vertex label mapping function of $G_K$. Using all the $\mathcal{L}_j$, we compute the APPE of graph $G$ as:

$\bm{e}(G) = [|\mathcal{L}_1|, |\mathcal{L}_2|, \ldots, |\mathcal{L}_{|\Sigma_{\mathcal{V}}^K|}|] \in \mathbb{R}^{|\Sigma_{\mathcal{V}}^K|}$.   (3)
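Continuing the illustrative sketch, the embedding of Equation (3) amounts to counting label frequencies in the top-layer graph $G_K$. The ordered label alphabet $\Sigma_{\mathcal{V}}^K$ is assumed here to be collected over the whole training set so that all graphs share the same dimension ordering.

from collections import Counter

def appe_embedding(top_vertex_labels, label_alphabet):
    # top_vertex_labels: the vertex label dictionary of G_K for one graph;
    # label_alphabet: an ordered list of all labels in Sigma_V^K.
    counts = Counter(top_vertex_labels.values())
    return [counts.get(L, 0) for L in label_alphabet]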

4.2 Feature Selection

As the number of layers increases, the size of the top-layer vertex label set $\Sigma_{\mathcal{V}}^K$ may become too large, which makes the embedding too large. To prevent this, we adopt a feature selection strategy and use only a subset of the dimensions as our embedding. To enhance the robustness of our method, we propose a new loss function that evaluates how effective each feature is for the classification task.

For a $C$-class classification problem with $K$-layer APM, we maintain $C$ weight vectors $\{\bm{w}_i \in \mathbb{R}^{|\Sigma_{\mathcal{V}}^K|}\}_{i=1}^C$, where $|\Sigma_{\mathcal{V}}^K|$ is the size of $\bm{e}(G)$ in Equation (3). Each weight vector $\bm{w}_i$ corresponds to the $i$-th graph class, and each dimension of $\bm{w}_i$ corresponds to a dimension of $\bm{e}(G)$. Given the training graph set $\mathcal{G}$ and the class label mapping function $y: \mathcal{G} \to \Sigma_{\mathcal{G}}$, which maps a graph in $\mathcal{G}$ to an integer class label in $\Sigma_{\mathcal{G}}$, we perform the feature selection in the following steps:

(Step 1) We first update the weight vectors $\bm{w}_i$ by the formula $\bm{w}_i = \sum_{G \in \mathcal{G},\, y(G)=i} \bm{e}(G)$.

(Step 2) After updating all of the $\bm{w}_i$, we compute the final loss of each dimension, $\bm{w}_{\rm loss}$, as:

$\bm{w}_{\rm loss} = \max\left(F_{\rm loss}(1), F_{\rm loss}(2), \ldots, F_{\rm loss}(C)\right)$,   (4)

where the $j$-th dimension of $F_{\rm loss}(i)$ represents the loss of the $j$-th dimension for the $i$-th graph class, and $F_{\rm loss}(i)$ is given by $F_{\rm loss}(i) = -\frac{(\bm{w}_i - \sum_{j=1, j\neq i}^C \bm{w}_j)^2}{\sum_{j=1}^C \bm{w}_j}$. It should be noted that all calculations in this formula are element-wise. In $F_{\rm loss}(i)$, the first term of the numerator denotes the weight of each dimension for the $i$-th graph class, and the second term of the numerator represents the weight of each dimension for all graph classes except $i$. Therefore, the value of the numerator exactly evaluates how effective each dimension is for classifying the $i$-th graph class. The denominator is used for normalization. We also square the numerator to prevent the situation in which dimensions with very small weights obtain a small loss value. For example, assume that $\bm{e}(G)$ has only one dimension. Comparing the case with $\bm{w}_i = [1]$ and $\sum_{j=1, j\neq i}^C \bm{w}_j = [0]$ to the case with $\bm{w}_i = [100]$ and $\sum_{j=1, j\neq i}^C \bm{w}_j = [1]$, the former obtains a smaller $F_{\rm loss}(i)$ than the latter if the numerator is not squared. This is unreasonable because, in the second case, the dimension has a larger total weight $\sum_{i=1}^C \bm{w}_i$ than in the first.

(Step 3) We finally select the top-$D$ dimensions of $\bm{e}(G)$ that have the smallest loss in $\bm{w}_{\rm loss}$ as the final embedding: $\bm{e}_{\rm fin}(G) = \left[\bm{e}(G)(j) : \bm{w}_{\rm loss}(j) \leq {\rm sort}(\bm{w}_{\rm loss})(D)\right] \in \mathbb{R}^D$, where ${\rm sort}(\cdot)$ denotes sorting a vector in ascending order.
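A minimal NumPy sketch of Steps 1-3 is given below, assuming the APPE vectors of all training graphs are stacked in one array; the small constant eps is added only to avoid division by zero and is not part of the original formula.

import numpy as np

def select_features(embeddings, labels, num_classes, D):
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Step 1: class-wise weight vectors w_i = sum of e(G) over graphs of class i.
    w = np.stack([embeddings[labels == i].sum(axis=0) for i in range(num_classes)])

    # Step 2: F_loss(i) = -((w_i - sum_{j != i} w_j)^2) / (sum_j w_j), computed
    # element-wise; the element-wise maximum over classes gives w_loss (Eq. (4)).
    total = w.sum(axis=0)
    eps = 1e-12  # numerical guard, not part of the original formula
    f_loss = -((w - (total - w)) ** 2) / (total + eps)
    w_loss = f_loss.max(axis=0)

    # Step 3: keep the D dimensions with the smallest loss.
    selected = np.argsort(w_loss)[:D]
    return embeddings[:, selected], selected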

4.3 Computational Complexity

We present an analysis of the computational complexity in this section. For 1-layer APM, it is obvious that the number of vertices in the first-layer transformed graph $G_1$ is equal to the number of edges $|\mathcal{E}|$ of $G$. Therefore, the computational complexity for a single graph $G=(\mathcal{V}, \mathcal{E})$ is $O(|\mathcal{E}|)$. For 2-layer APM, the computational complexity is $O(\sum D^2 - |\mathcal{E}|)$, where $\sum D^2$ denotes the sum of the squared degrees of the vertices, which follows from the formula for the number of edges of the line graph [16].
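As a small worked check (a sketch based on the standard line-graph edge-count formula cited as [16], not taken from the paper): $G_1$ has one vertex per edge of $G$ and connects two of them when they share an endpoint, so its number of edges is
\[
|\mathcal{E}(G_1)| \;=\; \sum_i \binom{d_i}{2} \;=\; \tfrac{1}{2}\sum_i d_i^2 - |\mathcal{E}|.
\]
For example, a star graph with one center of degree $3$ and three leaves of degree $1$ gives $\sum_i d_i^2 = 9+1+1+1 = 12$ and $|\mathcal{E}| = 3$, hence $|\mathcal{E}(G_1)| = 6 - 3 = 3$; this quantity governs the cost of the second layer and matches the stated $O(\sum D^2 - |\mathcal{E}|)$ bound up to a constant factor.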

Table 1: Average classification accuracy on graph datasets ($D=100$)
METHOD MUTAG BZR COX2 PROTEINS Mutagenicity
GK [17] 86.19±6.68 79.77±2.48 78.16±1.17 72.05±4.09 59.92±1.77
RWK [6] 84.53±7.79 78.53±1.38 80.29±2.16 - 69.03±1.93
WWL [12] 79.70±6.91 88.39±3.88 79.65±4.50 74.03±5.01 80.86±2.25
FGW [14] 80.26±9.47 83.95±2.45 78.15±2.12 71.15±3.63 67.87±1.70
AWE [18] 83.56±3.50 82.45±4.66 79.00±3.88 67.38±2.75 73.76±1.96
1-layer APM (proposed) 87.74±4.79 86.64±3.06 80.09±2.80 75.38±4.61 77.91±1.49
2-layer APM (proposed) 88.80±5.57 85.16±3.21 82.01±2.16 74.12±4.84 81.50±1.66
Table 2: Accuracy under different loss computations ($D=10$)
METHOD BZR Mutagenicity
2-layer APM 86.65±4.34 75.92±1.94
2-layer APM (mean) 85.90±3.00 74.59±1.13

5 Numerical Evaluation

For numerical experiments, we evaluate 1-layer and 2-layer APM. For feature selection, we set the maximal dimension $D$ to $100$. We compare our proposed methods with five state-of-the-art methods: (1) traditional methods: the graphlet kernel (GK) [17] and the random walk kernel (RWK) [6]; for the implementation of all the $\mathcal{R}$-convolution kernels, we use the GraKeL Python library [19]; (2) recent methods: the Wasserstein Weisfeiler–Lehman kernel (WWL) [12], the Fused Gromov–Wasserstein kernel (FGW) [14], and Anonymous Walk Embeddings (AWE) [18]. The parameter $H$ in WWL is set to $4$. The shortest path matrix is used in FGW. The step size of AWE is set to $3$. For classification, we train a multi-class SVM classifier using the one-vs.-one approach and apply $10$ repetitions of nested cross-validation with a 10-fold inner cross-validation. For the SVM parameter adjustment, we apply a grid search over the SVM parameter $C$ within $\{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$. We then calculate the average accuracy and standard deviation of the classification results.
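The evaluation protocol can be sketched with scikit-learn as follows; this is an illustrative reconstruction rather than the authors' code, and the 10-fold outer split is an assumption.

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate(X, y, n_repeats=10):
    # X: selected APPE embeddings e_fin(G) for all graphs; y: graph class labels.
    # One-vs.-one SVM with C tuned by grid search inside a 10-fold inner CV.
    param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    accuracies = []
    for seed in range(n_repeats):
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        model = GridSearchCV(SVC(decision_function_shape="ovo"), param_grid, cv=inner)
        accuracies.extend(cross_val_score(model, X, y, cv=outer))
    return np.mean(accuracies), np.std(accuracies)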

We use several widely used real-world benchmark datasets. Among them, MUTAG [20] and Mutagenicity [21] include graphs with discrete labels on both vertices and edges. BZR, COX2 [22], and PROTEINS [23] include graphs with discrete and continuous vertex attributes. All of the datasets above are available in the TUD dataset collection [24]. Because our proposed method supports only graphs with discrete labels on vertices and/or edges, we remove the continuous attributes from the original graphs so that only discrete labels remain. Table 1 shows the average classification accuracies on the different graph datasets, where the top accuracy on each dataset is in bold. The results marked with "-" indicate that the dataset is not applicable to the corresponding method. Overall, our proposed methods outperform many state-of-the-art methods on all the datasets except BZR, on which our methods still give the second-best result, next to WWL. Table 2 reports a controlled experiment in which we replace Equation (4) with a mean weight $\bm{w}_{\rm loss} = \frac{1}{C}\sum_{i=1}^C \bm{w}_i$, which demonstrates the effectiveness of our loss computation. It should be noted that if $D$ is too large, the final embedding includes so many dimensions that it is hard to tell the difference between the loss computations; therefore, $D$ is set to $10$ in this experiment.

6 Conclusion

We have proposed a new method for computing graph embeddings using a multi-layer adjacent point merging model, which extracts subgraph patterns and utilizes a flexible loss function to select effective ones among them. The numerical experiments revealed that our proposed methods perform better than the others. As future work, we expect a more lightweight way to extract subgraphs in high-layer models, because the complexity of the current model increases rapidly as the number of layers increases.

References

  • [1] J. Huang and H. Kasai. Graph embedding using multi-layer adjacent point merging model. In ICASSP, 2021.
  • [2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. The Journal of Machine Learning Research, 11:1201–1242, 2010.
  • [3] N. M. Kriege, F. D. Johansson, and C. Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.
  • [4] R. Hashimoto and H. Kasai. Sequential semi-orthogonal multi-level NMF with negative residual reduction for network embedding. In ICASSP, 2020.
  • [5] M. Horie and H. Kasai. Consistency-aware and inconsistency-aware graph-based multi-view clustering. In EUSIPCO, 2020.
  • [6] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning theory and kernel machines, pages 129–143. Springer, 2003.
  • [7] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. The Journal of Machine Learning Research, 12:2539–2561, 2011.
  • [8] C. Villani. Optimal transport: Old and new. Springer, New York, 2008.
  • [9] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [10] H. Kasai. Multi-view Wasserstein discriminant analysis with entropic regularized Wasserstein distance. In ICASSP, 2020.
  • [11] J. Huang, Z. Fang, and H. Kasai. LCS graph kernel based on Wasserstein distance in longest common subsequence metric space. arXiv preprint arXiv:2012.03612, 2020.
  • [12] M. Togninalli, E. Ghisu, F. Llinares-López, B. Rieck, and K. Borgwardt. Wasserstein Weisfeiler-Lehman graph kernels. In NeurIPS, 2019.
  • [13] H. P. Maretic, M. El Gheche, G. Chierchia, and P. Frossard. GOT: an optimal transport framework for graph comparison. In NeurIPS, 2019.
  • [14] V. Titouan, N. Courty, R. Tavenard, and R. Flamary. Optimal transport for structured data with application on graphs. In ICML, 2019.
  • [15] R. J. Cichelli. Minimal perfect hash functions made simple. Communications of the ACM, 23(1):17–19, 1980.
  • [16] D. H. Younger. Graph Theory (Frank Harary). SIAM Review, 14(2):350, 1972.
  • [17] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, 2009.
  • [18] S. Ivanov and E. Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.
  • [19] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgianis. GraKeL: A Graph Kernel Library in Python. Journal of Machine Learning Research, 21:1–5, 2020.
  • [20] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
  • [21] K. Riesen and H. Bunke. IAM graph database repository for graph based pattern recognition and machine learning. In SPR and SSPR, 2008.
  • [22] J. J. Sutherland, L. A. O'Brien, and D. F. Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences, 43(6):1906–1915, 2003.
  • [23] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
  • [24] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.