Learning Attributed Graphlets:
Predictive Graph Mining by Graphlets with Trainable Attribute
Abstract
The graph classification problem has been widely studied; however, achieving an interpretable model with high predictive performance remains a challenging issue. This paper proposes an interpretable classification algorithm for attributed graph data, called LAGRA (Learning Attributed GRAphlets). LAGRA learns importance weights for small attributed subgraphs, called attributed graphlets (AGs), while simultaneously optimizing their attribute vectors. This enables us to obtain a combination of subgraph structures and their attribute vectors that strongly contributes to discriminating different classes. A significant characteristic of LAGRA is that all the subgraph structures in the training dataset can be considered as candidate structures of AGs. This approach can exhaustively explore all the potentially important subgraphs, but obviously, a naïve implementation can require a prohibitive amount of computation. To mitigate this issue, we propose an efficient pruning strategy that combines the proximal gradient descent and a graph mining tree search. Our pruning strategy ensures that the quality of the solution is maintained compared to the result without pruning. We empirically demonstrate that LAGRA has prediction performance superior or comparable to standard existing algorithms, including graph neural networks, while using only a small number of AGs in an interpretable manner.
1 Introduction
Prediction problems with a graph input, such as graph classification problems, have been widely studied in the data science community. A graph representation is useful for capturing structural data, and graph-based machine learning algorithms have been applied to a variety of application problems such as chemical composition analysis [1, 2] and crystal structure analysis [3, 4]. In real-world datasets, graphs often have node attributes given as continuous value vectors (note that we only focus on node attributes throughout the paper, but the discussion is the same for edge attributes). For example, a graph created from a chemical composition can have the three-dimensional position of each atom as an attribute vector in addition to a categorical label such as the atomic species. In this paper, we consider building an interpretable prediction model for a graph classification problem in which an input graph has continuous attribute vectors. As we will see in Section 3, this setting has not been widely studied despite its practical importance.
Our framework can identify important small subgraphs, called graphlets, in which each node has an attribute vector. Note that we use the term graphlet simply to denote a small connected subgraph [5], though in some papers it only indicates induced subgraphs [6]. Figure 1 shows an illustration of our prediction model. In the figure, the output of the prediction model for an input attributed graph $G$ is defined through a linear combination of attributed graphlets (AGs) $H_j$, each of which is weighted by a parameter $\beta_{H_j}$. The function $x_{H_j}(G)$ evaluates a matching score between $G$ and an AG $H_j$, in the sense of how precisely $G$ contains $H_j$. We apply a sparse regularization to the parameter vector $\boldsymbol{\beta}$, by which a small number of AGs important for classification can be obtained, i.e., an AG with non-zero $\beta_{H_j}$ (in particular, one with large $|\beta_{H_j}|$) can be regarded as a discriminative AG.


Important subgraphs and attribute vectors are usually unknown beforehand. The basic strategy of our proposed method, called LAGRA (Learning Attributed GRAphlets), is as follows:
- To explore potentially important substructures of graphs, i.e., subgraphs, LAGRA uses graph mining, by which all the subgraphs in the given dataset, up to a given maximum graph size, can be considered.
- The continuous attributes of AGs are optimized as trainable parameters.
Figure 2 shows an example of AGs identified by LAGRA, in which only two AGs clearly separate the two classes (see Section 4.2 for details). Since the number of possible subgraphs is quite large and an attribute vector exists for every node of every subgraph, a naïve implementation becomes computationally intractable. For efficient optimization, we employ a block coordinate update [7] based approach in which $\boldsymbol{\beta}$ (a vector containing the coefficients $\beta_H$), the bias term $b$, and the attribute vectors are alternately updated. In the alternate update of $\boldsymbol{\beta}$, we apply the proximal gradient descent [8, 9], which is known as an effective algorithm for optimizing sparse parameters. For this step, we propose an efficient pruning strategy that enables us to identify dimensions that do not need to be updated at that iteration. This pruning strategy has three advantages. First, by combining the sparsity during the proximal update and the graph mining tree search, we can eliminate unnecessary dimensions without enumerating all the possible subgraphs. Second, for variables removed at that iteration, the corresponding attribute vectors are also not required to be updated, which further accelerates the optimization. Third, our pruning strategy is designed so that it yields the same update result as when we do not perform the pruning (in other words, our pruning strategy does not deteriorate the resulting model accuracy).
Our contributions are summarized as follows:
- We propose an interpretable graph classification model, in which the prediction is defined through a linear combination of graphlets that have trainable attribute vectors. By imposing a sparse penalty on the coefficient of each AG, a small number of important AGs can be identified.
- To avoid directly handling an intractably large number of optimization variables, we propose an efficient pruning strategy based on the proximal gradient descent, which can safely ignore AGs that do not contribute to the update.
- We verify the effectiveness of LAGRA by empirical evaluations. Although our prediction model is simple and interpretable, its prediction performance was superior to or comparable with well-known standard graph classification methods, and in those results LAGRA actually used only a small number of AGs. Further, we show examples of selected AGs to demonstrate the high interpretability.
2 Proposed Method: LAGRA
In this section, we describe our proposed method, called Learning Attributed GRAphlets (LAGRA). First, in Section 2.1, we show the formulation of our model and the definition of the optimization problem. Second, in Section 2.2, we show an efficient optimization algorithm for LAGRA.
2.1 Formulation
2.1.1 Problem Setting
We consider a classification problem in which a graph $G$ is an input. The sets of nodes and edges of $G$ are written as $V(G)$ and $E(G)$, respectively. Each node $v \in V(G)$ has a categorical label $l_v$ and a continuous attribute vector $\boldsymbol{z}_v \in \mathbb{R}^d$, where $d$ is the attribute dimension. In this paper, an attribute indicates a continuous attribute vector. We assume that the label and the attribute vector are attached to a node, but the discussion in this paper applies in completely the same way to edge labels and attributes. A training dataset is $\{(G_i, y_i)\}_{i=1}^{n}$, in which $y_i \in \{-1, +1\}$ is a binary label and $n$ is the dataset size. Although we only focus on the classification problem, our framework is also applicable to regression problems simply by replacing the loss function.
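To make the problem setting concrete, the following minimal sketch shows one way such an attributed graph dataset can be represented; the use of networkx and the attribute names `label` and `z` are our own illustrative choices, not part of the original implementation.

```python
# Minimal sketch of the data format assumed in this setting (illustrative only).
# Each node carries a categorical label and a continuous attribute vector z in R^d.
import networkx as nx
import numpy as np

def make_attributed_graph(nodes, edges):
    """nodes: list of (node_id, categorical_label, attribute_vector)."""
    G = nx.Graph()
    for v, label, z in nodes:
        G.add_node(v, label=label, z=np.asarray(z, dtype=float))
    G.add_edges_from(edges)
    return G

# A toy molecule-like graph: labels are atom types, z could be 3D coordinates.
G1 = make_attributed_graph(
    nodes=[(0, "O", [0.0, 0.1, 0.2]), (1, "C", [1.1, 0.0, 0.3]), (2, "C", [2.0, 0.4, 0.1])],
    edges=[(0, 1), (1, 2)],
)
dataset = [(G1, +1)]  # list of (graph, binary label y in {-1, +1})
```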
2.1.2 Attributed Graphlet Inclusion Score
We consider extracting important small attributed graphs, which we call attributed graphlets (AGs), that contribute to the classification boundary. Note that throughout the paper, we only consider a connected graph as an AG for better interpretability (we do not consider an AG given by a disconnected graph). Let $x_H(G)$ be a feature representing the degree to which an input graph $G$ includes an AG $H$. We refer to $x_H(G)$ as the AG inclusion score (AGIS). Our proposed LAGRA identifies important AGs by applying feature selection to a model built on this AGIS feature.
Suppose that $\bar{H}$ is the labeled graph that has the categorical label of each node of $H$ but in which the attribute of each node is excluded from $H$ (the labeled counterpart $\bar{G}$ of an input graph $G$ is defined in the same way). We define AGIS so that it has a non-zero value only when $\bar{H}$ is included in $\bar{G}$:
$$ x_H(G) = \begin{cases} s(H, G) & \text{if } \bar{H} \sqsubseteq \bar{G}, \\ 0 & \text{otherwise,} \end{cases} \qquad (1) $$
where $\bar{H} \sqsubseteq \bar{G}$ means that $\bar{H}$ is a subgraph of $\bar{G}$, and $s(H, G)$ is a function that provides a continuous inclusion score of $H$ in $G$. The condition $\bar{H} \sqsubseteq \bar{G}$ makes AGIS highly interpretable. For example, in the case of chemical composition data, if $\bar{H}$ represents O-C (oxygen and carbon are connected) and $x_H(G) \neq 0$, then we can guarantee that $G$ must contain O-C. Figure 3(a) shows examples of $\bar{H}$ and $\bar{G}$.
The function $s(H, G)$ needs to be defined so that it represents how strongly the attribute vectors in $H$ can be matched to those of $G$. When $\bar{H} \sqsubseteq \bar{G}$, there exists at least one injection $\pi: V(H) \to V(G)$ that preserves node labels and edges among the matched nodes. Figure 3(b) shows an example in which there exist two such injections. Let $\Pi_{H,G}$ be the set of possible injections $\pi$. We define a similarity between $H$ and the subgraph of $G$ matched by $\pi$ as follows:
$$ s_\pi(H, G) = \exp\Bigl( -\frac{1}{\rho} \sum_{v \in V(H)} \bigl\| \boldsymbol{z}^H_v - \boldsymbol{z}^G_{\pi(v)} \bigr\|_2^2 \Bigr), $$
where $\rho > 0$ is a fixed parameter that adjusts the length scale. In $s_\pi(H, G)$, the sum of squared distances of attribute vectors between matched nodes is taken. To use this similarity in AGIS (1), we take the maximum over all the matchings $\pi \in \Pi_{H,G}$:
$$ s(H, G) = \max_{\pi \in \Pi_{H,G}} s_\pi(H, G). \qquad (2) $$
An intuition behind (2) is that it evaluates the inclusion of $H$ in $G$ based on the best matching of the attribute vectors. If $\bar{H} \sqsubseteq \bar{G}$ and there exists $\pi \in \Pi_{H,G}$ such that $\boldsymbol{z}^H_v = \boldsymbol{z}^G_{\pi(v)}$ for all $v \in V(H)$, then $x_H(G)$ takes the maximum value (i.e., $x_H(G) = 1$).
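To illustrate (1) and (2), a minimal sketch of the AGIS computation is given below, assuming the label- and edge-preserving injections have already been enumerated (in LAGRA they come from the graph mining step of Section 2.2.2); the function names and the exact parameterization of the length scale `rho` follow our reconstruction above and are not the authors' code.

```python
import numpy as np

def s_pi(H, G, pi, rho):
    """Similarity for one injection pi: dict mapping nodes of H to nodes of G."""
    sq = sum(np.sum((H.nodes[v]["z"] - G.nodes[pi[v]]["z"]) ** 2) for v in pi)
    return np.exp(-sq / rho)

def agis(H, G, injections, rho=1.0):
    """AG inclusion score x_H(G): 0 if the labeled graphlet is not contained in G,
    otherwise the best attribute matching over all injections, as in eq. (2)."""
    if not injections:          # no label/edge-preserving injection => H not included
        return 0.0
    return max(s_pi(H, G, pi, rho) for pi in injections)
```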

2.1.3 Model Definition
Our prediction model linearly combines the features $x_H(G)$ as follows:
$$ f(G) = \sum_{H \in \mathcal{G}} \beta_H \, x_H(G) + b = \boldsymbol{\beta}^\top \boldsymbol{x}(G) + b, \qquad (3) $$
where $\beta_H \in \mathbb{R}$ and $b \in \mathbb{R}$ are parameters, $\mathcal{G}$ is a set of candidate AGs, and $\boldsymbol{\beta}$ and $\boldsymbol{x}(G)$ are vectors containing $\beta_H$ and $x_H(G)$ for $H \in \mathcal{G}$, respectively. Let $\bar{\mathcal{G}}$ be the set of all the labeled subgraphs $\bar{H}$ with $|V(\bar{H})| \leq \text{maxpat}$ contained in the training input graphs $\bar{G}_1, \ldots, \bar{G}_n$, where $|V(\bar{H})|$ is the number of nodes in the labeled graph $\bar{H}$ and maxpat is the user-specified maximum size of AGs. The number of candidate AGs is set to the same size, $|\mathcal{G}| = |\bar{\mathcal{G}}|$. We define $\mathcal{G}$ as the set of attributed graphs created by giving trainable attribute vectors to each element of $\bar{\mathcal{G}}$. Figure 4 shows a toy example. Our optimization problem for $\boldsymbol{\beta}$, $b$ and the attribute vectors $Z$ is defined as the following regularized loss minimization, in which a sparse $\ell_1$ penalty is imposed on $\boldsymbol{\beta}$:
$$ \min_{\boldsymbol{\beta},\, b,\, Z} \ \sum_{i=1}^{n} \ell\bigl( y_i, f(G_i) \bigr) + \lambda \| \boldsymbol{\beta} \|_1, \qquad (4) $$
where $\lambda > 0$ is a regularization parameter and $\ell$ is a convex differentiable loss function. Here, we employ the squared hinge loss function [10]: $\ell(y, f(G)) = \max\bigl(0, 1 - y f(G)\bigr)^2$.
Since the objective function (4) induces a sparse solution for $\boldsymbol{\beta}$, we can identify a small number of important AGs as those having non-zero $\beta_H$. However, this optimization problem has an intractably large number of optimization variables ($\mathcal{G}$ contains all the possible subgraphs in the training dataset, and each $H \in \mathcal{G}$ has an attribute vector for each of its nodes). We propose an efficient optimization algorithm that mitigates this problem.
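For concreteness, the following minimal sketch evaluates the prediction (3) and the objective (4) under our reconstructed notation; here `X` denotes a precomputed n-by-|W| matrix of AGIS features for the currently active AGs, whereas in LAGRA itself these columns are produced on the fly by the mining step of Section 2.2.2.

```python
import numpy as np

def predict(X, beta, b):
    """f(G_i) = beta^T x(G_i) + b for all i; X has shape (n, num_AGs)."""
    return X @ beta + b

def squared_hinge(y, f):
    """Squared hinge loss l(y, f) = max(0, 1 - y f)^2, element-wise."""
    return np.maximum(0.0, 1.0 - y * f) ** 2

def objective(X, y, beta, b, lam):
    """Regularized loss (4): sum_i l(y_i, f(G_i)) + lam * ||beta||_1."""
    return squared_hinge(y, predict(X, beta, b)).sum() + lam * np.abs(beta).sum()
```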


2.2 Optimization
Our optimization algorithm is based on the block coordinate update [7] algorithm, in which (proximal) gradient descent alternately updates a block of variables. We update one of $\boldsymbol{\beta}$, $b$, and $Z$ at a time, while the other two blocks are fixed. First, the proximal gradient update is applied to $\boldsymbol{\beta}$ because it has the $\ell_1$ penalty. Second, for $b$, we calculate the optimal solution with the other variables fixed because it is easy to obtain. Third, for $Z$, we apply the usual gradient descent update because it does not have a sparse penalty.
The difficulty of the optimization problem (4) originates from the size of $\mathcal{G}$. We select a small subset $\mathcal{W} \subseteq \mathcal{G}$, and only $\boldsymbol{\beta}_{\mathcal{W}}$, defined as the sub-vector of $\boldsymbol{\beta}$ containing $\beta_H$ for $H \in \mathcal{W}$, and the corresponding attribute vectors are updated. We propose an efficient pruning strategy that combines the proximal gradient with graph mining, which enables us to select $\mathcal{W}$ without enumerating all the possible subgraphs. A notable characteristic of this approach is that it obtains completely the same result as when we do not restrict the size of the variables.
2.2.1 Update of $\boldsymbol{\beta}$, $b$, and $Z$
Before introducing the subset $\mathcal{W}$, we first describe the update rule of each variable. First, we apply the proximal gradient update to $\boldsymbol{\beta}$. Let
$$ \nabla_{\beta_H} L = \sum_{i=1}^{n} \frac{\partial \ell(y_i, f(G_i))}{\partial f(G_i)} \, x_H(G_i), \quad H \in \mathcal{G}, $$
be the derivative of the loss term in (4) with respect to $\beta_H$, and let $\nabla_{\boldsymbol{\beta}} L$ be the vector collecting these derivatives. Then, the update of $\boldsymbol{\beta}$ is defined by
$$ \boldsymbol{\beta} \leftarrow \mathrm{prox}_{\eta \lambda \| \cdot \|_1}\bigl( \boldsymbol{\beta} - \eta \nabla_{\boldsymbol{\beta}} L \bigr), \qquad (5) $$
where $\eta > 0$ is a step length, and
$$ \bigl[ \mathrm{prox}_{\eta \lambda \| \cdot \|_1}(\boldsymbol{v}) \bigr]_H = \mathrm{sign}(v_H) \max\bigl( |v_H| - \eta \lambda, 0 \bigr) $$
is the proximal operator of the $\ell_1$ penalty (note that the proximal gradient method for the $\ell_1$ penalty is often called ISTA [9], for which an accelerated variant called FISTA is also known; we employ ISTA here for simplicity). We select the step length $\eta$ by the standard backtracking line search.
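As a concrete illustration, below is a minimal sketch of the proximal (ISTA) step (5) with the soft-thresholding operator and a backtracking line search, under the squared hinge loss and our reconstructed notation; the halving factor and the tolerance are our own illustrative choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def loss_grad_beta(X, y, beta, b):
    """Gradient of the squared hinge loss term with respect to beta."""
    f = X @ beta + b
    r = -2.0 * y * np.maximum(0.0, 1.0 - y * f)   # dl/df for each sample
    return X.T @ r

def ista_step(X, y, beta, b, lam, eta=1.0, shrink=0.5):
    """One proximal gradient update of beta with backtracking on the step length."""
    loss = lambda bt: np.sum(np.maximum(0.0, 1.0 - y * (X @ bt + b)) ** 2)
    g = loss_grad_beta(X, y, beta, b)
    while True:
        new = soft_threshold(beta - eta * g, eta * lam)
        d = new - beta
        # standard sufficient-decrease condition for the proximal gradient method
        if loss(new) <= loss(beta) + g @ d + (d @ d) / (2 * eta) or eta < 1e-12:
            return new
        eta *= shrink
```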
The bias term $b$ is updated by
$$ b \leftarrow \mathop{\mathrm{arg\,min}}_{b} \ \sum_{i=1}^{n} \ell\bigl( y_i, \boldsymbol{\beta}^\top \boldsymbol{x}(G_i) + b \bigr), $$
which is the optimal solution of the original problem (4) for the given other variables $\boldsymbol{\beta}$ and $Z$. Since the objective function for $b$ is a differentiable convex function, the update rule of $b$ can be derived from the first-order condition as
$$ b = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \bigl( y_i - \boldsymbol{\beta}^\top \boldsymbol{x}(G_i) \bigr), \qquad (6) $$
where $\mathcal{I} = \{ i \mid y_i ( \boldsymbol{\beta}^\top \boldsymbol{x}(G_i) + b ) < 1 \}$. This update rule contains $b$ in $\mathcal{I}$. However, it is easy to calculate the update (6) without knowing $\mathcal{I}$ beforehand. Here, we omit the details because it is a simple one-dimensional problem (see supplementary Appendix A).
For the attribute vectors $Z$, we employ the standard gradient descent:
$$ \boldsymbol{z}^H_v \leftarrow \boldsymbol{z}^H_v - \eta' \, \nabla_{\boldsymbol{z}^H_v} \sum_{i=1}^{n} \ell\bigl( y_i, f(G_i) \bigr), \qquad (7) $$
where $\eta' > 0$ is a step length to which we apply the standard backtracking line search.
2.2.2 Gradient Pruning with Graph Mining
In every update of $\boldsymbol{\beta}$, we incrementally add the required AGs into $\mathcal{W}$. For the complement set $\mathcal{G} \setminus \mathcal{W}$, which contains AGs that have never been updated, we initialize $\beta_H = 0$ for $H \in \mathcal{G} \setminus \mathcal{W}$. For the initialization of a node attribute vector $\boldsymbol{z}^H_v$, we set the same initial vector whenever the node (categorical) labels are the same, i.e., $\boldsymbol{z}^H_v = \boldsymbol{z}^{H'}_{v'}$ if $l_v = l_{v'}$ (in practice, we use the average of the attribute vectors within each node label). This constraint is required for our pruning criterion, but only for the initial values. After the update (7), each $\boldsymbol{z}^H_v$ can take a different value.
Since $\beta_H = 0$ for $H \in \mathcal{G} \setminus \mathcal{W}$, it is easy to derive the following relation from the proximal update (5):
$$ \beta_H = 0 \ \text{ and } \ | \nabla_{\beta_H} L | \leq \lambda \ \Longrightarrow \ \beta_H \text{ remains } 0 \text{ after the update (5)}. \qquad (8) $$
This indicates that if the conditions on the left-hand side hold, we do not need to update $\beta_H$ because it remains $0$. Therefore, we set
$$ \mathcal{W} \leftarrow \mathcal{W} \cup \bigl\{ H \in \mathcal{G} \mid \beta_H \neq 0 \ \text{or} \ | \nabla_{\beta_H} L | > \lambda \bigr\}, \qquad (9) $$
and apply the update (5) only to $\boldsymbol{\beta}_{\mathcal{W}}$. However, evaluating $|\nabla_{\beta_H} L|$ for all $H \in \mathcal{G}$ can be computationally intractable because it requires enumerating all the possible subgraphs. The following theorem can be used to avoid this difficulty:
Theorem 2.1
Let $\bar{H}' \sqsubseteq \bar{H}$ for an AG $H \in \mathcal{G} \setminus \mathcal{W}$. Then,
$$ | \nabla_{\beta_H} L | \leq B(H'), $$
where
$$ B(H') = \max\Bigl( \sum_{i:\, r_i > 0} r_i \, x_{H'}(G_i), \ - \sum_{i:\, r_i < 0} r_i \, x_{H'}(G_i) \Bigr), $$
where $r_i = \partial \ell(y_i, f(G_i)) / \partial f(G_i)$.
See supplementary Appendix B for the proof. Note that $r_i$ here is defined by the current $b$, unlike (6). This theorem indicates that the gradient for any $H$ whose $\bar{H}$ contains $\bar{H}'$ as a subgraph can be bounded by $B(H')$. It should be noted that $B(H')$ can be calculated without generating $H$, and it mainly needs only the model prediction $f(G_i)$ with the current parameters, which can be immediately obtained at each iteration, and the AGIS $x_{H'}(G_i)$. The rule (8) reveals that, to identify $\mathcal{W}$, we only require to know whether $|\nabla_{\beta_H} L| > \lambda$ holds, and thus, an important consequence of Theorem 2.1 is the following rule:
$$ B(H') \leq \lambda \ \text{ and } \ \beta_H = 0 \ \text{ for all } H \text{ with } \bar{H}' \sqsubseteq \bar{H} $$
$$ \Longrightarrow \ \text{none of these } H \text{ need to be added to } \mathcal{W}. \qquad (10) $$
Therefore, if the conditions in the first line of (10) hold, any $H$ whose $\bar{H}$ contains $\bar{H}'$ as a subgraph can be discarded during that iteration. Further, from (7), we can immediately see that the attribute vectors of an AG $H$ are also not required to be updated if $\beta_H = 0$. This is an important fact because the updates of a large number of variables can be omitted.
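Under our reconstructed notation, the quantities entering the rules (8)-(10) at a single mining-tree node can be computed as in the following sketch, where `x_col` denotes the AGIS feature vector $x_{H'}(G_i)$ over the training set (in practice obtained from the gSpan occurrence lists).

```python
import numpy as np

def residuals(f, y):
    """r_i = dl/df for the squared hinge loss, at the current model output f."""
    return -2.0 * y * np.maximum(0.0, 1.0 - y * f)

def node_gradient_and_bound(x_col, r):
    """Exact gradient of the loss w.r.t. beta_{H'} and the bound B(H') that is
    valid for every AG whose labeled graphlet contains H' (Theorem 2.1)."""
    grad = r @ x_col
    bound = max(np.sum(r[r > 0] * x_col[r > 0]), -np.sum(r[r < 0] * x_col[r < 0]))
    return grad, bound

# Working-set test (9): keep H' if beta_{H'} != 0 or |grad| > lam.
# Subtree pruning (10): if bound <= lam, no descendant with zero coefficient
# can enter the working set at this iteration.
```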
Figure 5 shows an illustration of the forward and backward (gradient) computations of LAGRA. For the gradient pruning, an efficient algorithm can be constructed by combining the rule (10) and a graph mining algorithm. A well-known efficient graph mining algorithm is gSpan [11], which creates a search tree by recursively expanding the graph in each tree node as long as the expanded graph is included in the given set of graphs as a subgraph. An important characteristic of the mining tree is that every graph must contain all graphs of its ancestor nodes as subgraphs. Therefore, during the tree traversal (depth-first search) by gSpan, we can prune the entire subtree (all descendant nodes) if $B(H') \leq \lambda$ holds for the AG $H'$ in a tree node (Figure 5(b)). This means that we can update $\mathcal{W}$ by (9) without exhaustively investigating all the elements of $\mathcal{G}$.

gSpan has another advantage for LAGRA. To calculate the feature $x_H(G)$ defined through (2), LAGRA requires the set of injections $\Pi_{H,G}$ (an example is shown in Figure 3(b)). gSpan keeps the information of $\Pi_{H,G}$ during the tree traversal because it is required to expand a subgraph in each $G_i$ (see the authors' implementation https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm). Therefore, we can directly use the injections created by gSpan to calculate (2).
2.2.3 Algorithm
We here describe the entire procedure of the optimization of LAGRA. We employ the so-called regularization path following algorithm (e.g., [12]), in which the algorithm starts from a large value of the regularization parameter $\lambda$ and gradually decreases it while solving the problem for each $\lambda$. This strategy can start from a highly sparse $\boldsymbol{\beta}$, for which $\mathcal{W}$ usually also becomes small. Further, at each $\lambda$, the solution obtained with the previous $\lambda$ can be used as the initial value, by which faster convergence can be expected (so-called warm start).
Algorithm 1 shows the procedure of the regularization path following. We initialize $\boldsymbol{\beta} = \boldsymbol{0}$, which is obviously optimal when $\lambda$ is sufficiently large. In line 2 of Algorithm 1, we calculate $\lambda_{\max}$, the value at which $\boldsymbol{\beta}$ starts having non-zero values: $\lambda_{\max} = \max_{H \in \mathcal{G}} | \nabla_{\beta_H} L |$ evaluated at $\boldsymbol{\beta} = \boldsymbol{0}$ with the corresponding optimal bias. See supplementary Appendix C for the derivation. To find $\lambda_{\max}$, we can use almost the same gSpan-based pruning strategy by using the upper bound of $| \nabla_{\beta_H} L |$ as shown in Section 2.2.2 (the only difference is that we search only for the maximum value, instead of searching for all $H$ satisfying $| \nabla_{\beta_H} L | > \lambda$), though this process is omitted in Algorithm 1 for brevity. After setting $\lambda \leftarrow \lambda_{\max}$, the regularization parameter is decreased by using a pre-defined decreasing factor, as shown in line 6 of Algorithm 1. For each $\lambda$, the parameters $\boldsymbol{\beta}$, $b$ and $Z$ are alternately updated as described in Sections 2.2.1 and 2.2.2. We stop the alternating update by monitoring the performance on the validation dataset in line 14 (stopping by thresholding the decrease of the objective function is also possible).
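As an illustration of the path-following procedure, a minimal sketch is given below; the helper names `fit_at_lambda` and `validation_score`, as well as the decrease factor `R` and the number of grid points, are placeholders rather than the exact quantities of Algorithm 1.

```python
def regularization_path(lam_max, fit_at_lambda, validation_score, R=0.9, num_lams=20):
    """Warm-started path over decreasing lambda values (sketch of Algorithm 1).

    fit_at_lambda(lam, warm_start) performs the alternating beta / b / Z updates
    at a fixed lambda; validation_score(params) evaluates the held-out performance."""
    params = None                 # None = all-zero start (optimal for very large lambda)
    results = []
    lam = lam_max
    for _ in range(num_lams):
        lam *= R                                        # pre-defined decreasing factor
        params = fit_at_lambda(lam, warm_start=params)  # warm start from previous lambda
        results.append((lam, params, validation_score(params)))
    # the final model is chosen by the validation performance over the path
    return max(results, key=lambda t: t[2])
```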
The pruning strategy described in Section 2.2.2 is shown in Algorithm 2. This function recursively traverses the graph mining tree. At each tree node, first, $B(H')$ is evaluated to prune the subtree if possible. Then, if $\beta_{H'} \neq 0$ or $| \nabla_{\beta_{H'}} L | > \lambda$, $H'$ is included in $\mathcal{W}$. The expansion from $\bar{H}'$ (creating children in the graph mining tree) is performed by gSpan, by which only the subgraphs contained in the training set can be generated (see the original paper [11] for details of gSpan). The initialization of the trainable attributes is performed when $\bar{H}'$ is expanded (line 15).
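The following sketch conveys the spirit of Algorithm 2 under our reconstructed notation; `children` stands for the gSpan expansion (which generates only subgraphs occurring in the training data) and `agis_column` for the AGIS feature vector of the current tree node, both placeholders for the actual gSpan machinery.

```python
import numpy as np

def collect_working_set(node, lam, r, agis_column, children, W):
    """Depth-first traversal of the gSpan mining tree with subtree pruning (sketch).
    AGs whose coefficients are already non-zero are assumed to be carried over in
    the working set separately, so pruning here only concerns zero coefficients."""
    x_col = agis_column(node)                       # vector of x_{H'}(G_i) over the data
    bound = max((r[r > 0] * x_col[r > 0]).sum(),    # B(H') of Theorem 2.1
                -(r[r < 0] * x_col[r < 0]).sum())
    if bound <= lam:
        return                                      # rule (10): skip the whole subtree
    if abs(r @ x_col) > lam:                        # rule (9): this AG must be updated
        W[node] = x_col
    for child in children(node):                    # gSpan expansion: only subgraphs
        collect_working_set(child, lam, r, agis_column, children, W)   # in the data
```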
| | AIDS | BZR | COX2 | DHFR | ENZYMES | PROTEINS | SYNTHETIC | # best |
|---|---|---|---|---|---|---|---|---|
| GH | 0.9985 ± 0.0020 | 0.8458 ± 0.0327 | 0.7872 ± 0.0252 | 0.7250 ± 0.0113 | 0.6050 ± 0.0857 | 0.7277 ± 0.0332 | 0.6767 ± 0.0655 | 2 |
| ML | 0.9630 ± 0.0062 | 0.8289 ± 0.0141 | 0.7787 ± 0.0080 | 0.7105 ± 0.0300 | 0.6000 ± 0.0652 | 0.6205 ± 0.0335 | 0.4867 ± 0.0356 | 0 |
| PA | 0.9805 ± 0.0086 | 0.8313 ± 0.0076 | 0.7809 ± 0.0144 | 0.7316 ± 0.0435 | 0.7500 ± 0.0758 | 0.6884 ± 0.0077 | 0.5400 ± 0.0859 | 1 |
| DGCNN | 0.9830 ± 0.0046 | 0.8169 ± 0.0177 | 0.8021 ± 0.0401 | 0.7289 ± 0.0192 | 0.7289 ± 0.0192 | 0.7509 ± 0.0114 | 0.9867 ± 0.0125 | 3 |
| GCN | 0.9840 ± 0.0030 | 0.8290 ± 0.0460 | 0.8340 ± 0.0257 | 0.7490 ± 0.0312 | 0.7000 ± 0.0837 | 0.6880 ± 0.0202 | 0.9630 ± 0.0194 | 2 |
| GAT | 0.9880 ± 0.0041 | 0.8220 ± 0.0336 | 0.7830 ± 0.0274 | 0.7110 ± 0.0156 | 0.7100 ± 0.0768 | 0.7160 ± 0.0108 | 0.9800 ± 0.0267 | 2 |
| LAGRA (Proposed) | 0.9900 ± 0.0050 | 0.8892 ± 0.0207 | 0.8043 ± 0.0229 | 0.8171 ± 0.0113 | 0.6450 ± 0.0797 | 0.7491 ± 0.0142 | 1.0000 ± 0.0000 | 4 |
| # non-zero | 50.4 ± 17.1 | 52.4 ± 19.0 | 45.4 ± 14.9 | 40.0 ± 11.6 | 7.2 ± 8.4 | 25.8 ± 9.4 | 35.8 ± 35.0 | - |
3 Related Work
For graph-based prediction problems, graph neural networks (GNNs) [13] have recently attracted wide attention. However, interpreting GNNs is not easy in general. According to a recent review of explainable GNNs [14], almost all explainability studies for GNNs are instance-level explanations, which provide input-dependent explanations (here, we do not describe the individual input-dependent approaches because their purpose is clearly different from that of LAGRA). An exception is XGNN [15], in which important discriminative graphs are generated for an already trained GNN by maximizing the GNN output for a target label. However, unlike our method, the prediction model itself remains black-box, and thus it is difficult to know the underlying dependency between the identified graphs and the prediction.
A classical approach to graph-based prediction problems is the graph kernel [16]. Although the graph kernel itself does not identify important substructures, [17] has recently proposed an interpretable kernel-based GNN called KerGNN. KerGNN uses a graph kernel function as a trainable filter, inspired by well-known convolutional networks, and the filter updates the node attributes of the input graph so that they embed similarity to learned important subgraphs. Then, [17] claims that the resulting graph filters can be seen as key structures. However, a learned subgraph in a graph kernel filter is difficult to interpret: the kernel-based matching does not guarantee the existence of a subgraph, unlike our AGIS (1), and further, only the 1-hop neighbors of each node in the input graph are matched to a graph filter.
Another graph mining based approach is [18]. This approach also uses a pruning based acceleration for the optimization, but it is based on the optimality of a convex problem, while our proximal gradient pruning is applicable to the non-convex problem of LAGRA. Further, and more importantly, [18] cannot deal with continuous attributes. The prediction model of LAGRA is inspired by a method for learning time-series shapelets (LTS) [19]. LTS is also based on a linear combination of trainable shapelets, which are short fragments of a time-series sequence. Unlike time-series data, possible substructures in graph data have a combinatorial nature, because of which our problem setting has a computational difficulty that does not exist in the case of LTS; LAGRA provides an efficient graph mining based strategy for this difficulty.
4 Experiments
Here, we empirically verify the effectiveness of LAGRA. We used standard graph benchmark datasets, called AIDS, BZR, COX2, DHFR, ENZYMES, PROTEINS and SYNTHETIC, retrieved from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets (for ENZYMES, we only used two classes among the original six classes to make a binary problem). To simplify the comparison, we only used node labels and attributes, and did not use edge labels and attributes. Statistics of the datasets are summarized in supplementary Appendix D. The datasets are randomly divided into training, validation, and test sets. For the regularization path algorithm (Algorithm 1), we created candidate values of $\lambda$ by uniformly dividing the range of $\lambda$ into grid points. We selected $\lambda$, maxpat, and $\rho$ based on the validation performance.
4.1 Prediction Performance
For the prediction accuracy comparison, we used graph kernels and graph neural networks (GNNs). We used three well-known graph kernels that can handle continuous attributes, i.e., the graph hopper kernel (GH) [20], the multiscale Laplacian kernel (ML) [21] and the propagation kernel (PA) [22], for all of which the library GraKeL [23] was used. For the classifier, we employed $k$-nearest neighbor ($k$-NN) classification, for which each kernel function $k(\cdot, \cdot)$ defines the distance $d(G, G') = \sqrt{ k(G, G) + k(G', G') - 2 k(G, G') }$. The number of neighbors was optimized on the validation set. For GNNs, we used the deep graph convolutional neural network (DGCNN) [24], the graph convolutional network (GCN) [25], and the graph attention network (GAT) [26]. For DGCNN, the number of hidden units and the number of epochs were optimized on the validation set. The other settings were the default settings of the authors' implementation https://github.com/muhanzhang/pytorch_DGCNN. For GCN and GAT, we also selected the number of hidden units and epochs as above. For the other settings, we followed [27].
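For reference, the kernel-induced distance and the k-NN vote used for the graph kernel baselines can be computed as in the following sketch from precomputed kernel matrices (e.g., produced by GraKeL); this is a standard construction and an illustration of our setup rather than the exact evaluation code.

```python
import numpy as np

def kernel_knn_predict(K_train, K_test_train, k_test_diag, y_train, k=5):
    """k-NN prediction with the kernel-induced distance
    d(G, G')^2 = k(G, G) + k(G', G') - 2 k(G, G').
    K_train: (n, n) train kernel; K_test_train: (m, n); k_test_diag: (m,) self-similarities."""
    d2 = k_test_diag[:, None] + np.diag(K_train)[None, :] - 2.0 * K_test_train
    nn = np.argsort(d2, axis=1)[:, :k]            # indices of the k nearest training graphs
    votes = y_train[nn]                           # labels in {-1, +1}
    return np.sign(votes.sum(axis=1) + 1e-12)     # majority vote (ties broken toward +1)
```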
The results are shown in Table 1. LAGRA was the best or comparable with the best method (in the sense of a one-sided $t$-test) for BZR, DHFR, PROTEINS and SYNTHETIC (4 out of 7 datasets). For AIDS and COX2, LAGRA had accuracy values similar to the best methods, though they were not regarded as the best in the $t$-test. The three GNNs also showed stable performance overall. Although our main focus is to build an interpretable model, we see that LAGRA achieved accuracy comparable to the current standard methods. Further, LAGRA used only a small number of AGs, as shown in the bottom row of Table 1, which suggests the high interpretability of the learned models.



4.2 Examples of Selected Attributed Graphlets
We here show examples of identified important AGs. Figures 6 and 7 show the AGs having the two largest positive and negative $\beta_H$ for the DHFR and BZR datasets, respectively. In each figure, a labeled graphlet is shown on the left side (the numbers inside the graph nodes are the node labels) and the optimized attribute vectors of each node are shown as bar plots on the right side. We can clearly see important substructures, not only through the graph structure but also through the attribute values associated with each node.
Surprisingly, in a few datasets, the two classes can be separated even in a two-dimensional space of AGIS. Figure 2 and Figure 8 show scatter plots of the test dataset (not the training dataset) with axes given by features identified by the LAGRA training. Let $H_{+}$ and $H_{-}$ be the AGs having the largest positive and negative $\beta_H$, respectively. The horizontal and vertical axes of the plots are $x_{H_{+}}(G)$ and $x_{H_{-}}(G)$. In particular, in the AIDS dataset, for which the classification accuracy was very high in Table 1, the two classes are clearly separated. For DHFR, we can also see that points in the two classes tend to be located on the upper left side and the lower right side. The dashed lines are boundaries created by (class-balance weighted) logistic regression fitted to the test points in these two-dimensional spaces. The estimated class conditional probabilities for AIDS and BZR indicate that the differences between the two classes are captured even by only two AGs in these datasets.
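The dashed boundaries were obtained by fitting a class-balance weighted logistic regression in this two-dimensional AGIS space; a sketch with scikit-learn is shown below, where `x_pos` and `x_neg` denote the AGIS features of the two selected AGs (our naming, for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_2d_boundary(x_pos, x_neg, y_test):
    """Fit a class-balance weighted logistic regression on two AGIS features."""
    X2 = np.column_stack([x_pos, x_neg])            # (n_test, 2) feature matrix
    clf = LogisticRegression(class_weight="balanced").fit(X2, y_test)
    # decision boundary: w0 * x_pos + w1 * x_neg + b = 0
    (w0, w1), b = clf.coef_[0], clf.intercept_[0]
    return clf, (w0, w1, b)
```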
4.3 Discussion on Computational Time
Finally, we verify the computational time of LAGRA. First, Table 2 shows the size of the candidate AG set $|\mathcal{G}|$ in each dataset. As we described in Section 2.1.3, this size is equal to $|\bar{\mathcal{G}}|$, i.e., the number of all the possible subgraphs in the training datasets. Therefore, it can be quite large, as particularly shown for the ENZYMES, PROTEINS and SYNTHETIC datasets in Table 2. The optimization variables in the objective function (4) are $\boldsymbol{\beta}$, $b$, and the attribute vectors $Z$. The dimension of $\boldsymbol{\beta}$ is $|\mathcal{G}|$, and a node attribute vector of dimension $d$ exists for each node of each $H \in \mathcal{G}$. Thus, the number of optimization variables in (4), which scales with $|\mathcal{G}|$ and the number of nodes in each AG, can be prohibitively large.
AIDS | BZR | COX2 | DHFR | ENZYMES | PROTEINS | SYNTHETIC |
---|---|---|---|---|---|---|
134281 | 148903 | 101185 | 137872 | 15464000 | 13987000 | 699000 |
Figure 9 shows the computational time during the regularization path. The horizontal axis is the index of $\lambda$ in Algorithm 1. The datasets are AIDS and ENZYMES. In the regularization path algorithm, the number of non-zero $\beta_H$ typically increases during the process of decreasing $\lambda$, because the $\ell_1$ penalty gradually becomes weaker. As a result, in both plots, the total time increases with the index. Although LAGRA traverses the graph mining tree in every iteration of the gradient update (line 9 in Algorithm 1), Figure 9 shows that the traversal time was not necessarily dominant (note that the vertical axis is in log scale). In particular, when only a small number of tree nodes are newly expanded at that $\lambda$, the calculation for the tree search becomes faster because the AGIS is already computed at most of the tree nodes. The total computational time remained moderate for these datasets. We do not claim that LAGRA is computationally faster than other standard algorithms (such as graph kernels), but considering that the optimization problem has a huge number of variables, the results clearly indicate the effectiveness of our pruning based optimization approach.


Figure 10 shows the average number of traversed graph mining tree nodes and the size of the selected $\mathcal{W}$ for AIDS and ENZYMES. In this figure, both values increase as $\lambda$ decreases because the effect of the sparse penalty becomes weaker. As shown in Table 2, the total number of tree nodes was 134281 for AIDS and more than $10^7$ for ENZYMES. Figure 10 shows that the numbers of traversed nodes were far smaller than these totals. This clearly indicates that our pruning strategy can drastically reduce the number of tree nodes at which the evaluation of the bound is required, as shown in Algorithm 2. In the figure, we can also see that $|\mathcal{W}|$ was even smaller. This indicates that the updated $\boldsymbol{\beta}$ was highly sparse, by which the update of the attribute vectors becomes easier because it is required only for AGs with non-zero $\beta_H$.


5 Conclusion
This paper proposed LAGRA (Learning Attributed GRAphlets), which learns a prediction model that linearly combines attributed graphlets (AGs). In LAGRA, the graph structures of AGs are generated through a graph mining algorithm, and the attribute vectors are optimized as continuous trainable parameters. To identify a small number of AGs, a sparse penalty is imposed on the coefficients of the AGs. We employed a block coordinate update based optimization algorithm, in which an efficient pruning strategy was proposed by combining the proximal gradient update and the graph mining tree search. Our empirical evaluation showed that LAGRA has performance superior or comparable to standard graph classification algorithms. We further demonstrated that LAGRA can actually identify a small number of discriminative AGs that have high interpretability.
Acknowledgments
This work was partially supported by MEXT KAKENHI (21H03498, 22H00300, 23K17817), and International Joint Usage/Research Project with ICR, Kyoto University (2023-34).
References
- [1] L. Ralaivola, S. J. Swamidass, H. Saigo, and P. Baldi, “Graph kernels for chemical informatics,” Neural Networks, vol. 18, no. 8, pp. 1093–1110, 2005.
- [2] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, “Prediction errors of molecular machine learning models lower than hybrid dft error,” Journal of Chemical Theory and Computation, vol. 13, no. 11, pp. 5255–5264, 2017.
- [3] T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties,” Phys. Rev. Lett., vol. 120, p. 145301, 2018.
- [4] S.-Y. Louis, Y. Zhao, A. Nasiri, X. Wang, Y. Song, F. Liu, and J. Hu, “Graph convolutional neural networks with global attention for improved materials property prediction,” Phys. Chem. Chem. Phys., vol. 22, pp. 18 141–18 148, 2020.
- [5] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
- [6] N. Pržulj, “Biological network comparison using graphlet degree distribution,” Bioinformatics, vol. 23, no. 2, pp. e177–e183, 2007.
- [7] Y. Xu and W. Yin, “A globally convergent algorithm for nonconvex optimization based on block coordinate update,” Journal of Scientific Computing, vol. 72, pp. 700–734, 2017.
- [8] M. Teboulle, “A simplified view of first order methods for optimization,” Mathematical Programming, vol. 170, pp. 67–96, 2017.
- [9] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
- [10] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” Schedae Informaticae, vol. 25, p. 49, 2016.
- [11] X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,” in Proceedings. 2002 IEEE International Conference on Data Mining. IEEE, 2002, pp. 721–724.
- [12] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, “Pathwise coordinate optimization,” The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 12 2007.
- [13] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
- [14] H. Yuan, H. Yu, S. Gui, and S. Ji, “Explainability in graph neural networks: A taxonomic survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [15] H. Yuan, J. Tang, X. Hu, and S. Ji, “XGNN: Towards model-level explanations of graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2020, pp. 430–438.
- [16] N. M. Kriege, F. D. Johansson, and C. Morris, “A survey on graph kernels,” Applied Network Science, vol. 5, p. 6, 2020.
- [17] A. Feng, C. You, S. Wang, and L. Tassiulas, “KerGNNs: Interpretable graph neural networks with graph kernels,” in Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI. AAAI Press, 2022, pp. 6614–6622.
- [18] K. Nakagawa, S. Suzumura, M. Karasuyama, K. Tsuda, and I. Takeuchi, “Safe pattern pruning: An efficient approach for predictive pattern mining,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1785–1794.
- [19] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, “Learning time-series shapelets,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2014, pp. 392–401.
- [20] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt, “Scalable kernels for graphs with continuous attributes,” in Advances in Neural Information Processing Systems, 2013, pp. 216–224.
- [21] R. Kondor and H. Pan, “The multiscale Laplacian graph kernel,” in Advances in Neural Information Processing Systems, 2016, pp. 2990–2998.
- [22] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine Learning, vol. 102, no. 2, pp. 209–245, 2016.
- [23] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgiannis, “GraKeL: A graph kernel library in python,” Journal of Machine Learning Research, vol. 21, no. 54, pp. 1–5, 2020.
- [24] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [25] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- [26] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
- [27] J. You, J. M. Gomes-Selman, R. Ying, and J. Leskovec, “Identity-aware graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 10 737–10 745.
Supplementary Appendix for
“Learning Attributed Graphlets: Predictive Graph Mining by Graphlets with Trainable Attribute”
Appendix A Update of $b$
The objective function of $b$ can be re-written as
$$ J(b) = \sum_{i=1}^{n} \max\bigl( 0,\, 1 - y_i ( g_i + b ) \bigr)^2, \qquad (11) $$
where $g_i = \boldsymbol{\beta}^\top \boldsymbol{x}(G_i)$. For simplicity, we assume that all $c_i = y_i - g_i$ for $i = 1, \ldots, n$ have different values (even when this does not hold, the optimal solution can be obtained by the same approach). Depending on $b$, the set of active elements $\mathcal{I}(b) = \{ i \mid y_i (g_i + b) < 1 \}$ changes in a piecewise constant manner. The points at which $\mathcal{I}(b)$ changes are characterized by the solutions of the equation $y_i (g_i + b) = 1$ with respect to $b$, i.e., $b = c_i$, and there exist $n + 1$ segments on the space of $b$. Let $S_k$ be the $k$-th segment ($k = 1, \ldots, n+1$), defined by consecutive sorted values of $c_i$ with the two outer segments extending to $-\infty$ and $+\infty$, and let $\mathcal{I}_k$ be $\mathcal{I}(b)$ when $b \in S_k$. Under the assumption $\mathcal{I}(b) = \mathcal{I}_k$, the optimal $b$ is
$$ b_k = \frac{1}{|\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} c_i. $$
Since (11) is convex with respect to $b$, the obtained $b_k$ must be the optimal solution if it satisfies $b_k \in S_k$. Thus, the optimal $b$ can be found by calculating $b_k$ for all $k$.
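A sketch of this piecewise procedure, under the reconstruction $c_i = y_i - g_i$ with $g_i = \boldsymbol{\beta}^\top \boldsymbol{x}(G_i)$, is given below; it returns the segment-wise quadratic minimizer that falls inside its own segment, which by convexity is the global optimum.

```python
import numpy as np

def optimal_bias(g, y):
    """Exact minimizer of sum_i max(0, 1 - y_i (g_i + b))^2 over b (sketch).
    g: current scores beta^T x(G_i); y: labels in {-1, +1}."""
    c = np.sort(y - g)                            # breakpoints where the active set changes
    bounds = np.concatenate(([-np.inf], c, [np.inf]))
    best_b, best_val = None, np.inf
    for k in range(len(bounds) - 1):
        lo, hi = bounds[k], bounds[k + 1]
        # pick an interior point of the segment to read off its (constant) active set
        mid = hi - 1.0 if lo == -np.inf else (lo + 1.0 if hi == np.inf else 0.5 * (lo + hi))
        active = y * (g + mid) < 1
        if not active.any():                      # loss is exactly zero on this segment
            if best_val > 0.0:
                best_b, best_val = mid, 0.0
            continue
        b_k = np.mean((y - g)[active])            # unconstrained minimizer for this active set
        if lo <= b_k <= hi:                       # lies in its own segment => candidate optimum
            val = np.sum(np.maximum(0.0, 1.0 - y * (g + b_k)) ** 2)
            if val < best_val:
                best_b, best_val = b_k, val
    return best_b
```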
Appendix B Proof of Theorem 2.1
From the definition of AGIS, the following monotonicity property is guaranteed for $\bar{H}' \sqsubseteq \bar{H}$ (with the attribute vectors taking the common per-label initial values described in Section 2.2.2):
$$ 0 \leq x_H(G) \leq x_{H'}(G). $$
Let $\Pi_{H,G}$ be the set of injections between $H$ and $G$. The above monotonicity property can be easily verified from the fact that restricting any $\pi \in \Pi_{H,G}$ to the nodes of $H'$ yields an injection in $\Pi_{H',G}$, and the sum of squared attribute distances can only increase when additional nodes are matched, so that $s_\pi(H, G) \leq \max_{\pi' \in \Pi_{H',G}} s_{\pi'}(H', G)$ for every $\pi$.
Define
$$ r_i = \frac{\partial \ell(y_i, f(G_i))}{\partial f(G_i)} = -2 y_i \max\bigl( 0, 1 - y_i f(G_i) \bigr). $$
Note that the sign of $r_i$ is the same as $-y_i$ (or $r_i = 0$). Using $r_i$, we re-write $\nabla_{\beta_H} L$ as
$$ \nabla_{\beta_H} L = \sum_{i=1}^{n} r_i \, x_H(G_i) = \sum_{i:\, r_i > 0} r_i \, x_H(G_i) + \sum_{i:\, r_i < 0} r_i \, x_H(G_i). $$
From the monotonicity inequality $0 \leq x_H(G_i) \leq x_{H'}(G_i)$, we see
$$ \sum_{i:\, r_i > 0} r_i \, x_H(G_i) \leq \sum_{i:\, r_i > 0} r_i \, x_{H'}(G_i), \qquad \sum_{i:\, r_i < 0} r_i \, x_H(G_i) \geq \sum_{i:\, r_i < 0} r_i \, x_{H'}(G_i). $$
From these inequalities, we obtain
$$ | \nabla_{\beta_H} L | \leq \max\Bigl( \sum_{i:\, r_i > 0} r_i \, x_{H'}(G_i), \ - \sum_{i:\, r_i < 0} r_i \, x_{H'}(G_i) \Bigr) = B(H'). $$
Appendix C Derivation of $\lambda_{\max}$
When $\boldsymbol{\beta} = \boldsymbol{0}$, the objective function of $b$ is written as the following piecewise quadratic function:
$$ J(b) = \sum_{i=1}^{n} \max(0, 1 - y_i b)^2 = n_{+} \max(0, 1 - b)^2 + n_{-} \max(0, 1 + b)^2, $$
where $n_{+}$ and $n_{-}$ are the numbers of positive and negative training labels. In the region $b \leq -1$, the minimum value is achieved by $b = -1$, and for $b \geq 1$, the minimum value is achieved by $b = 1$. This indicates that the optimal solution should exist in $[-1, 1]$ because the objective function is a smooth convex function. Therefore, the minimizer in the region $[-1, 1]$, given by
$$ b_0 = \frac{n_{+} - n_{-}}{n_{+} + n_{-}}, $$
becomes the optimal solution.
When $\boldsymbol{\beta} = \boldsymbol{0}$ and $b = b_0$, we obtain
$$ \nabla_{\beta_H} L = \sum_{i=1}^{n} r_i \, x_H(G_i), \quad \text{where } r_i = -2 y_i (1 - y_i b_0). $$
From (8), we see that $\lambda \geq | \nabla_{\beta_H} L |$ is the threshold for $\beta_H$ to have a non-zero value. This means that the $\beta_H$ having the maximum $| \nabla_{\beta_H} L |$ is the first coefficient that starts having a non-zero value when decreasing $\lambda$. Therefore, we obtain
$$ \lambda_{\max} = \max_{H \in \mathcal{G}} \Bigl| \sum_{i=1}^{n} r_i \, x_H(G_i) \Bigr|. $$
Appendix D Statistics of datasets
Statistics of the datasets are shown in Table 3.
| | AIDS | BZR | COX2 | DHFR | ENZYMES | PROTEINS | SYNTHETIC |
|---|---|---|---|---|---|---|---|
| # instances | 2000 | 405 | 467 | 756 | 200 | 1113 | 300 |
| Dim. of attribute vector | 4 | 3 | 3 | 3 | 18 | 1 | 1 |
| Avg. # nodes | 15.69 | 35.75 | 41.22 | 42.43 | 32.58 | 39.06 | 100.00 |
| Avg. # edges | 16.20 | 38.36 | 43.45 | 44.54 | 60.78 | 72.82 | 196.00 |
Appendix E Additional Examples of Selected AGs
Figure 11 shows selected AGs for the AIDS dataset.



