
Learning Attributed Graphlets:
Predictive Graph Mining by Graphlets with Trainable Attribute

Tajima Shinji Nagoya Institute of Technology Ren Sugihara Nagoya Institute of Technology Ryota Kitahara Nagoya Institute of Technology Masayuki Karasuyama Nagoya Institute of Technology
Abstract

The graph classification problem has been widely studied; however, achieving an interpretable model with high predictive performance remains a challenging issue. This paper proposes an interpretable classification algorithm for attributed graph data, called LAGRA (Learning Attributed GRAphlets). LAGRA learns importance weights for small attributed subgraphs, called attributed graphlets (AGs), while simultaneously optimizing their attribute vectors. This enables us to obtain a combination of subgraph structures and their attribute vectors that strongly contribute to discriminating different classes. A significant characteristic of LAGRA is that all the subgraph structures in the training dataset can be considered as candidate structures of AGs. This approach can explore all the potentially important subgraphs exhaustively, but obviously, a naïve implementation can require a large amount of computation. To mitigate this issue, we propose an efficient pruning strategy that combines proximal gradient descent and a graph mining tree search. Our pruning strategy ensures that the quality of the solution is maintained compared to the result without pruning. We empirically demonstrate that LAGRA has prediction performance superior or comparable to standard existing algorithms, including graph neural networks, while using only a small number of AGs in an interpretable manner.

1 Introduction

Prediction problems with a graph input, such as graph classification problems, have been widely studied in the data science community. A graph representation is useful to capture structural data, and graph-based machine learning algorithms have been applied to a variety of application problems such as chemical composition analysis [1, 2] and crystal structure analysis [3, 4]. In real-world datasets, graphs often have node attributes given as continuous value vectors (note that we only focus on node attributes throughout the paper, but the discussion is the same for edge attributes). For example, a graph created from a chemical composition can have the three dimensional position of each atom as an attribute vector in addition to a categorical label such as the atomic species. In this paper, we consider building an interpretable prediction model for a graph classification problem in which an input graph has continuous attribute vectors. As we will see in Section 3, this setting has not been widely studied despite its practical importance.

Our framework can identify important small subgraphs, called graphlets, in which each node has an attribute vector. Note that we use the term graphlet simply to denote a small connected subgraph [5], though in some papers, it only indicates induced subgraphs [6]. Figure 1 shows an illustration of our prediction model. In the figure, the output of the prediction model f(G)=\beta_{0}+\beta_{H_{1}}\psi(G;H_{1})+\beta_{H_{2}}\psi(G;H_{2})+\cdots for an input attributed graph G is defined through a linear combination of attributed graphlets (AGs), represented as H_{1},H_{2},\ldots, each of which is weighted by a parameter \beta_{H_{1}},\beta_{H_{2}},\ldots. The function \psi(G;H) evaluates a matching score between G and an AG H in the sense of how precisely G contains the AG H. We apply a sparse regularization to the parameters \beta_{H}, by which a small number of important AGs for classification can be obtained, i.e., an AG with non-zero \beta_{H} (in particular, one with large |\beta_{H}|) can be regarded as a discriminative AG.

Refer to caption
Figure 1: Illustration of our attributed graphlet (AG) based prediction model. The color of each graph node represents its node label, and the bar plot associated with each graph node represents a trainable attribute vector.
Refer to caption
Figure 2: An example of important attributed graphlets H^{+} and H^{-} identified by LAGRA in the AIDS dataset. H^{+} and H^{-} contribute positively and negatively to the prediction, respectively. The right plot is a scatter plot in which the x- and y-axes are our graphlet features representing how precisely H^{+} and H^{-} are included in the input graph G_{i}. Each point is from the test dataset.

Important subgraphs and attribute vectors are usually unknown beforehand. The basic strategy of our proposed method, called LAGRA (Learning Attributed GRAphlets), is as follows:

  • To explore potentially important substructures of graphs, i.e., subgraphs, LAGRA uses graph mining, by which all the subgraphs in the given dataset, up to a given maximum graph size, can be considered.

  • For continuous attributes in AGs, we optimize them as trainable parameters.

Figure 2 is an example of AGs identified by LAGRA, which shows that only two AGs clearly separate the two classes (see Section 4.2 for details). Since the number of possible subgraphs is quite large and an attribute vector exists for each node in each subgraph, a naïve implementation becomes computationally intractable. For efficient optimization, we employ a block coordinate update [7] based approach in which \bm{\beta} (a vector containing \beta_{H}), the bias term \beta_{0}, and the attribute vectors are alternately updated. In the alternate update of \bm{\beta}, we apply proximal gradient descent [8, 9], which is known as an effective algorithm to optimize sparse parameters. For this step, we propose an efficient pruning strategy, enabling us to identify dimensions that do not need to be updated at that iteration. This pruning strategy has three advantages. First, by combining the sparsity during the proximal update and the graph mining tree search, we can eliminate unnecessary dimensions without enumerating all the possible subgraphs. Second, for variables \beta_{H} removed at that iteration, the attribute vectors in H also do not need to be updated, which further accelerates the optimization. Third, our pruning strategy is designed so that it maintains the update result compared with when we do not perform the pruning (in other words, our pruning strategy does not deteriorate the resulting model accuracy).

Our contributions are summarized as follows:

  • We propose an interpretable graph classification model, in which the prediction is defined through a linear combination of graphlets that have trainable attribute vectors. By imposing a sparse penalty on the coefficient of each AG, a small number of important AGs can be identified.

  • To avoid directly handling an intractably large size of optimization variables, we propose an efficient pruning strategy based on the proximal gradient descent, which can safely ignore AGs that do not contribute to the update.

  • We verify the effectiveness of LAGRA by empirical evaluations. Although our prediction model is simple and interpretable, we show that the prediction performance of LAGRA was superior to or comparable with well-known standard graph classification methods, and in those results, LAGRA actually used only a small number of AGs. Further, we also show examples of selected AGs to demonstrate the high interpretability.

2 Proposed Method: LAGRA

In this section, we describe our proposed method, called Learning Attributed GRAphlets (LAGRA). First, in Section 2.1, we show the formulation of our model and the definition of the optimization problem. Second, in Section 2.2, we show an efficient optimization algorithm for LAGRA.

2.1 Formulation

2.1.1 Problem Setting

We consider a classification problem in which a graph G is an input. The sets of nodes and edges of G are written as V_{G} and E_{G}, respectively. Each node v\in V_{G} has a categorical label L_{v} and a continuous attribute vector \bm{z}^{G}_{v}\in\mathbb{R}^{d}, where d is the attribute dimension. In this paper, an attribute indicates a continuous attribute vector. We assume that a label and an attribute vector are given for each node, but the discussion in this paper applies in exactly the same way to edge labels and attributes. A training dataset is \{(G_{i},y_{i})\}_{i\in[n]}, in which y_{i}\in\{-1,+1\} is a binary label and n is the dataset size, where [n]=\{1,\ldots,n\}. Although we only focus on the classification problem, our framework is also applicable to the regression problem just by replacing the loss function.

2.1.2 Attributed Graphlet Inclusion Score

We consider extracting important small attributed graphs, which we call attributed graphlets (AGs), that contribute to the classification boundary. Note that throughout the paper, we only consider connected graphs as AGs for better interpretability (we do not consider an AG given by a disconnected graph). Let \psi(G_{i};H)\in[0,1] be a feature representing the degree to which an input graph includes an AG H. We refer to \psi(G_{i};H) as the AG inclusion score (AGIS). Our proposed LAGRA identifies important AGs by applying feature selection to a model with this AGIS feature.

Suppose that L(G) is the labeled graph having a categorical label L_{v} for each node, i.e., L(G) is obtained from G by excluding the attribute \bm{z}^{G}_{v} of each node. We define AGIS so that it has a non-zero value only when L(H) is included in L(G_{i}):

\psi(G_{i};H)=\begin{cases}\phi_{H}(G_{i})&\text{if }L(H)\sqsubseteq L(G_{i}),\\ 0&\text{otherwise},\end{cases} (1)

where L(H)\sqsubseteq L(G_{i}) means that L(H) is a subgraph of L(G_{i}), and \phi_{H}(G_{i})\in(0,1] is a function that provides a continuous inclusion score of H in G_{i}. The condition L(H)\sqsubseteq L(G_{i}) makes AGIS highly interpretable. For example, in the case of chemical composition data, if L(H) represents O-C (oxygen and carbon are connected) and \psi(G_{i};H)>0, then we can guarantee that G_{i} must contain O-C. Figure 3(a) shows examples of L(G_{i}) and L(H).

The function \phi_{H}(G_{i}) needs to be defined so that it represents how strongly the attribute vectors in H can be matched to those of G_{i}. When L(H)\sqsubseteq L(G_{i}), there exists at least one injection m:V_{H}\rightarrow V_{G_{i}} in which m(v) for v\in V_{H} preserves node labels and edges among v\in V_{H}. Figure 3(b) shows an example in which there exist two injections. Let M be the set of possible injections m. We define a similarity between H and the subgraph of G_{i} matched by m\in M as follows:

\mathrm{Sim}(H,G_{i};m)=\exp\left(-\rho\sum_{v\in V_{H}}\left\|\bm{z}^{H}_{v}-\bm{z}^{G_{i}}_{m(v)}\right\|^{2}\right),

where \rho>0 is a fixed parameter that adjusts the length scale. Inside the \exp, the sum of squared distances of the attribute vectors between matched nodes is taken. To use this similarity in AGIS (1), we take the maximum among all the matchings M:

\phi_{H}(G_{i})=\mathrm{MaxPooling}(\left\{\mathrm{Sim}(H,G_{i};m):m\in M\right\}). (2)

The intuition behind (2) is that it evaluates the inclusion of H in G_{i} based on the best matching in the sense of \mathrm{Sim}(H,G_{i};m) for m\in M. If L(H)\sqsubseteq L(G_{i}) and there exists m such that \bm{z}^{H}_{v}=\bm{z}^{G_{i}}_{m(v)} for \forall v\in V_{H}, then \phi_{H}(G_{i}) takes the maximum value (i.e., 1).
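To make (1) and (2) concrete, the following is a minimal sketch of how AGIS could be computed once the set of label- and edge-preserving injections M is available (in LAGRA, M is obtained from the gSpan traversal described in Section 2.2.2). The data layout and function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def agis(z_H, z_G, injections, rho):
    """AG inclusion score psi(G; H) as in (1)-(2).

    z_H        : dict mapping each node of H to its attribute vector (numpy array)
    z_G        : dict mapping each node of G to its attribute vector
    injections : list of dicts m with m[v] in V_G for v in V_H, each preserving
                 node labels and edges; an empty list means L(H) is not a
                 subgraph of L(G), so psi = 0
    rho        : length-scale parameter rho > 0
    """
    if not injections:
        return 0.0
    scores = []
    for m in injections:
        sq_dist = sum(np.sum((z_H[v] - z_G[m[v]]) ** 2) for v in z_H)
        scores.append(np.exp(-rho * sq_dist))
    return max(scores)  # max-pooling over all matchings in M
```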

Refer to caption
Figure 3: Examples of matchings between a graph and AGs (colors of graph nodes are node labels). (a) For two AGs H and H^{\prime}, L(G_{i}) contains only L(H), and L(H^{\prime}) is not contained. Then, \psi(G_{i};H)>0 and \psi(G_{i};H^{\prime})=0. (b) An example of the set of injections M=\{m,m^{\prime}\}, where m(1)=2, m(2)=3, m^{\prime}(1)=1, and m^{\prime}(2)=4. The figure shows that m and m^{\prime} are label and edge preserving.

2.1.3 Model Definition

Our prediction model linearly combines the features \psi(G_{i};H) as follows:

f(G_{i})=\sum_{H\in{\mathcal{H}}}\psi(G_{i};H)\beta_{H}+\beta_{0}=\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}, (3)

where \beta_{H} and \beta_{0} are parameters, {\mathcal{H}} is a set of candidate AGs, and \bm{\beta} and \bm{\psi}_{i} are vectors containing \beta_{H} and \psi(G_{i};H) for H\in{\mathcal{H}}, respectively. Let {\mathcal{L}}=\{L\mid L\subseteq L(G_{i}),i\in[n],|L|\leq\text{maxpat}\} be the set of all labeled subgraphs contained in the training input graphs \{G_{i}\}_{i\in[n]}, where |L| is the number of nodes in the labeled graph L and maxpat is the user-specified maximum size of AGs. The number of candidate AGs |{\mathcal{H}}| is the same as |{\mathcal{L}}|: we set {\mathcal{H}} as the set of attributed graphs created by giving trainable attribute vectors \bm{z}_{v}^{H} (v\in V_{H}) to each element of {\mathcal{L}}. Figure 4 shows a toy example. Our optimization problem for \beta_{H}, \beta_{0} and \bm{z}_{v}^{H} is defined as the following regularized loss minimization in which the sparse L1 penalty is imposed on \beta_{H}:

\min_{\bm{\beta},\beta_{0},{\mathcal{Z}}_{\mathcal{H}}}\ \frac{1}{2}\sum_{i=1}^{n}\ell(y_{i},f(G_{i}))+\lambda\sum_{H\in{\mathcal{H}}}|\beta_{H}|, (4)

where {\mathcal{Z}}_{\mathcal{H}}=\{\bm{z}^{H}_{v}\mid v\in V_{H},H\in{\mathcal{H}}\}, and \ell is a convex differentiable loss function. Here, we employ the squared hinge loss function [10]:

\ell(y_{i},f(G_{i}))=\max(1-y_{i}f(G_{i}),0)^{2}.

Since the objective function (4) induces a sparse solution for \beta_{H}, we can identify a small number of important AGs as the H having non-zero \beta_{H}. However, this optimization problem has an intractably large number of optimization variables ({\mathcal{H}} contains all the possible subgraphs in the training dataset, and each H\in{\mathcal{H}} has an attribute vector \bm{z}_{v}^{H}\in\mathbb{R}^{d} for each of its nodes). We propose an efficient optimization algorithm that mitigates this problem.
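For reference, the model (3) and the objective (4) can be written compactly once the AGIS values are collected into an n x |H| matrix. The sketch below assumes such an explicit matrix (feasible only for a restricted candidate set); the variable names are ours.

```python
import numpy as np

def predict(Psi, beta, beta0):
    """f(G_i) = psi_i^T beta + beta_0 for all i, where Psi[i, :] = psi_i."""
    return Psi @ beta + beta0

def objective(Psi, y, beta, beta0, lam):
    """Regularized loss (4): squared hinge loss plus L1 penalty on beta."""
    margins = np.maximum(1.0 - y * predict(Psi, beta, beta0), 0.0)
    return 0.5 * np.sum(margins ** 2) + lam * np.sum(np.abs(beta))
```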

Refer to caption
Figure 4: An example of training data, {\mathcal{L}} and {\mathcal{H}}. Since {\mathcal{L}} only includes subgraphs that appear in the training data, the graph shown in the figure that does not appear in any training graph is not included in {\mathcal{L}}. {\mathcal{H}} is created from {\mathcal{L}} by adding trainable attribute vectors \bm{z}^{H_{i}}_{v} (v\in V_{H_{i}}).

2.2 Optimization

Our optimization algorithm is based on the block coordinate update [7] algorithm, in which (proximal) gradient descent alternately updates a block of variables. We update one of \bm{\beta}, \beta_{0} and {\mathcal{Z}}_{\mathcal{H}} alternately, while the other two are fixed. First, the proximal gradient update is applied to \bm{\beta} because it has the L1 penalty. Second, for \beta_{0}, we calculate the optimal solution with the other variables fixed because it is easy to obtain. Third, for {\mathcal{Z}}_{\mathcal{H}}, we apply the usual gradient descent update because it does not have a sparse penalty.

The difficulty of the optimization problem (4) originates from the size of {\mathcal{H}}. We select a small subset {\mathcal{W}}\subseteq{\mathcal{H}}, and update only \bm{\beta}_{{\mathcal{W}}}\in\mathbb{R}^{|{\mathcal{W}}|}, defined by \beta_{H} for H\in{\mathcal{W}}, and the corresponding attribute vectors {\mathcal{Z}}_{\mathcal{W}}=\{\bm{z}^{H}_{v}\mid v\in V_{H},H\in{\mathcal{W}}\}\subseteq{\mathcal{Z}}_{\mathcal{H}}. We propose an efficient pruning strategy that combines the proximal gradient with graph mining, which enables us to select {\mathcal{W}} without enumerating all the possible subgraphs. A notable characteristic of this approach is that it obtains exactly the same result as when we do not restrict the size of variables.

2.2.1 Update \bm{\beta}, \beta_{0} and {\mathcal{Z}}_{\mathcal{H}}

Before introducing the subset {\mathcal{W}}, we first describe the update rules for each variable. First, we apply the proximal gradient update to \bm{\beta}. Let

g_{H}(\bm{\beta})=\sum_{i=1}^{n}{\frac{\partial\ell(y_{i},f(G_{i}))}{\partial f(G_{i})}}\psi(G_{i};H)

be the derivative of the loss term in (4) with respect to \beta_{H}. Then, the update of \bm{\beta} is defined by

\beta^{\rm(new)}_{H}\leftarrow\mathrm{prox}\left(\beta_{H}-\eta\ g_{H}(\bm{\beta})\right), (5)

where \eta>0 is a step length, and

\mathrm{prox}(a)=\begin{cases}a-\eta\lambda&\text{ if }a\geq\eta\lambda,\\ 0&\text{ if }a\in(-\eta\lambda,\eta\lambda),\\ a+\eta\lambda&\text{ if }a\leq-\eta\lambda\end{cases}

is the proximal operator (note that the proximal gradient method for the L1 penalty is often called ISTA [9], for which an accelerated variant called FISTA is also known; we employ ISTA here for simplicity). We select the step length \eta by a standard backtracking search.
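The proximal operator above is the usual soft-thresholding operation, and one ISTA step (5) can be sketched as follows. This is a simplified illustration using an explicit AGIS matrix; the fixed step length and names are assumptions (the paper selects \eta by backtracking), and the factor 2 from the squared hinge cancels the 1/2 in (4).

```python
import numpy as np

def soft_threshold(a, thr):
    """prox(a) with threshold thr = eta * lambda (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def ista_step(Psi, y, beta, beta0, eta, lam):
    """One proximal gradient update (5) of beta under the squared hinge loss."""
    margins = np.maximum(1.0 - y * (Psi @ beta + beta0), 0.0)
    grad = Psi.T @ (-y * margins)          # gradient of the loss term for every H
    return soft_threshold(beta - eta * grad, eta * lam)
```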

The bias term \beta_{0} is updated by solving

\min_{\beta_{0}}\ \sum_{i=1}^{n}\max(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}),0)^{2},

which gives the optimal solution of the original problem (4) for the given other variables \bm{\beta} and {\mathcal{Z}}_{\mathcal{H}}. Since the objective function of \beta_{0} is a differentiable convex function, the update rule of \beta_{0} can be derived from the first order condition as

\beta^{\rm(new)}_{0}\leftarrow\frac{\sum_{i\in{\mathcal{I}}^{\rm(new)}}(y_{i}-\bm{\psi}_{i}^{\top}\bm{\beta})}{|{\mathcal{I}}^{\rm(new)}|}, (6)

where {\mathcal{I}}^{\rm(new)}=\{i\mid 1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta^{\rm(new)}_{0})>0\}. This update rule contains \beta^{\rm(new)}_{0} in {\mathcal{I}}^{\rm(new)}. However, it is easy to calculate the update (6) without knowing \beta^{\rm(new)}_{0} beforehand. We omit the details here because it is a simple one dimensional problem (see supplementary appendix A).

For \bm{z}^{H}_{v}\in{\mathcal{Z}}_{{\mathcal{H}}}, we employ the standard gradient descent:

\bm{z}^{H^{\rm(new)}}_{v}\leftarrow\bm{z}^{H}_{v}-\alpha\sum_{i=1}^{n}{\frac{\partial\ell(y_{i},f(G_{i}))}{\partial f(G_{i})}}\times\left(\beta_{H}{\frac{\partial\psi(G_{i};H)}{\partial\bm{z}^{H}_{v}}}\right), (7)

where \alpha>0 is a step length, selected by the standard backtracking search.
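Since the max-pooling in (2) selects a single best matching, the derivative of \psi(G_{i};H) with respect to \bm{z}^{H}_{v} used in (7) can be taken through that matching (a subgradient at ties). A minimal sketch, reusing the illustrative data layout assumed above:

```python
import numpy as np

def agis_grad_z(z_H, z_G, injections, rho):
    """(Sub)gradient of psi(G; H) w.r.t. each z_v^H, taken through the
    best matching m* selected by the max-pooling in (2)."""
    if not injections:  # psi is identically 0, so the gradient is 0
        return {v: np.zeros_like(z) for v, z in z_H.items()}
    def sim(m):
        return np.exp(-rho * sum(np.sum((z_H[v] - z_G[m[v]]) ** 2) for v in z_H))
    m_star = max(injections, key=sim)
    psi = sim(m_star)
    # d/dz_v [exp(-rho * sum_v ||z_v - z_{m*(v)}||^2)] = -2 rho psi (z_v - z_{m*(v)})
    return {v: -2.0 * rho * psi * (z_H[v] - z_G[m_star[v]]) for v in z_H}
```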

2.2.2 Gradient Pruning with Graph Mining

In every update of \bm{\beta}, we incrementally add the required H into {\mathcal{W}}\subseteq{\mathcal{H}}. For the complement set \overline{{\mathcal{W}}}={\mathcal{H}}\setminus{\mathcal{W}}, which contains AGs that have never been updated, we initialize \beta_{H}=0 for H\in\overline{{\mathcal{W}}}. For the initialization of a node attribute vector \bm{z}_{v}^{H}\in{\mathcal{Z}}_{\mathcal{H}}, we use the same initial vector whenever the node (categorical) labels are the same, i.e., \bm{z}_{v}^{H}=\bm{z}_{v^{\prime}}^{H^{\prime}} if L_{v}=L_{v^{\prime}} for \forall H,H^{\prime}\in{\mathcal{H}} (in practice, we use the average of the attribute vectors within each node label). This constraint is required for our pruning criterion, but it only applies to the initial values. After the update (7), all \bm{z}^{H}_{v} can have different values.

Since \beta_{H}=0 for H\in\overline{{\mathcal{W}}}, it is easy to derive the following relation from the proximal update (5):

|g_{H}(\bm{\beta})|\leq\lambda\text{ and }H\in\overline{{\mathcal{W}}}\ \Rightarrow\ 0=\mathrm{prox}\left(\beta_{H}-\eta\ g_{H}(\bm{\beta})\right). (8)

This indicates that if the conditions on the left hold, we do not need to update \beta_{H} because it remains 0. Therefore, we set

{\mathcal{W}}\leftarrow{\mathcal{W}}\cup\left\{H\ \middle|\ |g_{H}(\bm{\beta})|>\lambda,H\in\overline{{\mathcal{W}}}\right\}, (9)

and apply the update (5) only to H\in{\mathcal{W}}. However, evaluating |g_{H}(\bm{\beta})|>\lambda for all H\in\overline{{\mathcal{W}}} can be computationally intractable because it requires enumerating all the possible subgraphs. The following theorem can be used to avoid this difficulty:

Theorem 2.1

Let L(H^{\prime})\sqsupseteq L(H) and H,H^{\prime}\in\overline{{\mathcal{W}}}. Then,

|g_{H^{\prime}}(\bm{\beta})|\leq\overline{g}_{H}(\bm{\beta}),

where

\overline{g}_{H}(\bm{\beta})=\max\Biggl\{\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}>0\}}y_{i}\psi(G_{i};H)(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0})),\ -\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}<0\}}y_{i}\psi(G_{i};H)(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}))\Biggr\},

and {\mathcal{I}}=\{i\mid 1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0})>0\}.

See supplementary appendix B for the proof. Note that here {\mathcal{I}} is defined by the current \beta_{0}, unlike in (6). This theorem indicates that the gradient |g_{H^{\prime}}(\bm{\beta})| for any H^{\prime} whose L(H^{\prime}) contains L(H) as a subgraph can be bounded by \overline{g}_{H}(\bm{\beta}). It should be noted that \overline{g}_{H}(\bm{\beta}) can be calculated without generating H^{\prime}; it mainly needs only the model prediction with the current parameters, \bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}, which is immediately available at each iteration, and the AGIS \psi(G_{i};H). The rule (8) reveals that, to identify \beta_{H^{\prime}}^{\rm(new)}=0, we only need to know whether |g_{H^{\prime}}(\bm{\beta})|\leq\lambda holds, and thus an important consequence of Theorem 2.1 is the following rule:

\overline{g}_{H}(\bm{\beta})\leq\lambda\text{ and }H\in\overline{{\mathcal{W}}}\ \Rightarrow\ |g_{H^{\prime}}(\bm{\beta})|\leq\lambda\text{ for }\forall H^{\prime}\in\{H^{\prime}\mid L(H^{\prime})\sqsupseteq L(H),H^{\prime}\in\overline{{\mathcal{W}}}\}. (10)

Therefore, if the conditions on the left side of (10) hold, any H^{\prime} whose L(H^{\prime}) contains L(H) as a subgraph can be discarded during that iteration. Further, from (7), we can immediately see that the attribute vectors \bm{z}^{H}_{v} for \forall v\in V_{H} also do not need to be updated if \beta_{H}=0. This is an important fact because the updates of a large number of variables can be omitted.

Figure 5 shows an illustration of the forward and backward (gradient) computations of LAGRA. For the gradient pruning, an efficient algorithm can be constructed by combining the rule (10) with a graph mining algorithm. A well-known efficient graph mining algorithm is gSpan [11], which creates a tree by recursively expanding each graph in a tree node as long as the expanded graph is included, as a subgraph, in a given set of graphs. An important characteristic of the mining tree is that every graph must contain any graph of its ancestors as a subgraph. Therefore, during the tree traversal (depth-first search) by gSpan, we can prune the entire subtree (all descendant nodes) if \overline{g}_{H}(\bm{\beta})\leq\lambda holds for the AG H in a tree node (Figure 5(b)). This means that we can update {\mathcal{W}} by (9) without exhaustively investigating all the elements in \overline{{\mathcal{W}}}.
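The following sketch shows how the bound of Theorem 2.1 and the rule (10) could drive the depth-first traversal; the tree-node interface (AGIS vector, gradient, lazily created children) is a hypothetical stand-in for the gSpan machinery.

```python
import numpy as np

def g_bar(psi_H, y, f):
    """Upper bound of Theorem 2.1 for all supergraphs of H.
    psi_H[i] = psi(G_i; H), f[i] = psi_i^T beta + beta_0."""
    resid = (1.0 - y * f) * ((1.0 - y * f) > 0)   # nonzero only on the active set I
    pos = np.sum(np.where(y > 0, y * psi_H * resid, 0.0))
    neg = -np.sum(np.where(y < 0, y * psi_H * resid, 0.0))
    return max(pos, neg)

def gradient_pruning(node, lam, y, f, W):
    """Depth-first traversal of the mining tree with subtree pruning by rule (10)."""
    if g_bar(node.psi, y, f) <= lam:
        return W                      # prune this node and all of its descendants
    if abs(node.g) > lam:             # |g_H(beta)| > lambda: H must be updated
        W.add(node)
    for child in node.children():     # children expanded lazily by gSpan
        gradient_pruning(child, lam, y, f, W)
    return W
```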

Refer to caption
Figure 5: An illustration of LAGRA. (a) In the forward pass, only the paths with |\beta_{H}|>0 contribute to the output. AGIS is defined by the best matching between an input graph and an AG. (b) In the backward pass, the gradient can be pruned when the rule (10) is satisfied. In this illustration, H^{\prime\prime} is pruned, by which the graphs expanded from H^{\prime\prime} are not required to compute the gradient.

gSpan has another advantage for LAGRA. To calculate the feature \phi_{H}(G_{i}), defined in (2), LAGRA requires the set of injections M (an example is shown in Figure 3(b)). gSpan keeps the information of M during the tree traversal because it is required to expand a subgraph in each G_{i} (see the authors' implementation https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm). Therefore, we can directly use the M created by gSpan to calculate (2).

2.2.3 Algorithm

We here describe the entire optimization procedure of LAGRA. We employ the so-called regularization path following algorithm (e.g., [12]), in which the algorithm starts from a large value of the regularization parameter \lambda and gradually decreases it while solving the problem for each \lambda. This strategy can start from a highly sparse \bm{\beta}, for which {\mathcal{W}} usually also becomes small. Further, at each \lambda, the solution obtained for the previous \lambda can be used as the initial value, by which faster convergence can be expected (the so-called warm start).

Algorithm 1 shows the procedure of the regularization path following. We initialize \bm{\beta}=\bm{0}, which is obviously optimal when \lambda=\infty. In line 2 of Algorithm 1, we calculate \lambda_{\max}, at which \bm{\beta} starts having non-zero values: \lambda_{\max}=\max_{H\in{\mathcal{H}}}\left|\sum_{i\in[n]}y_{i}\psi(G_{i};H)(1-y_{i}\beta_{0})\right|, where \beta_{0}=\sum_{i\in[n]}y_{i}/n. See supplementary appendix C for the derivation. \lambda_{\rm max} can also be written as \lambda_{\max}=\max_{H\in{\mathcal{H}}}|g_{H}(\bm{0})|. To find \max_{H\in{\mathcal{H}}}, we can use almost the same gSpan based pruning strategy by using the upper bound \overline{g}_{H}(\bm{\beta}) as shown in Section 2.2.2 (the only difference is that we search for the \max value only, instead of searching for all H satisfying |g_{H}(\bm{\beta})|>\lambda), though in Algorithm 1 this process is omitted for brevity. After setting \lambda_{0}\leftarrow\lambda_{\max}, the regularization parameter \lambda is decreased by using a pre-defined decreasing factor R as shown in line 6 of Algorithm 1. For each \lambda_{1}>\cdots>\lambda_{K}, the parameters \bm{\beta}, \beta_{0} and {\mathcal{Z}}_{\mathcal{H}} are alternately updated as described in Sections 2.2.1 and 2.2.2. We stop the alternate update by monitoring the performance on the validation dataset in line 14 (stopping by thresholding the decrease of the objective function is also possible).

The algorithm of the pruning strategy described in Section 2.2.2 is shown in Algorithm 2. This function recursively traverses the graph mining tree. At each tree node, \overline{g}_{H}(\bm{\beta}) is first evaluated to prune the subtree if possible. Then, if |g_{H}(\bm{\beta})|>\lambda_{k}, H is included in {\mathcal{W}}. The expansion from H (creating children in the graph tree) is performed by gSpan, by which only the subgraphs contained in the training set are generated (see the original paper [11] for the details of gSpan). The initialization of the trainable attribute \bm{z}_{v}^{H^{\prime}} is performed when H^{\prime} is expanded (line 15).

1  function Regularization-Path(K, R, MaxEpoch)
2      H_{0} \leftarrow a graph at the root node of the mining tree
3      {\mathcal{W}} \leftarrow \emptyset
4      \bm{\beta} \leftarrow \bm{0},\ \beta_{0} = \sum_{i\in[n]}y_{i}/n
       \lambda_{0} \leftarrow \lambda_{\max}    // Compute \lambda_{\max}
5      for k = 1,2,\ldots,K do
6          \lambda_{k} \leftarrow R\lambda_{k-1}
7          for epoch = 1,2,\ldots,MaxEpoch do
8              {\mathcal{W}} \leftarrow {\mathcal{W}} \cup GradientPruning(H_{0}, \lambda_{k})
9              Update \bm{\beta} by (5) for H\in{\mathcal{W}}
10             Update \beta_{0} by (6)
11             Update \bm{z}^{H}_{v} by (7) for H\in{\mathcal{W}}
12             val_loss \leftarrow Compute validation loss
13             if val_loss has not been improved in the past q iterations then
14                 break    // Inner loop stopping condition
15             else
16                 {\mathcal{M}}^{(k)} \leftarrow ({\mathcal{W}}, \bm{\beta}, \beta_{0})
17     return \{{\mathcal{M}}^{(k)}\}_{k=0}^{K}
Algorithm 1 Optimization of LAGRA
1  function GradientPruning(H, \lambda_{k})
2      {\mathcal{W}} \leftarrow \emptyset
3      if \overline{g}_{H}(\bm{\beta}) \leq \lambda_{k} then
4          return \emptyset    // Prune the subtree
5      if |g_{H}(\bm{\beta})| > \lambda_{k} then
6          {\mathcal{W}} \leftarrow {\mathcal{W}} \cup \{H\}
7      {\mathcal{C}} \leftarrow CreateChildren(H)
8      for H^{\prime} \in {\mathcal{C}} do
9          {\mathcal{W}} \leftarrow {\mathcal{W}} \cup GradientPruning(H^{\prime}, \lambda_{k})
10     return {\mathcal{W}}

11 function CreateChildren(H)
12     if children of H have never been created by gSpan then
13         {\mathcal{C}} \leftarrow graphs expanded from H by gSpan
14         for H^{\prime} \in {\mathcal{C}} do
15             \bm{z}_{v}^{H^{\prime}} \leftarrow {\rm mean}\{\bm{z}_{v^{\prime}}^{G_{i}} \mid v^{\prime}\in V_{G_{i}},\ L_{v}=L_{v^{\prime}},\ i\in[n]\}
16             Compute \{\psi(G_{i};H^{\prime})\}_{i=1}^{n} using M created by gSpan
Algorithm 2 Gradient Pruning
Table 1: Classification accuracy and the number of selected non-zero \beta_{H} by LAGRA (the bottom row). The average of five runs and its standard deviation are shown. The underlines indicate the best average accuracy for each dataset, and the bold face indicates that the result is comparable with the best method in the sense of a one-sided t-test (significance level 5%). # best indicates how often the method is the best or comparable with the best method.
AIDS BZR COX2 DHFR ENZYMES PROTEINS SYNTHETIC # best
GH 0.9985 ± 0.0020 0.8458 ± 0.0327 0.7872 ± 0.0252 0.7250 ± 0.0113 0.6050 ± 0.0857 0.7277 ± 0.0332 0.6767 ± 0.0655 2
ML 0.9630 ± 0.0062 0.8289 ± 0.0141 0.7787 ± 0.0080 0.7105 ± 0.0300 0.6000 ± 0.0652 0.6205 ± 0.0335 0.4867 ± 0.0356 0
PA 0.9805 ± 0.0086 0.8313 ± 0.0076 0.7809 ± 0.0144 0.7316 ± 0.0435 0.7500 ± 0.0758 0.6884 ± 0.0077 0.5400 ± 0.0859 1
DGCNN 0.9830 ± 0.0046 0.8169 ± 0.0177 0.8021 ± 0.0401 0.7289 ± 0.0192 0.7289 ± 0.0192 0.7509 ± 0.0114 0.9867 ± 0.0125 3
GCN 0.9840 ± 0.0030 0.8290 ± 0.0460 0.8340 ± 0.0257 0.7490 ± 0.0312 0.7000 ± 0.0837 0.6880 ± 0.0202 0.9630 ± 0.0194 2
GAT 0.9880 ± 0.0041 0.8220 ± 0.0336 0.7830 ± 0.0274 0.7110 ± 0.0156 0.7100 ± 0.0768 0.7160 ± 0.0108 0.9800 ± 0.0267 2
LAGRA (Proposed) 0.9900 ± 0.0050 0.8892 ± 0.0207 0.8043 ± 0.0229 0.8171 ± 0.0113 0.6450 ± 0.0797 0.7491 ± 0.0142 1.0000 ± 0.0000 4
# non-zero \beta_{H} 50.4 ± 17.1 52.4 ± 19.0 45.4 ± 14.9 40.0 ± 11.6 7.2 ± 8.4 25.8 ± 9.4 35.8 ± 35.0 -

3 Related Work

For graph-based prediction problems, graph neural networks (GNNs) [13] have recently attracted wide attention. However, interpreting GNNs is not easy in general. According to a recent review of explainable GNNs [14], almost all explainability studies for GNNs are instance-level explanations, which provide input-dependent explanations (we do not discuss the individual input-dependent approaches here because their purpose is clearly different from LAGRA). An exception is XGNN [15], in which important discriminative graphs are generated for an already trained GNN by maximizing the GNN output for a target label. However, unlike our method, the prediction model itself remains a black box, and thus it is difficult to know the underlying dependency between the identified graphs and the prediction.

A classical approach to graph-based prediction problems is the graph kernel [16]. Although a graph kernel itself does not identify important substructures, [17] has recently proposed an interpretable kernel-based GNN, called KerGNN. KerGNN uses a graph kernel function as a trainable filter, inspired by the well-known convolutional networks, and the filter updates the node attributes of the input graph so that they embed similarity to learned important subgraphs. Then, [17] claims that the resulting graph filter can be seen as a key structure. However, a learned subgraph in a graph kernel filter is difficult to interpret: the kernel-based matching does not guarantee the existence of a subgraph, unlike our AGIS (1), and further, only the 1-hop neighbors of each node in the input graph are matched to a graph filter.

Another graph mining based approach is [18]. This approach also uses a pruning based acceleration for the optimization, but it is based on the optimality of a convex problem, while our proximal gradient pruning is applicable to the non-convex problem of LAGRA. Further, and more importantly, [18] cannot deal with continuous attributes. The prediction model of LAGRA is inspired by a method for learning time-series shapelets (LTS) [19]. LTS is also based on a linear combination of trainable shapelets, each of which is a short fragment of a time-series sequence. Unlike time-series data, the possible substructures in graph data have a combinatorial nature, because of which our problem setting has a computational difficulty that does not exist in the case of LTS; LAGRA provides an efficient graph mining based strategy for this difficulty.

4 Experiments

Here, we empirically verify the effectiveness of LAGRA. We used standard graph benchmark datasets, called AIDS, BZR, COX2, DHFR, ENZYMES, PROTEINS and SYNTHETIC, retrieved from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets (for ENZYMES, we only used two classes among the original six classes to make a binary problem). To simplify the comparison, we only used node labels and attributes, and did not use edge labels and attributes. Statistics of the datasets are summarized in supplementary appendix D. The datasets are randomly divided into train:validation:test = 0.6:0.2:0.2. For the regularization path algorithm (Algorithm 1), we created candidate values of \lambda by uniformly dividing [\log(\lambda_{\max}),\log(0.01\lambda_{\max})] into 100 grid points. We selected \lambda, maxpat \in \{5,10\} and \rho \in \{1,0.5,0.1,0.05,0.01\} based on the validation performance.

4.1 Prediction Performance

For the prediction accuracy comparison, we used graph kernels and graph neural networks (GNNs). We used three well-known graph kernels that can handle continuous attributes, i.e., the graph hopper kernel (GH) [20], the multiscale Laplacian kernel (ML) [21] and the propagation kernel (PA) [22], for all of which the library GraKeL [23] was used. For the classifier, we employed k-nearest neighbor (k-NN) classification, for which each kernel function k(G_{i},G_{j}) defines the distance function as \|G_{i}-G_{j}\|=\sqrt{k(G_{i},G_{i})-2k(G_{i},G_{j})+k(G_{j},G_{j})}. The number of neighbors k is optimized by the validation set. For GNNs, we used the deep graph convolutional neural network (DGCNN) [24], the graph convolutional network (GCN) [25], and the graph attention network (GAT) [26]. For DGCNN, the number of hidden units \{64,128,256\} and the number of epochs are optimized by the validation set. The other settings were the default settings of the authors' implementation https://github.com/muhanzhang/pytorch_DGCNN. For GCN and GAT, we also selected the number of hidden units and epochs as above. For the other settings, we followed [27].
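As a side note, the kernel-induced distance used for the k-NN baselines can be computed directly from a precomputed kernel matrix; a small sketch (our own helper, not part of GraKeL):

```python
import numpy as np

def kernel_distance(K):
    """||G_i - G_j|| = sqrt(k(G_i,G_i) - 2 k(G_i,G_j) + k(G_j,G_j))
    for a precomputed kernel matrix K."""
    diag = np.diag(K)
    sq = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(sq, 0.0))   # clip tiny negatives from round-off
```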

The results are shown in Table 1. LAGRA was the best or comparable with the best method (in the sense of a one-sided t-test) for BZR, DHFR, PROTEINS and SYNTHETIC (4 out of 7 datasets). For AIDS and COX2, LAGRA had accuracy values similar to the best methods, though they were not regarded as the best accuracy by the t-test. The three GNNs also showed stable performance overall. Although our main focus is to build an interpretable model, we see that LAGRA achieved accuracy comparable with the current standard methods. Further, LAGRA only used a small number of AGs, as shown in the bottom row of Table 1, which suggests the high interpretability of the learned models.

Refer to caption
Figure 6: Selected important AGs for DHFR dataset. (a) and (b): AGs for the two largest positive coefficients. (c) and (d): AGs for the two largest negative coefficients.
Refer to caption
Figure 7: Selected important AGs for BZR dataset. (a) and (b): AGs for the two largest positive coefficients. (c) and (d): AGs for the two largest negative coefficients.
Refer to caption
Figure 8: Scatter plots defined by selected AGs with the test dataset of DHFR. The horizontal and vertical axes are the AGIS of Fig. 6(a) and Fig. 6(c), respectively.

4.2 Examples of Selected Attributed Graphlets

We here show examples of identified important AGs. Figures 6 and 7 show the AGs having the two largest positive and negative \beta_{H} for the DHFR and BZR datasets, respectively. In each figure, a labeled graphlet L(H) is shown on the left side (the numbers inside the graph nodes are the graph node labels), and the optimized attribute vectors for each node are shown as bar plots on the right side. We can clearly see important substructures not only through the structural information of the graph but also through the attribute values associated with each node.

Surprisingly, in a few datasets, the two classes can be separated even in the two dimensional space of AGIS. Figure 2 and Figure 8 show scatter plots of the test dataset (not the training dataset) with the axes given by the features identified by the LAGRA training. Let H^{+} and H^{-} be the AGs having the largest positive and negative \beta_{H}, respectively. The horizontal and vertical axes of the plots are \psi(G_{i},H^{+}) and \psi(G_{i},H^{-}). In particular, in the AIDS dataset, for which the classification accuracy was very high in Table 1, the two classes are clearly separated. For DHFR, we can also see that points in the two classes tend to be located on the upper left side and the lower right side. The dashed lines are boundaries created by (class-balance weighted) logistic regression fitted to the test points in these two dimensional spaces. The estimated class conditional probability has {\rm AUC}=0.94 and 0.62 for AIDS and DHFR, respectively, which indicates that the differences between the two classes are captured by only two AGs in these datasets.

4.3 Discussion on Computational Time

Finally, we examine the computational time of LAGRA. First, Table 2 shows the size of the candidate AG set |{\mathcal{H}}| in each dataset. As described in Section 2.1.3, this size is equal to |{\mathcal{L}}|, i.e., the number of all the possible subgraphs in the training datasets. Therefore, it can be quite large, as shown in particular for the ENZYMES, PROTEINS and SYNTHETIC datasets in Table 2. The optimization variables in the objective function (4) are \bm{\beta}, \beta_{0} and {\mathcal{Z}}_{\mathcal{H}}. The dimension of \bm{\beta} is |{\mathcal{H}}|, and a node attribute vector \bm{z}_{v}^{H}\in\mathbb{R}^{d} exists for each node of each H\in{\mathcal{H}}. Thus, the number of optimization variables in (4) is 1+|{\mathcal{H}}|+\sum_{H\in{\mathcal{H}}}|H|\times d, which can be prohibitively large.

Table 2: The size of the candidate set {\mathcal{H}} with maxpat 10. For the ENZYMES, PROTEINS, and SYNTHETIC datasets, only lower bounds are shown because counting them took too long.
AIDS BZR COX2 DHFR ENZYMES PROTEINS SYNTHETIC
134281 148903 101185 137872 > 15464000 > 13987000 > 699000

Figure 9 shows the computational time during the regularization path. The horizontal axis is the index k of \lambda_{k} in Algorithm 1. The datasets are AIDS and ENZYMES. In the regularization path algorithm, the number of non-zero \beta_{H} typically increases while \lambda decreases, because the L1 penalty gradually becomes weaker. As a result, in both plots, the total time increases with the \lambda index. Although LAGRA traverses the graph mining tree in every iteration of the gradient update (line 9 in Algorithm 1), Figure 9 shows that the traversal time was not necessarily dominant (note that the vertical axis is in log scale). In particular, when only a small number of tree nodes are newly expanded at that \lambda, the calculation for the tree search becomes faster because the AGIS \psi(G_{i},H) has already been computed at most of the tree nodes. The computational time was at most about 10^{3} sec for these datasets. We do not claim that LAGRA is computationally faster than other standard algorithms (such as graph kernels), but as the computational time of an optimization problem with 1+|{\mathcal{H}}|+\sum_{H\in{\mathcal{H}}}|H|\times d variables, the results clearly indicate the effectiveness of our pruning based optimization approach.

Refer to caption
(a) AIDS
Refer to caption
(b) ENZYMES
Figure 9: Transition of the computational time (sec) along the regularization path. For each \lambda, the total time and the time required to traverse the graph mining tree (in other words, the time required to identify {\mathcal{W}}) are shown separately.

Figure 10 shows the average number of traversed graph mining tree nodes and the size of the selected |{\mathcal{W}}| for AIDS and ENZYMES. In this figure, both values increase as \lambda decreases because the effect of the sparse penalty becomes weaker. As shown in Table 2, the total numbers of tree nodes were 134281 and more than 15464000 for AIDS and ENZYMES, respectively. Figure 10 shows that the numbers of traversed nodes were at most about 6\times 10^{3}/134281\ (\approx 0.05) and 9\times 10^{2}/15464000\ (\approx 6\times 10^{-5}), respectively. This clearly indicates that our pruning strategy can drastically reduce the number of tree nodes at which the evaluation of g_{H}(\bm{\beta}) is required, as shown in Algorithm 2. In the figure, we can also see that |{\mathcal{W}}| was even smaller. This indicates that the updated \bm{\beta} was highly sparse, by which the update of \bm{z}_{v}^{H} becomes easier because it is required only for non-zero \beta_{H}.

Refer to caption
(a) AIDS
Refer to caption
(b) ENZYMES
Figure 10: The average number of visited tree nodes and |{\mathcal{W}}|.

5 Conclusion

This paper proposed LAGRA (Learning Attributed GRAphlets), which learns a prediction model that linearly combines attributed graphlets (AGs). In LAGRA, the graph structures of AGs are generated through a graph mining algorithm, and the attribute vectors are optimized as continuous trainable parameters. To identify a small number of AGs, the L1 sparse penalty is imposed on the coefficients of the AGs. We employed a block coordinate update based optimization algorithm, in which an efficient pruning strategy was proposed by combining the proximal gradient update and the graph mining tree search. Our empirical evaluation showed that LAGRA has performance superior or comparable to standard graph classification algorithms. We further demonstrated that LAGRA can actually identify a small number of discriminative AGs that have high interpretability.

Acknowledgments

This work was partially supported by MEXT KAKENHI (21H03498, 22H00300, 23K17817), and International Joint Usage/Research Project with ICR, Kyoto University (2023-34).

References

  • [1] L. Ralaivola, S. J. Swamidass, H. Saigo, and P. Baldi, “Graph kernels for chemical informatics,” Neural Networks, vol. 18, no. 8, pp. 1093–1110, 2005.
  • [2] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, “Prediction errors of molecular machine learning models lower than hybrid dft error,” Journal of Chemical Theory and Computation, vol. 13, no. 11, pp. 5255–5264, 2017.
  • [3] T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties,” Phys. Rev. Lett., vol. 120, p. 145301, 2018.
  • [4] S.-Y. Louis, Y. Zhao, A. Nasiri, X. Wang, Y. Song, F. Liu, and J. Hu, “Graph convolutional neural networks with global attention for improved materials property prediction,” Phys. Chem. Chem. Phys., vol. 22, pp. 18 141–18 148, 2020.
  • [5] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
  • [6] N. Pržulj, “Biological network comparison using graphlet degree distribution,” Bioinformatics, vol. 23, no. 2, pp. e177–e183, 2007.
  • [7] Y. Xu and W. Yin, “A globally convergent algorithm for nonconvex optimization based on block coordinate update,” Journal of Scientific Computing, vol. 72, pp. 700–734, 2017.
  • [8] M. Teboulle, “A simplified view of first order methods for optimization,” Mathematical Programming, vol. 170, pp. 67–96, 2017.
  • [9] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [10] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” Schedae Informaticae, vol. 25, p. 49, 2016.
  • [11] X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,” in Proceedings. 2002 IEEE International Conference on Data Mining.   IEEE, 2002, pp. 721–724.
  • [12] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, “Pathwise coordinate optimization,” The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 12 2007.
  • [13] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
  • [14] H. Yuan, H. Yu, S. Gui, and S. Ji, “Explainability in graph neural networks: A taxonomic survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [15] H. Yuan, J. Tang, X. Hu, and S. Ji, “XGNN: Towards model-level explanations of graph neural networks,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   New York, NY, USA: Association for Computing Machinery, 2020, pp. 430–438.
  • [16] N. M. Kriege, F. D. Johansson, and C. Morris, “A survey on graph kernels,” Applied Network Science, vol. 5, p. 6, 2020.
  • [17] A. Feng, C. You, S. Wang, and L. Tassiulas, “KerGNNs: Interpretable graph neural networks with graph kernels,” in Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI.   AAAI Press, 2022, pp. 6614–6622.
  • [18] K. Nakagawa, S. Suzumura, M. Karasuyama, K. Tsuda, and I. Takeuchi, “Safe pattern pruning: An efficient approach for predictive pattern mining,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2016, pp. 1785–1794.
  • [19] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, “Learning time-series shapelets,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   New York, NY, USA: Association for Computing Machinery, 2014, p. 392–401.
  • [20] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt, “Scalable kernels for graphs with continuous attributes,” in Advances in Neural Information Processing Systems, 2013, pp. 216–224.
  • [21] R. Kondor and H. Pan, “The multiscale Laplacian graph kernel,” in Advances in Neural Information Processing Systems, 2016, pp. 2990–2998.
  • [22] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine Learning, vol. 102, no. 2, pp. 209–245, 2016.
  • [23] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgiannis, “GraKeL: A graph kernel library in python,” Journal of Machine Learning Research, vol. 21, no. 54, pp. 1–5, 2020.
  • [24] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [25] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.   OpenReview.net, 2017.
  • [26] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.   OpenReview.net, 2018.
  • [27] J. You, J. M. Gomes-Selman, R. Ying, and J. Leskovec, “Identity-aware graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 10 737–10 745.

Supplementary Appendix for
“Learning Attributed Graphlets: Predictive Graph Mining by Graphlets with Trainable Attribute”

Appendix A Update \beta_{0}

The objective for \beta_{0} can be re-written as

\min_{\beta_{0}}\ \sum_{i=1}^{n}\max(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}),0)^{2}=\min_{\beta_{0}}\ \sum_{i\in{\mathcal{I}}}(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}))^{2}. (11)

For simplicity, we assume that all \bm{\psi}_{i}^{\top}\bm{\beta} for i\in[n] have different values (even when this does not hold, the optimal solution can be obtained by the same approach). Depending on \beta^{\rm(new)}_{0}, the elements of {\mathcal{I}}^{\rm(new)}=\{i\mid 1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta^{\rm(new)}_{0})>0\} change in a piecewise constant manner. The points at which {\mathcal{I}}^{\rm(new)} changes are characterized by the solutions of the equations y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0})=1\ (i\in[n]) with respect to \beta_{0}, i.e., there exist n+1 segments on the space of \beta_{0}\in\mathbb{R}. Let B^{(k)}=[\beta^{s_{k}}_{0},\beta^{e_{k}}_{0}] be the k-th segment (k\in\{0,\ldots,n\}) and let {\mathcal{I}}^{(k)} be {\mathcal{I}}^{\rm(new)} when \beta^{\rm(new)}_{0}\in B^{(k)}. Note that \beta^{e_{k}}_{0}=\beta^{s_{k+1}}_{0}\ (k\in[n]), which is the solution of y_{k}(\bm{\psi}_{k}^{\top}\bm{\beta}+\beta_{0})=1, and \beta^{s_{0}}_{0}=-\infty and \beta^{e_{n}}_{0}=\infty. Under the assumption \beta_{0}\in B^{(k)}, the optimal \beta_{0} is

\hat{\beta}^{(k)}_{0}=\frac{\sum_{i\in{\mathcal{I}}^{(k)}}(y_{i}-\bm{\psi}_{i}^{\top}\bm{\beta})}{|{\mathcal{I}}^{(k)}|}.

Since (11) is convex with respect to \beta_{0}, the obtained \hat{\beta}^{(k)}_{0} must be the optimal solution if it satisfies \hat{\beta}^{(k)}_{0}\in B^{(k)}. Thus, the optimal \beta_{0} can be found by calculating \hat{\beta}_{0}^{(k)} for all k\in\{0,\ldots,n\}.
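A direct implementation of this segment-wise search could look as follows; this is a sketch under the assumption that ties are broken arbitrarily, with m_i denoting \bm{\psi}_{i}^{\top}\bm{\beta}.

```python
import numpy as np

def update_beta0(y, m):
    """Exact minimizer of (11) over beta_0.

    The breakpoints y_i - m_i split the real line into n+1 segments; on each
    segment the active set I is fixed and the minimizer has the closed form (6).
    """
    bps = np.sort(y - m)
    reps = np.concatenate([[bps[0] - 1.0],           # a point inside each segment
                           0.5 * (bps[:-1] + bps[1:]),
                           [bps[-1] + 1.0]])
    lows = np.concatenate([[-np.inf], bps])
    highs = np.concatenate([bps, [np.inf]])
    for lo, hi, rep in zip(lows, highs, reps):
        active = 1.0 - y * (m + rep) > 0             # I^(k) on this segment
        if not np.any(active):
            continue
        b0 = np.mean(y[active] - m[active])          # hat{beta}_0^(k)
        if lo <= b0 <= hi:                           # consistent with its segment
            return float(b0)                         # => global optimum by convexity
    return float(np.mean(y - m))                     # fallback for degenerate cases
```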

Appendix B Proof of Theorem 2.1

From the definition of AGIS, the following monotonicity property is guaranteed:

L(H^{\prime})\sqsupseteq L(H)\text{ and }H,H^{\prime}\in\overline{{\mathcal{W}}}\ \Rightarrow\ \psi(G_{i};H)\geq\psi(G_{i};H^{\prime}).

Let M(G_{i};H) be the set of injections M between G_{i} and H. The above monotonicity property can be easily verified from the fact that

\min_{m\in M(G_{i};H)}D_{H,G_{i}}^{(m)}\leq\min_{m\in M(G_{i};H^{\prime})}D_{H^{\prime},G_{i}}^{(m)},

where D_{H,G_{i}}^{(m)}=\sum_{v\in V_{H}}\|\bm{z}^{H}_{v}-\bm{z}^{G_{i}}_{m(v)}\|^{2} is the sum of squared distances in the exponent of \mathrm{Sim}. Define

\begin{cases}a_{i}=y_{i}(1-y_{i}(\bm{\psi}_{i}^{\top}\bm{\beta}+\beta_{0}))&\mathrm{if}~~{i\in{\mathcal{I}}},\\ a_{i}=0&\mathrm{otherwise}.\end{cases}

Note that the sign of a_{i} is the same as that of y_{i}. Using a_{i}, we re-write g_{H^{\prime}}(\bm{\beta}) as

g_{H^{\prime}}(\bm{\beta})=\sum_{i\in{\mathcal{I}}}a_{i}\psi(G_{i};H^{\prime})=\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}>0\}}a_{i}\psi(G_{i};H^{\prime})+\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}<0\}}a_{i}\psi(G_{i};H^{\prime}).

From the monotonicity inequality \psi(G_{i};H)\geq\psi(G_{i};H^{\prime}), we see that

\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}<0\}}a_{i}\psi(G_{i};H)\leq\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}<0\}}a_{i}\psi(G_{i};H^{\prime})\leq g_{H^{\prime}}(\bm{\beta})\leq\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}>0\}}a_{i}\psi(G_{i};H^{\prime})\leq\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}>0\}}a_{i}\psi(G_{i};H).

From these inequalities, we obtain

|g_{H^{\prime}}(\bm{\beta})|\leq\max\left\{\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}>0\}}a_{i}\psi(G_{i};H),-\sum_{i\in{\mathcal{I}}\cap\{i\mid y_{i}<0\}}a_{i}\psi(G_{i};H)\right\}.

Appendix C Derivation of \lambda_{\max}

When \bm{\beta}=\bm{0}, the objective function for \beta_{0} is written as the following piecewise quadratic function:

\min_{\beta_{0}}\frac{1}{2}\sum_{i\in{\mathcal{I}}(\beta_{0})}(1-y_{i}\beta_{0})^{2}~~\mathrm{s.t.}\ {\mathcal{I}}(\beta_{0})=\begin{cases}\{i\mid y_{i}>0\}&\beta_{0}\leq-1,\\ [n]&\beta_{0}\in[-1,1],\\ \{i\mid y_{i}<0\}&\beta_{0}\geq 1.\end{cases}

In the region \beta_{0}\leq-1, the minimum value is achieved at \beta_{0}=-1, and for \beta_{0}\geq 1, the minimum value is achieved at \beta_{0}=1. This indicates that the optimal solution must lie in \beta_{0}\in[-1,1] because the objective function is a smooth convex function. Therefore, the minimum in the region \beta_{0}\in[-1,1], achieved at \beta_{0}=\sum_{i\in[n]}y_{i}/n, which we denote \bar{y}, is the optimal solution.

When \bm{\beta}=\bm{0} and \beta_{0}=\bar{y}, we obtain

g_{H}(\bm{0})=\sum_{i\in[n]}y_{i}\psi(G_{i};H)(1-y_{i}\bar{y}).

From (8), we see that |g_{H}(\bm{0})|=\lambda is the threshold at which \beta_{H} starts having a non-zero value. This means that the H having the maximum |g_{H}(\bm{0})| is the first H that starts having a non-zero value as \lambda decreases from \infty. Therefore, we obtain

\lambda_{\max}=\max_{H\in{\mathcal{H}}}\left|\sum_{i\in[n]}y_{i}\psi(G_{i};H)(1-y_{i}\bar{y})\right|.
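For a candidate set small enough to materialize its AGIS matrix, \lambda_{\max} can be computed directly as below; in LAGRA itself the maximum is instead searched on the mining tree with the bound of Section 2.2.2, so this is only an illustrative check (variable names are ours).

```python
import numpy as np

def lambda_max(Psi, y):
    """lambda_max = max_H |sum_i y_i psi(G_i; H) (1 - y_i * ybar)|, ybar = mean(y)."""
    ybar = np.mean(y)
    g0 = Psi.T @ (y * (1.0 - y * ybar))
    return float(np.max(np.abs(g0)))
```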

Appendix D Statistics of datasets

Statistics of the datasets are shown in Table 3.

Table 3: Statistics of datasets.
AIDS BZR COX2 DHFR ENZYMES PROTEINS SYNTHETIC
# instances 2000 405 467 756 200 1113 300
Dim. of attribute vector d 4 3 3 3 18 1 1
Avg. # nodes 15.69 35.75 41.22 42.43 32.58 39.06 100.00
Avg. # edges 16.20 38.36 43.45 44.54 60.78 72.82 196.00

Appendix E Additional Examples of Selected AGs

Figure 11 shows selected AGs for the AIDS dataset.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 11: Selected important AGs for AIDS dataset. (a) and (b): AGs for the two largest positive coefficients. (c) and (d): AGs for the two largest negative coefficients.