KGNN: Harnessing Kernel-based Networks
for Semi-supervised Graph Classification
Abstract.
This paper studies semi-supervised graph classification, which is an important problem with various applications in social network analysis and bioinformatics. This problem is typically solved by using graph neural networks (GNNs), which rely on a large number of labeled graphs for training and are unable to leverage unlabeled graphs. We address these limitations by proposing the Kernel-based Graph Neural Network (KGNN). A KGNN consists of a GNN-based network as well as a kernel-based network parameterized by a memory network. The GNN-based network performs classification by learning graph representations that implicitly capture the similarity between query graphs and labeled graphs, while the kernel-based network uses graph kernels to explicitly compare each query graph with all the labeled graphs stored in a memory for prediction. The two networks are motivated from complementary perspectives, and thus combining them allows KGNN to use labeled graphs more effectively. We jointly train the two networks by maximizing their agreement on unlabeled graphs via posterior regularization, so that the unlabeled graphs serve as a bridge through which the two networks mutually enhance each other. Experiments on a range of well-known benchmark datasets demonstrate that KGNN achieves impressive performance over competitive baselines.
1. Introduction
Graph-structured data are ubiquitous in a wide range of domains. Examples include social networks (Song et al., 2019), biological reaction networks (Pavlopoulos et al., 2011), and molecules (Sun et al., 2020). For graph-structured data, one fundamental problem is graph classification, which aims at analyzing and predicting the property of an entire graph. This problem has various downstream applications, including predicting the properties of molecules (Hao et al., 2020) and analyzing the functionality of compounds (Kojima et al., 2020).
Graph classification is typically formalized as a supervised learning task, and many recent works propose to use graph neural networks (GNNs) (Ying et al., 2018; Lee et al., 2019; Li et al., 2019) to solve the problem. The basic idea is to learn effective graph representations with nonlinear message passing schemas. At each step, a node receives messages from all its neighbors, which are further aggregated to update the node representation. Finally, a readout function is applied to integrate all the node representations into a representation of the whole graph. With this shared message passing framework, the learned graph representations can implicitly capture the similarity between query graphs and labeled graphs in the latent space. Despite the good performance, GNNs usually require a large amount of labeled data for training and fail to leverage unlabeled data. However, data annotation often requires domain experts, which is highly expensive, especially in specific domains such as biomedicine (Hao et al., 2020).
This motivates us to study semi-supervised graph classification, i.e., using both labeled and unlabeled data for graph classification. The unlabeled data serve as a regularizer, which helps a model better explore the inherent graph semantics even with a limited amount of labeled data. Indeed, there are a handful of works along this line (Li et al., 2019; Sun et al., 2020; Hao et al., 2020), which typically employ semi-supervised learning techniques to train GNN models, usually by integrating self-training (Lee et al., 2013) or knowledge distillation (Hinton et al., 2014) into GNNs. However, these methods suffer from two key limitations. (1) Unable to fully explore graph similarity. Graph classification relies on comparing query graphs with labeled graphs. Existing methods (Li et al., 2019; Sun et al., 2020; Hao et al., 2020) typically compute graph similarity implicitly by projecting graphs into a latent space with a GNN encoder, and such implicit methods often cannot fully exploit the similarity between graphs. (2) Suffering from labeled data scarcity. Experimental results show that the performance of existing methods is still unsatisfactory, especially when labeled data are very scarce, because these methods are unable to obtain high-quality annotated data to improve model training. Therefore, we seek an approach that better captures the relationships among graphs and at the same time overcomes the challenge of scarce labeled data.
In this paper, we propose such a method, called the Kernel-based Graph Neural Network (KGNN). The key idea of KGNN is to enhance GNNs with graph kernels, which explicitly measure the similarity between graphs. To leverage graph kernels effectively, we introduce two modules in a KGNN, i.e., a GNN-based network and a kernel-based network. The GNN-based network is parameterized by existing GNNs and uses message passing mechanisms to learn useful graph representations for graph classification. In contrast, the kernel-based network employs a memory network, where the memory stores the given labeled graphs. Given a query graph, we compare it with all the graphs in the memory using graph kernels, and further integrate the labels of the most similar graphs to predict the label of the query graph.
The GNN-based network and the kernel-based network explore graph similarity from different angles, i.e., message passing and graph kernels respectively. Although they are naturally complementary, jointly training the two networks so that they can distill knowledge from each other is nontrivial. We solve this problem with a novel posterior regularization framework (Ganchev et al., 2010), which encourages both networks to collaborate and maximize their agreement on unlabeled data. Each training iteration of posterior regularization consists of two steps. In the first step, the kernel-based network is updated by projecting the GNN-based network into a regularized subspace, yielding a stronger kernel-based network. In the second step, we distill the knowledge learned by the kernel-based network into the GNN-based network, allowing the GNN to better explore graph similarity and to overcome the challenge caused by the scarcity of labeled data.
To summarize, the main contributions of this work are as follows:
- We propose a novel approach for semi-supervised graph classification, which consists of a GNN-based network and a kernel-based network to fully capture graph similarity and overcome the scarcity of labeled data.
- We develop a novel posterior regularization framework to combine the advantages of graph neural networks and graph kernels in a principled way, such that they can mutually enhance each other by optimizing the two modules with an EM-style algorithm.
- We conduct extensive experiments on a range of well-known benchmarks to demonstrate the effectiveness of our KGNN.
2. Related Work
2.1. Graph Neural Networks
Our work is related to graph neural networks (GNNs), which are able to learn useful graph representations via supervised learning. Most existing methods (Xu et al., 2019; Ying et al., 2018; Lee et al., 2019; Khasahmadi et al., 2020) inherently use a message-passing neural network (Gilmer et al., 2017) to learn node representations, and then aggregate them into graph representations. Because of their effectiveness in learning graph representations, these methods achieve state-of-the-art results. However, graph representations derived from GNNs often fail to effectively leverage graph similarity for prediction. Besides, these methods rely on a large amount of labeled data for training. With KGNN, besides learning effective graph representations with GNNs, we also benefit from graph kernels to explicitly incorporate graph similarity in a semi-supervised framework.

2.2. Graph Kernels
Kernels on graphs were originally introduced by (Gärtner et al., 2003; Kashima et al., 2003). The basic idea is to decompose graphs into atomic substructures and use kernel functions to capture graph similarity, which can then be used in kernel methods for graph classification. Many early methods (Gärtner et al., 2003; Kashima et al., 2003) define feature vectors as counts of label paths obtained via random walks or shortest paths. However, these methods do not scale to large graphs and suffer from high computational complexity. To alleviate these weaknesses, many later graph kernels are designed based on limited-sized substructures (Shervashidze et al., 2009) or neighborhood aggregation (Shervashidze et al., 2011). Some recent studies also try to combine graph kernels and graph neural networks (Du et al., 2019; Chen et al., 2020). GCKNs (Chen et al., 2020) connect the two families of methods via multilayer graph kernels. GNTKs (Du et al., 2019) prove that their kernels are equivalent to infinitely wide GNNs trained by gradient descent. Similar to these methods, our proposed KGNN also combines the advantages of both worlds, but from a different view. Specifically, our approach not only imposes graph kernels as structured constraints to guide the training of GNNs, but also uses unlabeled data to assist model training by maximizing the agreement of the two networks on unlabeled data, whereas these works can only leverage labeled data in a supervised way.
2.3. Semi-supervised Learning
Another category of related work is semi-supervised learning. The self-training method is one of the earliest ideas along this line: a trained classifier periodically predicts class labels for the unlabeled data, and confidently classified samples are added to the training set. Relatedly, the entropy minimization method (Grandvalet and Bengio, 2005) requires the classifier to output low-entropy predictions on unlabeled data. Another line of research focuses on perturbation-based approaches, whose core idea is to encourage consistency between the outputs for an input and its noise-perturbed version, which is known as consistency regularization. Among them, the temporal ensembling model (Laine and Aila, 2017) maintains an exponential moving average of label predictions, while the mean teacher model (Tarvainen and Valpola, 2017) averages network parameters to obtain a stable target output.
Besides, some recent studies (Li et al., 2019; Sun et al., 2020; Hao et al., 2020) have been proposed for semi-supervised graph classification. SEAL-AI (Li et al., 2019) approaches the problem from a hierarchical graph perspective related to self-training. InfoGraph (Sun et al., 2020) and ASGN (Hao et al., 2020) learn graph representations via contrastive learning and active learning respectively, both of which can be boiled down to the teacher-student framework. Compared with existing works, our work focuses on combining the advantages of graph neural networks and graph kernels via posterior regularization, incorporating graph similarity from implicit and explicit perspectives respectively, whereas these works do not enhance GNNs from the view of topological similarity.
3. Problem Definition & Preliminary
3.1. Problem Definition
Definition: Semi-supervised Graph Classification. Let $G = (\mathcal{V}, \mathcal{E})$ represent a graph, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. We use $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$ to denote the node feature matrix, where $d$ is the dimension of the features. For semi-supervised graph classification, let $\mathcal{G} = \mathcal{G}_L \cup \mathcal{G}_U$ denote a set of graphs, in which $\mathcal{G}_L$ are labeled graphs and $\mathcal{G}_U$ are unlabeled graphs. Additionally, we use $\mathbf{y}_L$ to represent the labels corresponding to $\mathcal{G}_L$. The goal of semi-supervised graph classification is to learn a label distribution $p(y \mid G)$ that can assign labels to the unlabeled graphs $\mathcal{G}_U$.
3.2. Graph Kernels
Given two graphs $G_1$ and $G_2$, the graph kernel $K(G_1, G_2)$ measures the similarity between them and is defined as:

$$K(G_1, G_2) = \sum_{v_1 \in \mathcal{V}_1} \sum_{v_2 \in \mathcal{V}_2} k_{\text{base}}(v_1, v_2) = \big\langle \phi(G_1),\, \phi(G_2) \big\rangle_{\mathcal{H}}, \qquad (1)$$

where the base kernel $k_{\text{base}}(\cdot,\cdot)$, i.e., the inner product on the Hilbert space $\mathcal{H}$, compares substructures centered at nodes $v_1$ and $v_2$, and $\phi(G)$ is the feature vector counting the number of appearances of each substructure (e.g., subtrees, paths, graphlets) in the graph $G$.
Afterward, the matrix of pairwise similarities among graphs defined by graph kernels can be passed to kernel-based methods such as the Support Vector Machine to perform graph classification.
Weisfeiler-Lehman Kernel. Inspired by the Weisfeiler-Lehman test of graph isomorphism (Weisfeiler and Lehman, 1968), the Weisfeiler-Lehman kernel (Shervashidze et al., 2011) follows an iterative relabeling process: in each iteration, the label of each node is augmented with the sorted set of labels of its neighbors, and the augmented labels are then compressed into new labels. The final feature vector for each graph is the concatenation of the label counts from each iteration.
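To make the relabeling and counting procedure concrete, the sketch below implements a minimal Weisfeiler-Lehman subtree feature extractor and evaluates the kernel in Eq. (1) as an inner product of label-count vectors. The graph encoding (adjacency lists plus a node-label dictionary) and the helper names `wl_features` and `wl_kernel` are our own illustrative choices, not the implementation used in this paper.

```python
from collections import Counter

def wl_features(adj, labels, num_iters=3):
    """Weisfeiler-Lehman subtree features: counts of (compressed) node labels
    accumulated over num_iters relabeling iterations.
    adj: dict node -> list of neighbors; labels: dict node -> initial label."""
    feats = Counter(labels.values())        # iteration 0: counts of the original labels
    current = dict(labels)
    for _ in range(num_iters):
        new = {}
        for v in adj:
            # Augment the label of v with the sorted multiset of its neighbors' labels;
            # the resulting tuple plays the role of the compressed new label.
            new[v] = (current[v], tuple(sorted(current[u] for u in adj[v])))
        current = new
        feats.update(current.values())
    return feats

def wl_kernel(f1, f2):
    """Kernel value = inner product of the two label-count vectors."""
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())

# Toy example: a triangle vs. a path, with all nodes sharing the same initial label.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "a", 1: "a", 2: "a"}
f_tri, f_path = wl_features(triangle, labels), wl_features(path, labels)
print(wl_kernel(f_tri, f_tri), wl_kernel(f_tri, f_path))
```

The resulting matrix of pairwise kernel values can then be handed to a kernel machine as described above, e.g., scikit-learn's SVC with kernel='precomputed'.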
3.3. Posterior Regularization
Posterior regularization (PR) (Ganchev et al., 2010) is a principled framework to impose constraints from a desired distribution $q$ on probabilistic models as follows:

$$J(\theta, q) = \mathcal{L}(\theta) - \min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(Y) \,\|\, p_\theta(Y \mid X)\big), \qquad (2)$$

where $\mathcal{L}(\theta)$ is the log-likelihood of the model $p_\theta$, and $\mathcal{Q}$ is a constraint set with respect to the expectations of constraint features $\phi(X, Y)$ that are bounded by $\mathbf{b}$:

$$\mathcal{Q} = \big\{ q(Y) : \mathbb{E}_{q}\big[\phi(X, Y)\big] \le \mathbf{b} \big\}. \qquad (3)$$

To optimize the posterior regularized likelihood $J(\theta, q)$, PR presents an iterative procedure akin to an EM-style algorithm:

$$q^{(t+1)} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(Y) \,\|\, p_{\theta^{(t)}}(Y \mid X)\big), \qquad (4)$$
$$\theta^{(t+1)} = \arg\max_{\theta}\ \mathcal{L}(\theta) + \mathbb{E}_{q^{(t+1)}}\big[\log p_\theta(Y \mid X)\big]. \qquad (5)$$

For better interpretation, this can be viewed as performing coordinate ascent on $J(\theta, q)$. Starting from an initial parameter estimate $\theta^{(0)}$, the algorithm alternates the two block-coordinate ascent steps until a convergence criterion is reached.
4. KGNN: Kernel-based Graph Neural Network

4.1. Overview
In this paper, we introduce our approach for semi-supervised graph classification by incorporating both labeled and unlabeled data. Existing methods typically combine graph neural networks with semi-supervised learning techniques. However, these methods are not able to fully leverage graph similarity information for classification, which leads to insufficient expressiveness. Also, their performance is not yet satisfactory when the labeled data are limited.
We address the above limitations by proposing the Kernel-based Graph Neural Network (KGNN), which combines methods based on graph kernels and graph neural networks. The idea is to use graph kernels as structured constraints to guide the training of graph neural networks via posterior regularization, allowing the graph neural network to explicitly exploit graph similarity and receive additional supervision to boost performance. In particular, a KGNN contains a kernel-based network and a GNN-based network, and we use an EM-style algorithm to jointly train both networks. In the E-step, the kernel-based network, realized as a memory network, is updated by projecting the GNN-based network onto a subspace constrained by graph similarity. In the M-step, the GNN-based network is improved by distilling knowledge from the kernel-based network. In this way, the two networks mutually enhance each other. An illustration of the framework is presented in Figure 2. In the following, we first introduce the GNN-based network and the kernel-based network, and then explain the training algorithm.
4.2. GNN-based Network
Due to the appealing representation learning ability, graph neural networks have been widely used in graph classification tasks. For example, InfoGraph (Sun et al., 2020) first learns node representations with GIN (Xu et al., 2019), and then summarizes the node representations into the graph-level representation, which serves as graph features for the classification task. Note that in InfoGraph (Sun et al., 2020), the label dependency between different graphs is ignored. We follow the same idea and factorize the joint label distribution as:
$$p_\theta(y_1, \ldots, y_n \mid G_1, \ldots, G_n) = \prod_{i=1}^{n} p_\theta(y_i \mid G_1, \ldots, G_n). \qquad (6)$$
Furthermore, to learn the graph-level representation of a graph $G_i$, GNN-based methods rely solely on its node attributes $\mathbf{X}_i$ and graph structure $(\mathcal{V}_i, \mathcal{E}_i)$. In other words, the prediction is independent of the other graphs, and we therefore define the conditional probability of the label $y_i$ for graph $G_i$ as follows:

$$p_\theta(y_i \mid G_1, \ldots, G_n) = p_\theta(y_i \mid G_i). \qquad (7)$$
To predict the graph label $y_i$, a graph neural network first learns node representations by aggregating and updating messages from node neighborhoods. Based on the node representations, a READOUT function summarizes them into a representation of the whole graph. This procedure can be formalized as follows:

$$\mathbf{h}_v^{(k)} = U^{(k)}\Big(\mathbf{h}_v^{(k-1)},\, A^{(k)}\big(\{\mathbf{h}_u^{(k-1)} : u \in \mathcal{N}(v)\}\big)\Big), \qquad \mathbf{h}_{G} = \mathrm{READOUT}\big(\{\mathbf{h}_v^{(K)} : v \in \mathcal{V}\}\big), \qquad (8)$$

where $\mathbf{h}_v^{(k)}$ is the feature vector of node $v$ at the $k$-th layer, $\mathbf{h}_v^{(0)}$ is initialized with the node features (i.e., $\mathbf{h}_v^{(0)} = \mathbf{x}_v$), and $\mathcal{N}(v)$ is the neighborhood of node $v$. The message-passing phase runs for $K$ steps and is defined in terms of aggregation functions $A^{(k)}$ and node updating functions $U^{(k)}$, which have a series of possible implementations (Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2017). After a permutation-invariant function such as a pooling function (Ying et al., 2018; Lee et al., 2019; Khasahmadi et al., 2020), the final graph-level representation is fed into an MLP (Multi-Layer Perceptron) classifier for classification. In this paper, we follow InfoGraph (Sun et al., 2020) and use GIN (Xu et al., 2019) to learn node representations for its high expressive power, and then use the sum operation to produce the graph-level representations (i.e., as the READOUT function).
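As a concrete illustration of Eq. (8), the PyTorch sketch below stacks GIN-style layers, where each node sums its neighbors' features, adds its own features scaled by $(1+\epsilon)$, and applies an MLP; a sum readout then yields the graph-level representation fed to the classifier. The class names, dense adjacency input, and layer sizes are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN-style layer: h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h: [num_nodes, dim]; adj: dense [num_nodes, num_nodes] adjacency matrix
        neighbor_sum = adj @ h                      # aggregate messages from neighbors
        return self.mlp((1 + self.eps) * h + neighbor_sum)

class GNNNetwork(nn.Module):
    """GNN-based network p_theta: GIN layers + sum readout + MLP classifier."""
    def __init__(self, in_dim, hidden_dim, num_classes, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)
        self.layers = nn.ModuleList([GINLayer(hidden_dim) for _ in range(num_layers)])
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, adj):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h, adj)
        h_graph = h.sum(dim=0)                      # READOUT: sum over all nodes
        return self.classifier(h_graph)             # class logits for the whole graph

# Usage on a toy 3-node graph with 5-dimensional node features.
x = torch.randn(3, 5)
adj = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
logits = GNNNetwork(in_dim=5, hidden_dim=32, num_classes=2)(x, adj)
```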

4.3. Kernel-based Network
Graph neural networks can effectively utilize node attributes and local graph structures under the message-passing framework; however, they are insufficient in capturing more global graph topology and ignore the relations between different graphs. This motivates us to additionally incorporate graph kernels, which have been proven useful for encoding graph topology and similarity information (Kriege et al., 2020).
Traditional kernel-based methods perform graph classification in a two-step manner, where the similarity features are computed separately from the training phase and are not optimized for the downstream task; hence the performance is usually unsatisfactory. To address this limitation, we propose a kernel-based network and implement it with a memory network, which shares a similar idea with classical kernel-based methods (Nowicki and Siegelmann, 2010) but has larger model capacity and can be optimized in an end-to-end manner. Moreover, memory networks can be naturally and effectively coupled with kernel tricks (García and Moreno, 2004), and are introduced here to compare graph-graph similarity via graph kernels for prediction.
In detail, under the label independence assumption of Eq. (6), we model the conditional probability as follows:

$$q_\phi(Y \mid \mathcal{G}) = \prod_{G_i \in \mathcal{G}} q_\phi\big(y_i \mid G_i, \mathcal{M}\big), \qquad (9)$$

where the memory $\mathcal{M}$ consists of the labeled data set $\mathcal{G}_L$, represented as the set of feature vectors $\{\phi(G_j)\}_{G_j \in \mathcal{G}_L}$ derived from graph kernels (see Sec. 3.2). Each graph in the memory is then converted into memory vectors $\{\mathbf{m}_j^{k}\}$ by a set of trainable embedding matrices $\{\mathbf{A}^{k}\}$, where each $\mathbf{A}^{k}$ maps the memory content into a continuous space at hop $k$. The query graph $G_i$ (e.g., one of the labeled graphs) is also encoded via another trainable embedding matrix $\mathbf{B}$ to obtain a hidden representation $\mathbf{u}^{1} = \mathbf{B}\,\phi(G_i)$, which serves as a reading head. Afterward, we measure the similarity between $\mathbf{u}^{k}$ and each memory vector $\mathbf{m}_j^{k}$ by the inner product, corresponding to the graph kernel computed on the Hilbert space, followed by a softmax function. The model loops over $K$ hops, and the attention weight at hop $k$ for each memory is computed as:

$$p_j^{k} = \mathrm{Softmax}\big((\mathbf{u}^{k})^{\top} \mathbf{m}_j^{k}\big), \qquad (10)$$

where $\mathrm{Softmax}(z_j) = e^{z_j} / \sum_{j'} e^{z_{j'}}$. Here, $p_j^{k}$ is a soft memory selector that decides the relevance of the $j$-th memory with respect to $\mathbf{u}^{k}$. Then, the model reads out the memory by the weighted sum over the output memory vectors $\mathbf{c}_j^{k}$ (here $\mathbf{c}_j^{k}$ is $\mathbf{m}_j^{k+1}$ since we use adjacent weight tying (Sukhbaatar et al., 2015)):

$$\mathbf{o}^{k} = \sum_{j} p_j^{k}\, \mathbf{c}_j^{k}. \qquad (11)$$

Then, the query vector is updated for the next hop using $\mathbf{u}^{k+1} = \mathbf{u}^{k} + \mathbf{o}^{k}$. The above steps lead to the final graph representation $\mathbf{u}^{K+1}$, which is fed into an MLP classifier for prediction as follows:

$$q_\phi(y_i \mid G_i, \mathcal{M}) = \mathrm{softmax}\big(\mathrm{MLP}(\mathbf{u}^{K+1})\big). \qquad (12)$$
In this way, the representation of a query graph is combined with similarity features over multiple graphs from the memory, and the multi-hop attention mechanism helps in learning correlations between memories. Thus, the derived kernel-based network enjoys high flexibility and is able to effectively exploit graph topological similarity. An illustration of the kernel-based network is presented in Figure 3.
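The PyTorch sketch below shows one way to realize Eqs. (9)-(12): Weisfeiler-Lehman feature vectors of the labeled graphs form the memory, the query graph's feature vector is embedded as the reading head, and multi-hop attention produces the representation fed to the classifier. The class name, the way adjacent weight tying is emulated with one embedding matrix per hop plus one extra, and the dimensions are our own assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelMemoryNetwork(nn.Module):
    """Kernel-based network q_phi: multi-hop attention over a memory that stores
    the graph-kernel feature vectors of the labeled graphs."""
    def __init__(self, feat_dim, hidden_dim, num_classes, num_hops=3):
        super().__init__()
        self.num_hops = num_hops
        # One embedding matrix per hop plus one extra, so that the output memory
        # at hop k can be reused as the input memory at hop k+1 (adjacent tying).
        self.mem_embeds = nn.ModuleList(
            [nn.Linear(feat_dim, hidden_dim, bias=False) for _ in range(num_hops + 1)])
        self.query_embed = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, query_feat, memory_feats):
        # query_feat: [feat_dim] WL feature vector of the query graph
        # memory_feats: [num_labeled, feat_dim] WL feature vectors of labeled graphs
        u = self.query_embed(query_feat)                 # reading head u^1
        for k in range(self.num_hops):
            m = self.mem_embeds[k](memory_feats)         # input memory m^k
            c = self.mem_embeds[k + 1](memory_feats)     # output memory c^k = m^{k+1}
            p = F.softmax(m @ u, dim=0)                  # attention weights p^k (Eq. 10)
            o = p @ c                                    # weighted read-out o^k (Eq. 11)
            u = u + o                                    # query update for the next hop
        return self.classifier(u)                        # class logits (Eq. 12)

# Usage: 10 labeled graphs in the memory, WL feature dimension 100 (illustrative numbers).
memory = torch.randn(10, 100)
query = torch.randn(100)
logits = KernelMemoryNetwork(feat_dim=100, hidden_dim=64, num_classes=2)(query, memory)
```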
4.4. Optimization Framework with PR
Graph kernels offer effective prior knowledge to explicitly capture graph similarity. To inject this knowledge into graph neural networks, we utilize posterior regularization (PR), which provides a principled way to impose structured constraints on the posterior distribution of probabilistic models. However, PR typically uses a fully-specified constraint set (see Eq. (3)), which is often sub-optimal. To address this limitation, we represent the desired distribution $q$ in Eq. (2) with the kernel-based network described in Section 4.3, which leads to a more flexible and efficient training algorithm. For simplicity, the GNN-based network and the kernel-based network will be written as $p_\theta$ and $q_\phi$ respectively, with a slight abuse of notation, in the rest of the paper.
Based on this, the kernel-based network enables learnable constraints and serves as a good complement to the GNN-based network for incorporating topology information. To this end, we formulate the training objective as:

$$J(\theta, \phi) = \mathcal{L}(\theta) - \mathrm{KL}\big(q_\phi(Y_U \mid \mathcal{G}_U)\,\|\,p_\theta(Y_U \mid \mathcal{G}_U)\big), \qquad (13)$$

where $\mathcal{L}(\theta)$ is the log-likelihood of $p_\theta$ on the labeled graphs and the KL term measures the agreement of the two networks on the unlabeled graphs $\mathcal{G}_U$.
Note that another benefit of modeling the kernel-based network as the memory network is the differentiability of our training objective Eq. (13). It is easy to use stochastic gradient descent algorithms to optimize model parameters. Next, we will give the specific training algorithm for updating parameters in Section 4.5.
4.5. Training Algorithm
In the training phase, our goal is to maximize the posterior regularized likelihood in Eq. (13), which can be optimized by an iterative procedure akin to an EM-style algorithm that alternates between an E-step and an M-step. In the E-step, the goal is to optimize the kernel-based network $q_\phi$ by minimizing the KL divergence between the kernel-based network and the GNN-based network. In the M-step, we optimize the GNN-based network $p_\theta$ using the original likelihood of the model together with additional guidance derived from the kernel-based network. Next, we introduce the details of the E-step and the M-step.
Input: Labeled graphs $\mathcal{G}_L$ with labels $\mathbf{y}_L$; unlabeled graphs $\mathcal{G}_U$.
Parameters: Model parameters $\theta$ and $\phi$.
Output: Jointly learned $\theta$ and $\phi$.
E-step. In this step, we optimize the parameter $\phi$. Recall that the optimization of the kernel-based network is performed by minimizing the KL divergence in Eq. (4). However, directly optimizing this objective requires dealing with the partition function of $q_\phi$, which can be difficult. To address this restriction, we propose to instead minimize the reverse KL divergence:

$$\min_{\phi}\ \mathrm{KL}\big(p_\theta(Y_U \mid \mathcal{G}_U)\,\|\,q_\phi(Y_U \mid \mathcal{G}_U)\big) = \min_{\phi}\ -\mathbb{E}_{p_\theta(Y_U \mid \mathcal{G}_U)}\big[\log q_\phi(Y_U \mid \mathcal{G}_U)\big] + \text{const}. \qquad (14)$$

Besides, we notice that $q_\phi$ is a standard discriminative classifier and can also be trained with the labeled data. Therefore, we optimize $\phi$ by maximizing the following objective function:

$$\mathcal{L}_{\mathrm{E}}(\phi) = \mathbb{E}_{p_\theta(Y_U \mid \mathcal{G}_U)}\big[\log q_\phi(Y_U \mid \mathcal{G}_U)\big] + \sum_{G_i \in \mathcal{G}_L} \log q_\phi(y_i \mid G_i). \qquad (15)$$
M-step. In this step, we optimize the parameter $\theta$. More specifically, besides optimizing the original likelihood of the model $p_\theta$ on the labeled data, we fix $\phi$ and use the kernel-based network to provide additional guidance on the unlabeled data. The parameter $\theta$ is then updated by maximizing the following overall objective function:

$$\mathcal{L}_{\mathrm{M}}(\theta) = \sum_{G_i \in \mathcal{G}_L} \log p_\theta(y_i \mid G_i) + \mathbb{E}_{q_\phi(Y_U \mid \mathcal{G}_U)}\big[\log p_\theta(Y_U \mid \mathcal{G}_U)\big]. \qquad (16)$$
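Both objectives can be rendered as standard cross-entropy terms, where the other network's (detached) predictive distribution serves as a soft target on the unlabeled graphs. The sketch below is a minimal reading of Eqs. (15) and (16) under that interpretation; the function names and the unit weighting of the two terms are our own assumptions.

```python
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a soft target distribution: -sum_y target(y) * log p(y)."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def e_step_loss(q_logits_lab, y_lab, q_logits_unlab, p_logits_unlab):
    """Eq. (15): supervised loss of q_phi on labeled graphs plus agreement with the
    fixed GNN-based network p_theta on unlabeled graphs."""
    supervised = F.cross_entropy(q_logits_lab, y_lab)
    agreement = soft_cross_entropy(q_logits_unlab,
                                   F.softmax(p_logits_unlab, dim=-1).detach())
    return supervised + agreement

def m_step_loss(p_logits_lab, y_lab, p_logits_unlab, q_logits_unlab):
    """Eq. (16): supervised loss of p_theta on labeled graphs plus knowledge distilled
    from the fixed kernel-based network q_phi on unlabeled graphs."""
    supervised = F.cross_entropy(p_logits_lab, y_lab)
    distillation = soft_cross_entropy(p_logits_unlab,
                                      F.softmax(q_logits_unlab, dim=-1).detach())
    return supervised + distillation
```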
We summarize the detailed optimization procedure for KGNN in Algorithm 1. The model alternates between optimizing the objective function with regard to $\phi$ and $\theta$, which is referred to as the E-step and the M-step. During each step, both labeled and unlabeled data are leveraged for training. In addition to training the two networks with labeled data, we select samples with high confidence from the unlabeled data for regularization, so that both networks mutually enhance each other. Specifically, given the unlabeled data and their predicted categories, we rank these data according to the probabilities under the two distributions (i.e., $p_\theta$ and $q_\phi$) and then select the top-$k$ data from the corresponding ranked lists, similar to self-training methods (Lee et al., 2013). Furthermore, to eliminate the influence of incorrect annotations in the selected samples, we take the intersection of the samples predicted by both networks and regard them as an additional labeled set to enlarge the labeled data; the experimental analysis in the next section shows that such consistently predicted data are helpful for improving the final performance. The whole iterative procedure continues until convergence or until the unlabeled data are exhausted.
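A sketch of the sample-selection step in Algorithm 1: each network ranks the unlabeled graphs by its predictive confidence, the top-$k$ of each ranked list are taken, and only graphs on which the two networks agree (identical predicted labels) are moved into the labeled set. The function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def select_confident_agreement(p_logits_unlab, q_logits_unlab, top_k):
    """Return indices (and predicted labels) of unlabeled graphs that are among the
    top-k most confident predictions of BOTH networks and receive the same label."""
    p_conf, p_pred = F.softmax(p_logits_unlab, dim=-1).max(dim=-1)
    q_conf, q_pred = F.softmax(q_logits_unlab, dim=-1).max(dim=-1)

    k = min(top_k, p_conf.numel())
    top_p = set(torch.topk(p_conf, k).indices.tolist())
    top_q = set(torch.topk(q_conf, k).indices.tolist())

    selected, labels = [], []
    for i in sorted(top_p & top_q):           # intersection of the two ranked lists
        if int(p_pred[i]) == int(q_pred[i]):  # keep only consistent predictions
            selected.append(i)
            labels.append(int(p_pred[i]))
    return selected, labels
```

In each outer iteration, the E-step and M-step objectives above are optimized in turn, and the selected graphs (with their agreed labels) are appended to the labeled set before the next iteration.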
5. Experiment
To verify the effectiveness of our proposed method, we first introduce the experimental settings and then present the detailed experiment results and performance analysis.
5.1. Experimental Settings
5.1.1. Evaluation Datasets.
We conduct extensive experiments on seven widely-used semi-supervised graph classification datasets, including two bioinformatics datasets (i.e., PROTEINS (Borgwardt et al., 2005) and DD (Dobson and Doig, 2003)), four social network datasets (i.e., IMDB-B, IMDB-M, REDDIT-B, REDDIT-M-5k (Yanardag and Vishwanathan, 2015)) and one scientific collaboration dataset (i.e., COLLAB (Yanardag and Vishwanathan, 2015)). We summarize the statistics of datasets in Table 1. Following InfoGraph (Sun et al., 2020), we use all-ones encoding as input node features when node attributes are not available in the datasets.
Data Preparation. As these datasets contain labels for all graphs, we need to construct datasets applicable to the semi-supervised learning scenario. For each dataset, we first split the graphs into TRAIN, VAL and TEST sets according to a 7:1:2 ratio. From the TRAIN data, we further sample 2/7 of the graphs as labeled data, denoted as TRAIN-L, and use the rest as unlabeled data TRAIN-U. We use the validation data VAL for hyper-parameter selection and report results on the TEST set.
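The split described above can be realized with a simple index shuffle; the following is a minimal sketch assuming the graphs are stored as a Python list, with the 7:1:2 and 2/7 ratios taken from the text and the function name chosen for illustration.

```python
import random

def split_dataset(graphs, seed=0):
    """Split indices into TRAIN/VAL/TEST (7:1:2), then split TRAIN into the
    labeled part TRAIN-L (2/7 of TRAIN) and the unlabeled part TRAIN-U."""
    idx = list(range(len(graphs)))
    random.Random(seed).shuffle(idx)

    n = len(idx)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]

    n_labeled = int(len(train) * 2 / 7)
    train_l, train_u = train[:n_labeled], train[n_labeled:]
    return train_l, train_u, val, test
```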
5.1.2. Compared Methods
We compare our KGNN with the following methods, which can be divided into three categories: traditional graph methods, traditional semi-supervised learning methods, and graph-specific semi-supervised learning methods. Traditional graph methods include the Graphlet Kernel (GK) (Shervashidze et al., 2009), Shortest Path Kernel (SP) (Borgwardt and Kriegel, 2005), Weisfeiler-Lehman Kernel (WL) (Shervashidze et al., 2011), Deep Graph Kernel (DGK) (Yanardag and Vishwanathan, 2015), Sub2Vec (Adhikari et al., 2018) and Graph2Vec (Narayanan et al., 2017). Traditional semi-supervised learning methods include EntMin (Grandvalet and Bengio, 2005), Π-Model (Tarvainen and Valpola, 2017), Mean-Teacher (Tarvainen and Valpola, 2017) and VAT (Miyato et al., 2018). Graph-specific semi-supervised methods include InfoGraph (Sun et al., 2020), ASGN (Hao et al., 2020), GraphCL (You et al., 2020) and JOAO (You et al., 2021). Note that we do not compare with SEAL-AI (Li et al., 2019) because it requires explicit relations among the graph instances, which are not available in our datasets.
5.1.3. Parameter Settings.
For the proposed KGNN, we use GIN (Xu et al., 2019) to parameterize the GNN-based network $p_\theta$, consisting of three graph convolutional layers and one sum-pooling layer followed by the softmax function, which is the same as InfoGraph (Sun et al., 2020). The kernel-based network $q_\phi$ is implemented as a memory network with 3 hops, in which the Weisfeiler-Lehman subtree kernel (Shervashidze et al., 2011) serves as the base graph kernel. For a fair comparison, we set the batch size to 32 and the number of epochs to 20 for all datasets. The embedding dimensions of the hidden layers are set to 32 for the bioinformatics datasets and 64 for the social network and scientific collaboration datasets. We apply dropout (Srivastava et al., 2014) in our KGNN model. In each iteration, $p_\theta$ and $q_\phi$ are trained with the Adam optimizer (Kingma and Ba, 2015), using an initial learning rate of 0.01 and a weight decay of 0.0005.
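For reference, the hyper-parameters stated above can be gathered into a single configuration; this sketch uses our own key names and shows the setting for a bioinformatics dataset (the dropout ratio is not restated here and is therefore left out).

```python
# Hyper-parameter configuration for a bioinformatics dataset (e.g., PROTEINS);
# for the social network and collaboration datasets, hidden_dim would be 64.
config = {
    "batch_size": 32,
    "epochs": 20,
    "hidden_dim": 32,           # embedding dimension of the hidden layers
    "gnn_layers": 3,            # graph convolutional layers in the GNN-based network
    "memory_hops": 3,           # hops in the kernel-based memory network
    "base_kernel": "WL-subtree",
    "optimizer": "Adam",
    "lr": 0.01,
    "weight_decay": 0.0005,
}
```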
Datasets | Graph Num. | Avg. Nodes | Avg. Edges | Classes |
---|---|---|---|---|
PROTEINS | 1113 | 39.06 | 72.82 | 2 |
DD | 1178 | 284.32 | 715.66 | 2 |
IMDB-B | 1000 | 19.77 | 96.53 | 2 |
IMDB-M | 1500 | 13.00 | 65.94 | 3 |
REDDIT-B | 2000 | 429.63 | 497.75 | 2 |
REDDIT-M-5k | 4999 | 508.52 | 594.87 | 5 |
COLLAB | 5000 | 74.49 | 2457.78 | 3 |
Methods | PROTEINS | DD | IMDB-B | IMDB-M | REDDIT-B | REDDIT-M-5k | COLLAB
---|---|---|---|---|---|---|---
GK (Shervashidze et al., 2009) | | | | | | |
SP (Borgwardt and Kriegel, 2005) | | | | | | |
WL (Shervashidze et al., 2011) | | | | | | |
DGK (Yanardag and Vishwanathan, 2015) | | | | | | |
Sub2Vec (Adhikari et al., 2018) | | | | | | |
Graph2Vec (Narayanan et al., 2017) | | | | | | |
EntMin (Grandvalet and Bengio, 2005) | | | | | | |
Π-Model (Tarvainen and Valpola, 2017) | | | | | | |
Mean-Teacher (Tarvainen and Valpola, 2017) | | | | | | |
VAT (Miyato et al., 2018) | | | | | | |
InfoGraph (Sun et al., 2020) | | | | | | |
ASGN (Hao et al., 2020) | | | | | | |
GraphCL (You et al., 2020) | | | | 43.7 ± 1.3 | | |
JOAO (You et al., 2021) | | | | | | |
KGNN (Ours) | 70.9 ± 0.5 | 70.5 ± 0.6 | 72.5 ± 1.6 | 43.3 ± 0.7 | 76.0 ± 0.9 | 44.8 ± 0.6 | 67.4 ± 0.5
5.2. Results
We report the quantitative results of semi-supervised graph classification with half of the labeled data TRAIN-L in Table 2. From this table, we make the following observations. (1) Most of the traditional graph methods perform worse than the other methods, which shows that graph neural networks possess strong representation-learning ability to extract useful information from graph-structured data. (2) The methods with traditional semi-supervised learning techniques (EntMin, Π-Model, Mean-Teacher and VAT) show worse performance compared with the graph-specific semi-supervised learning approaches, indicating that recent graph-specific semi-supervised learning techniques are more suitable for the challenging graph classification task. (3) Among the previous state-of-the-art approaches, GraphCL achieves almost the best performance on most datasets. Other graph-specific semi-supervised learning approaches are not as effective as GraphCL, possibly because GraphCL takes advantage of instance discrimination for contrastive learning together with specific graph augmentation strategies, which further improves performance. (4) Our framework KGNN achieves the best performance on six of the seven datasets, which demonstrates its effectiveness. From the results, we attribute the performance gain to two aspects: (i) Introduction of graph kernels. Graph kernels acting as structured constraints are able to guide the training process of graph neural networks. (ii) Introduction of the posterior regularization framework. Instead of simply integrating graph neural networks and graph kernels, we use a principled way to combine both worlds, so that they can mutually enhance each other.
Performance on Different Amounts of Labeled Data. We vary the ratio of labeled data used for training to show the performance of different models in Figure 4, taking PROTEINS and IMDB-B as examples. We omit the traditional methods since they do not have competitive performance. From the results, we find that the performance of all methods improves as the amount of available labeled data increases, demonstrating that adding more labeled data is an effective way to boost performance. Among all methods, the proposed KGNN achieves the best performance, which indicates that explicitly incorporating graph kernels as structured constraints into graph neural networks further facilitates graph classification.


5.3. Ablation Study
In the KGNN model, either the GNN-based network or the kernel-based network is enhanced by incorporating extra “labels” selected from both networks. A more straightforward way could be empowering every single network with “labels” generated by itself, which is known as the self-training method. Next, we investigate a few variants that are trained under the self-training framework:
- GNN-Sup. Our base model, which trains a GNN (i.e., the network $p_\theta$) solely on the labeled data TRAIN-L in a fully supervised manner.
- MemNN-Sup. A counterpart model, which optimizes a MemNN (i.e., the network $q_\phi$) solely on the labeled data in a supervised way.
- GNN-Self. A model variant in which we optimize a GNN (i.e., $p_\theta$) using a self-training strategy.
- Ensemble-Self. Instead of using a single network, we first ensemble the GNN-based network and the kernel-based network by concatenating their graph representations before the final prediction layer, and then train the ensembled model with self-training.
- KGNN-Sep. KGNN-Sep differs from the original KGNN in that we directly feed the label annotations produced by one network into the other network without checking prediction consistency.
Methods | PROTEINS | DD | IMDB-B | IMDB-M | REDDIT-B | REDDIT-M-5k | COLLAB
---|---|---|---|---|---|---|---
GNN-Sup | | | | | | |
MemNN-Sup | | | | | | |
GNN-Self | | | | | | |
Ensemble-Self | | | | | | |
KGNN-Sep | | | | | | |
KGNN (Ours) | 70.9 ± 0.5 | 70.5 ± 0.6 | 72.5 ± 1.6 | 43.3 ± 0.7 | 76.0 ± 0.9 | 44.8 ± 0.6 | 67.4 ± 0.5
Results and Analysis. The results of the above model variants are summarized in Table 3. We can see that GNN-Sup outperforms MemNN-Sup on most datasets, possibly because GNN-based methods can take advantage of node attributes while graph kernel-based methods cannot. Also, GNN-Self outperforms GNN-Sup on 5 out of 7 datasets. On the DD and REDDIT-B datasets, GNN-Self performs worse than GNN-Sup, which may be attributed to the noisy “labels” generated by the model itself. Importantly, we find that Ensemble-Self compares favorably with the above three model variants in most cases, which implies that explicitly incorporating graph similarity indeed benefits performance. Besides, by combining the two networks in a more principled way, KGNN-Sep outperforms the above four models on all datasets, showing the effectiveness of our proposed posterior regularization. Finally, with the intersection strategy to eliminate the influence of incorrect annotations, our proposed KGNN achieves the best performance on all datasets, which is in accordance with our expectations.
5.4. Parameter Analysis
Finally, we examine how KGNN's performance varies w.r.t. different parameter settings. Specifically, we investigate the effect of the embedding dimensions of the hidden layers and the number of hops in the memory network on two datasets, PROTEINS and COLLAB.
Effect of the Embedding Dimensions. We first analyze the effect of the embedding dimensions of the hidden layers. We expect the model to perform well as the dimension increases due to the benefit of augmented model capacity. We fix all other parameters to the values that yield the best results and vary the embedding dimension; the obtained results are shown in Figure 6. We observe that larger embedding dimensions generally lead to better performance before saturation. The model needs larger embedding dimensions on COLLAB than on PROTEINS to achieve the best performance, as COLLAB is larger in scale and requires a model with larger capacity to fit.
Effect of the Number of Hops. We further assess the effect of the number of hops in the memory network. We vary the number of hops from 1 to 6 while fixing all other parameters. As shown in Figure 6, increasing the number of hops results in better performance when the number is small, showing that the multi-hop mechanism provides broader attention and helps in learning correlations between memories. However, too many hops may hurt performance, possibly because the query graph then focuses too heavily on its similarity to the in-memory graphs, which leads to overfitting and weakens the generalization of the similarity comparison to unseen graphs.


5.5. Case Study
In this section, we study the power of the kernel-based network to validate the strength of our framework from the view of topological similarity. Figure 7 illustrates some of the results obtained on REDDIT-M-5k, where two unlabeled samples are annotated by both the GNN-based network $p_\theta$ and the kernel-based network $q_\phi$. We observe that both samples are classified into wrong categories by $p_\theta$, while $q_\phi$ successfully corrects the wrong predictions, which shows that the kernel-based network is a necessary supplement for effective graph classification. The reason may be that the kernel-based network captures more relationships between different substructures and thus acts as a structured constraint to guide the GNN-based network.

6. Conclusion
This paper studies semi-supervised graph classification, which is a fundamental problem in graph-structured data modeling, and proposes a novel approach called KGNN. KGNN adopts posterior regularization to incorporate graph kernels as structured constraints that guide the training of graph neural networks under an EM-style framework. Specifically, in the M-step, we employ a GNN-based network, which learns effective graph representations for prediction. In the E-step, we develop a kernel-based network using a memory network to capture graph similarity efficiently. Experiments on a range of well-known benchmark datasets demonstrate the effectiveness of KGNN for graph classification. In the future, we plan to extend KGNN to a broader range of applications, such as drug discovery and materials science.
Acknowledgements.
This paper is partially supported by the National Key Research and Development Program of China under Grant No. 2018AAA0101902, as well as the National Natural Science Foundation of China (NSFC Grant No. 62106008 and No. 62006004).
References
- Adhikari et al. (2018) Bijaya Adhikari, Yao Zhang, Naren Ramakrishnan, and B Aditya Prakash. 2018. Sub2vec: Feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 170–182.
- Borgwardt and Kriegel (2005) Karsten M Borgwardt and Hans-Peter Kriegel. 2005. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining. 8 pp.
- Borgwardt et al. (2005) Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. 2005. Protein function prediction via graph kernels. Bioinformatics 21 (2005), 47–56.
- Chen et al. (2020) Dexiong Chen, Laurent Jacob, and Julien Mairal. 2020. Convolutional Kernel Networks for Graph-Structured Data. In Proceedings of International Conference on Machine Learning. 1576–1586.
- Dobson and Doig (2003) Paul D Dobson and Andrew J Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330, 4 (2003), 771–783.
- Du et al. (2019) Simon S. Du, Kangcheng Hou, Ruslan Salakhutdinov, Barnabás Póczos, Ruosong Wang, and Keyulu Xu. 2019. Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels. In Proceedings of the Annual Conference on Neural Information Processing Systems. 5724–5734.
- Ganchev et al. (2010) Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research 11 (2010), 2001–2049.
- García and Moreno (2004) Cristina García and José Alí Moreno. 2004. The hopfield associative memory network: Improving performance with the kernel “trick”. In Ibero-American Conference on Artificial Intelligence. Springer, 871–880.
- Gärtner et al. (2003) Thomas Gärtner, Peter Flach, and Stefan Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives. In Proceedings of Computational Learning theory and kernel machines. 129–143.
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of International Conference on Machine Learning. 1263–1272.
- Grandvalet and Bengio (2005) Yves Grandvalet and Yoshua Bengio. 2005. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems. 529–536.
- Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
- Hao et al. (2020) Zhongkai Hao, Chengqiang Lu, Zhenya Huang, Hao Wang, Zheyuan Hu, Qi Liu, Enhong Chen, and Cheekong Lee. 2020. ASGN: An active semi-supervised graph neural network for molecular property prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 731–752.
- Hinton et al. (2014) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning NIPS Workshop.
- Kashima et al. (2003) Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized kernels between labeled graphs. In Proceedings of international conference on machine learning. 321–328.
- Khasahmadi et al. (2020) Amir Hosein Khasahmadi, Kaveh Hassani, Parsa Moradi, Leo Lee, and Quaid Morris. 2020. Memory-based graph networks. In Proceedings of International Conference on Learning Representations.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations.
- Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of International Conference on Learning Representations.
- Kojima et al. (2020) Ryosuke Kojima, Shoichi Ishida, Masateru Ohta, Hiroaki Iwata, Teruki Honma, and Yasushi Okuno. 2020. kGCN: a graph-based deep learning framework for chemical structures. Journal of Cheminformatics 12 (2020), 1–10.
- Kriege et al. (2020) Nils M Kriege, Fredrik D Johansson, and Christopher Morris. 2020. A survey on graph kernels. Applied Network Science 5, 1 (2020), 1–42.
- Laine and Aila (2017) Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In Proceedings of International Conference on Learning Representations.
- Lee et al. (2013) Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML.
- Lee et al. (2019) Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In Proceedings of International Conference on Machine Learning. 3734–3743.
- Li et al. (2019) Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference. 972–982.
- Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 8 (2018), 1979–1993.
- Narayanan et al. (2017) Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005 (2017).
- Nowicki and Siegelmann (2010) Dimitri Nowicki and Hava Siegelmann. 2010. Flexible kernel memory. PloS One 5, 6 (2010), e10955.
- Pavlopoulos et al. (2011) Georgios A Pavlopoulos, Maria Secrier, Charalampos N Moschopoulos, Theodoros G Soldatos, Sophia Kossida, Jan Aerts, Reinhard Schneider, and Pantelis G Bagos. 2011. Using graph theory to analyze biological networks. BioData mining 4, 1 (2011), 1–27.
- Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011), 2539–2561.
- Shervashidze et al. (2009) Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Proceedings of International Conference on Artificial Intelligence and Statistics. 488–495.
- Song et al. (2019) Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. 2019. Session-based social recommendation via dynamic graph attention networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 555–563.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In Advances in Neural Information Processing Systems. 2440–2448.
- Sun et al. (2020) Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In Proceedings of International Conference on Learning Representations.
- Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems. 1195–1204.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. In Proceedings of International Conference on Learning Representations.
- Weisfeiler and Lehman (1968) Boris Weisfeiler and Andrei A Lehman. 1968. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2, 9 (1968), 12–16.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks?. In Proceedings of International Conference on Learning Representations.
- Yanardag and Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 1365–1374.
- Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems. 4805–4815.
- You et al. (2021) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph Contrastive Learning Automated. arXiv preprint arXiv:2106.07594 (2021).
- You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812–5823.