HopGAT: Hop-aware Supervision Graph Attention Networks for Sparsely Labeled Graphs

Chaojie Ji [email protected] Ruxin Wang [email protected] Rongxiang Zhu [email protected] Yunpeng Cai [email protected] Hongyan Wu [email protected] Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Abstract

Due to the cost of labeling nodes, classifying a node in a sparsely labeled graph while maintaining the prediction accuracy deserves attention. The key point is how the algorithm learns sufficient information from more neighbors with different hop distances. This study first proposes a hop-aware attention supervision mechanism for the node classification task. A simulated annealing learning strategy is then adopted to balance two learning tasks, node classification and the hop-aware attention coefficients, along the training timeline. Compared with state-of-the-art models, the experimental results proved the superior effectiveness of the proposed Hop-aware Supervision Graph Attention Networks (HopGAT) model. Especially, for the protein-protein interaction network, in a 40% labeled graph, the performance loss is only 3.9%, from 98.5% to 94.6%, compared to the fully labeled graph. Extensive experiments also demonstrate the effectiveness of supervised attention coefficient and learning strategies.

keywords:

Attention supervision , Node classification , Hop-aware , Graph attention network

^†^†journal: review

1 Introduction

Node classification is used to predict the class of unlabeled nodes given a partially labeled graph. Node classification is one of the most important applications in analyzing graphs in various areas, including document classification in social science [1], disease prediction in bioinformatics [2], and department classification of an employer in communication networks [3, 4]. However, labeling nodes for a training task is time consuming and sometimes very expensive [5]. For example, to initially acquire the disease label of a group of disease-causing genes, one must sequence sufficient patient and normal samples [6]. Being able to predict a node class in a sparsely labeled graph while maintaining the prediction accuracy deserves more attention [7].

To address insufficiently labeled data, researchers widely adopt semi-supervised learning, in which both labeled and unlabeled neighbor nodes are utilized [8, 9]. A general assumption in homogeneous graph network research is that a node usually possesses more similar information to its immediate neighbors [10]. These studies learn node representations on the surrounding nodes or link information via the convolution operation [11, 12]. Furthermore, graph attention networks (GATs) [13] and various variants [14, 15] have been proposed to quantify the closeness of node pairs through attention [16].

Refer to caption — Figure 1: The classification accuracy highly depends on the four colored labeled nodes.

However, the deep learning algorithms [17, 18], such as Graph Convolutional Networks (GCNs) [19, 20] and GATs [13, 21], strongly depend on the labeled nodes to train a prediction model, and thus, the performance is limited by the scale of the labeled data. As seen in Figure 1, colored nodes are labeled, while gray nodes are unlabeled. To predict the labels of Node C, the algorithms must fully take advantage of the information from Nodes A, B, D and E. Therefore, the prediction accuracy strongly depends on the four labeled color nodes.

The number of hops, namely, the hop value, is adopted to describe the distance or neighborhood relationship between two nodes in the graph network. The general assumption is that neighbors with different hop values have different influences on their center node. For classification tasks on sparsely labeled graphs, in which a center node has very few labeled neighbors, it is essential to learn more information from more neighbors with different hop values. The key point is how the algorithm learns sufficient semantic information from neighbors with different hop values.

1.1 Motivation

In this study, we make two key observations.

Observation 1: The class labels of neighboring nodes with a smaller hop value are more likely to be consistent with those of their center nodes in the homogeneous graph network.

We examined the data set Cora [22], which is usually used in scientific publication classification tasks, and recorded the label consistence rate, the proportion of neighboring nodes having the same class label as their center nodes. We recorded this consistency rate under different sets of neighboring nodes with a given hop value. In Figure 2, the x-axis is the given hop value, and the y-axis is the label consistency rate.

From the visualization, we can observe that the neighboring nodes with a smaller hop value are more consistent with their center nodes. This is consistent with the general assumption in homogeneous graph networks that a node is more likely similar to its immediate neighbors. This experiment further extends the assumption to neighbors with larger hop values and proves the rationality of introducing hop values to discriminate the closeness of two nodes.

Observation 2: General attention models are unable to automatically learn sufficient semantic information from neighbors with different hop values.

Based on the motivation from Observation 1, we checked the attention coefficients, which should be different between nodes with different hop values since the neighboring nodes with a smaller hop value are more consistent with their center nodes.

We trained a GAT on the Cora citation dataset with the hyper-parameter settings mentioned in [13]. The max hop value is set to 2. All attention coefficients produced by the GATs during the different training epochs are visualized in Figure 3. In this figure, the horizontal axis shows the value of the attention coefficients, the vertical axis records the occurrence number of the coefficients, and the z-axis represents the training epochs. This figure illustrates a Gaussian distribution with only one peak. However, we expect a distribution that has multiple peaks or clear boundaries corresponding to different hop values. This implies that unsupervised attention mechanisms are generally unable to automatically learn the correlation or semantic information for nodes with different hop values in graph networks. This observation is consistent with the work [23] in that attention should be supervised under different conditions.

1.2 Contributions

In this study, we suggest a HopGAT model to address node classification on a sparsely labeled graph. Compared with the general node classification error, we jointly supervise hop-aware attention coefficients in the loss function. Our contributions are described in the following:

1.

To the best of our knowledge, this paper is the first study to propose a hop-aware attention supervision for the node classification task.
2.

We encoded hop values and embedded them into the graph nodes. Subsequently, the graph nodes were simultaneously encoded with semantics and graph structure information.
3.

We proposed a simulated annealing strategy to simultaneously balance two learning tasks, node classification and attention coefficients, along the training timeline.
4.

Experimental results prove the effectiveness of the proposed HopGAT model over the state-of-the-art baselines and quantify our improvement on the sparsely labeled datasets. In addition, the extensive experiments also show the effectiveness of supervised attention coefficients and learning strategy.

2 Related Work

2.1 Weakly-supervised Learning in Graph

Compared with traditional supervised learning, weakly supervised learning aims to address the situation in which precise or sufficient labels are unachievable [24, 25]. In the graph domain, semi-supervised learning is widely used to address incomplete labels in a graph, as in [26, 27]. These studies mainly focus on graph representation. Vashishth et al. proposed ConfGCN to introduce the concept of the locality of labels [28]. They used additional label distribution and co-variance matrices derived from limited labeled nodes. None of these studies took advantage of implicit information in unlabeled neighbors and their hop values. Yang et al. proposed a method whereby node embedding is trained to predict the class label [8]. They used neighbor context to train the node embedding similar to a Skipgram-like model. The coefficients between two nodes are not computed explicitly or trained in the prediction model. Inspired by the locality of labels in ConfGCN and neighbor context in Yang’s methods, we conducted a further analysis of the changing of the node labels with different hop values. Based on the observation that neighboring nodes with smaller hop values are more likely to be consistent with their center nodes, we embed the hop value information in the classification function in this study.

2.2 Attention in Graph

Given the trend whereby convolution operations are being generalized into arbitrary graph networks, more effective methods to aggregate neighboring nodes and locate “closest” nodes from center nodes are desired, as in [29, 30]. An attention mechanism aids a model to “focus on the most relevant parts of the input to make decisions” [31, 32]. Although attention is widely used in locating the closest nodes and have achieved state-of-the-art performance [33], the studies in [34] showed that attention should be supervised to obtain a better performance under different conditions. In this study, we jointly supervise the hop-aware attention coefficients and the node classification error in the loss function to better train an algorithm given insufficiently labeled nodes.

2.3 Supervision on Attention in Graph

To better understand attention mechanism in graph convolution network, Knyazev et al. proposed ChebyGIN and imposed a supervision on attention coefficients as a controlled environment [23]. They noticed the influence of attention and mentioned that the accuracy of a model could depend exponentially on attention correctness. The proposed mechanism of attention supervision is goal-directed which can’t be directly applied in other tasks, such as node classification.

3 Preliminary

Before introducing our proposed method, we provide a brief overview of the semi-supervised GATs which are composed of several single graph attentional layer. We only describe a single graph attention layer here.

Given a graph, the input to current layer are defined as node features, i.e. $h=\{h_{1},h_{2},\cdots,h_{n}\}$ , where $n$ is the number of nodes.

The attention coefficients between two nodes can be calculated as follows:

\alpha_{ij}=\frac{exp(a(h_{i},h_{j}))}{\sum_{k\in\mathcal{N}_{i}}exp(a(h_{i},h_{k}))}

(1)

where $exp$ is the exponential function, $\mathcal{N}_{i}$ is the neighboring nodes of $i$ in the graph, and $a$ is a function used to estimate the importance of one node to another.

Once obtained, the attention coefficients of node $i$ are used to compute a linear combination of the features corresponding to its neighboring nodes, which can be considered as the updated output features $h^{\prime}_{i}$ for node $i$ :

h^{\prime}_{i}=\sigma(\sum_{j\in\mathcal{N}_{i}}\alpha_{ij}Wh_{j})

(2)

where $W$ is a weight matrix.

Similar to the Transformer [32], multi-head attention is employed in GATs. $K$ independent attention mechanisms execute a transformation as in Equation 2. Then, the produced features are concatenated as

h^{\prime}_{i}={||}^{K}_{k=1}\sigma(\sum_{j\in\mathcal{N}_{i}}\alpha^{k}_{ij}W^{k}h_{j})

(3)

where $||$ denotes the concatenation operation, and $k$ corresponds to the $k$ th head.

For the final layer, GATs employ averaging and nonlinearity to a specific task:

h^{\prime}_{i}=\sigma(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}_{i}}\alpha^{k}_{ij}W^{k}h_{j})

(4)

4 Method

In this section, we describe the architecture of our model - HopGAT. The model has three main components: a hop encoding and attention mechanism, attention supervision, and a learning strategy, as shown in Figure 4. The other modules shown in the figure are inherited from the GATs.

In component 1, we first encode the hop value into a vector and embed the vector into each node feature. Then, we calculate the attention coefficients of each center node to its neighboring nodes with hop information. Finally, we apply multi-head hop attention for each layer, which can draw different features of the graph network. In the next layer, the node features are updated according to features from the previous layer, and the attention coefficients for each node pair are delivered to the attention supervision. In component 2, all the attention coefficients from every head are collected. The gap between the computed coefficients and the defined ground-truth coefficients are summed to form an attention loss, which will be one part of the total loss function used to train the prediction model. In the last layer, the node classification loss is calculated. Once the attention loss and node classification loss are obtained, our learning strategy is performed in component 3 to balance these two types of losses during the training procedure.

4.1 Attention Mechanism

4.1.1 Hop Encoding

Existing attention-based graph methods usually specify the maximum hop value between a center node and its neighbors, e.g. 1, in GATs. Then, the numerical hop value in a graph is coarsely equivalent to a Boolean variable representing whether a node is a neighbor of another node. However, as we have noted in Observation 1, the class labels of neighboring nodes with a smaller hop value are more likely to be consistent with those of their center nodes in a homogeneous graph network, and the hop value offers more information than a Boolean variable.

In this study, a hop value is used to express the closeness or similarity of a node to its center node. Therefore, we encode a hop value into a $d$ -dimensional vector as $he$ :

inv_{i}=exp(i*(-\frac{log(max_{hv})}{max(\frac{d}{2}-1,1)}))

(5)

he_{hv,i}=\left\{\begin{aligned} sin(hv*inv_{i}),&&i<=d/2\\ cos(hv*inv_{i}),&&i>d/2\\ \end{aligned}\right.

(6)

where $max_{hv}$ is the pre-defined maximal hop value, $max(\cdot)$ is a maximum value function, $d$ is the dimension of the hop embedding, $i$ is the index of the dimensions, $hv$ is the corresponding hop value, $sin$ and $cos$ are sine and cosine functions, respectively, and $he_{hv}$ represents the hop encoding for the hop value $hv$ .

Through this definition, not only the absolute hop value but also the relative hop value can be learned since, as a fixed offset $p$ , $he_{hv+p}$ can be transformed as a linear function of $he_{hv}$ . Since the number of heads in each layer can be different, the dimensions of nodes in each hidden layer are also different. Therefore, the dimension of the hop embedding in each layer may be different.

4.1.2 Attention Mechanism

We propose two attention mechanisms based on the above hop encoding: product-based attention and addition-based attention.

For a node pair $(i,j)$ , $i$ and $j$ are the center and neighboring nodes, respectively. $hv_{i,j}$ is the hop value from $i$ to $j$ . The hop encoding for the $t$ th layer is written as $he^{t}$ , where each line represents a hop encoding for a hop value.

1.

Product-based Attention This attention mechanism is mainly based on the dot product between the embeddings of two nodes.

$e_{ij}=a_{c}(h_{i})\odot(a_{n}(h_{j})+lookup(a_{he}(he^{t}),hv_{i,j}))$ (7)

$e_{ij}$ indicates the importance of node $j$ to node $i$ , $a_{c}$ , $a_{n}$ and $a_{he}$ are three independent single-layer feedforward neural networks, $\odot$ denotes the dot product operation, and $lookup$ is a function that looks up the hop encoding for the hop value $hv_{i,j}$ in $he^{t}$ .
2.

Addition-based Attention This mechanism is based on the addition operation on two node embeddings. The attention coefficients are formulated as follows:

$\begin{split}e_{ij}=LeakyReLU(lookup(a_{he}(he^{t}),hv_{i,j})\\ \odot(a_{c}(h_{i})||a_{n}(h_{j})))\end{split}$ (8)

where the symbol $||$ is a concatenation operation. The LeakyReLU activation function is used to execute the non-linearization operation.

To make coefficients easily comparable, we normalize them across all neighbors $j$ using the softmax function:

\alpha_{ij}=softmax_{j}(e_{ij})=\frac{exp(e_{ij})}{\sum_{k\in\mathcal{N}_{i}}exp(e_{ik})}

(9)

where $\mathcal{N}_{i}$ is the neighboring node set of node $i$ . Once $\alpha_{ij}$ is obtained, we use Equations 3 - 4 to perform the final node classification.

The hop value of self-connection is defined as 0, and it is also encoded into a non-zero hop encoding vector. Through this design, all attention coefficients are uniformly calculated according to the node features and hops.

4.2 Attention Supervision

4.2.1 Ground-truth Attention

The hop value represented in the graph data is a label-free and effective indicator for quantifying the correlation between two nodes. Specifically, the correlation between two directly connected nodes in the graphs should be assigned a larger value, while the coefficient between two indirectly connected or disconnected nodes is expected to have a smaller value. In addition, a prior investigation showed that “classification accuracy depends exponentially on attention correctness” [23] and that the general attention models cannot effectively automatically learn sufficient semantic information from neighbors with different hop values. All of this motivates us to develop a hop-aware approach to supervise the training process for attention coefficients between two nodes.

Generally, the correlation between two nodes becomes weaker when their hop value is over a certain value. Therefore, we set a boundary for the hop value. The coefficients between the center node and its neighboring nodes are uniformly grouped into a default value when the hop value reaches the maximum value, $max_{hv}$ , which means a weak connection between two nodes. We formulate the definition of ground-truth attention $e^{GT}$ as follows:

e^{GT}_{ij}=\left\{\begin{aligned} 1,&&hv_{i,j}=0\\ 1-hv_{i,j},&&0<hv_{i,j}<max_{hv}\\ 1-max_{hv},&&hv_{i,j}>=max_{hv}\end{aligned}\right.

(10)

This can be interpreted as follows: the larger the hop value is, the smaller the ground-truth attention.

When the hop value is greater than two, a negative ground-truth attention value will be assigned. This negative value does not indicate a negative association between these two nodes. This coefficient has not been normalized by the softmax function as in Equation 9. After normalization, the attention coefficients will be between 0 and 1, indicating the strong and weak correlation between two nodes.

To be more accurate, we use $e^{kl}_{ij}$ to denote the attention coefficient between nodes $i$ and $j$ of the $k$ th head in the $l$ th layer, which are produced during the training procedure as in Equations 7 and 8. We use the mean square error to control the distance between the ground-truth attention $e^{GT}$ and the computed coefficients $e^{kl}_{ij}$ .

\mathcal{L}_{att}=\frac{\sum_{i=1}^{L}\sum_{j=1}^{K}\sum_{l=1}^{N}\sum_{k=1}^{N}{(e_{ij}^{GT}-e_{ij}^{kl})^{2}}}{L*K*N*N}

(11)

$\mathcal{L}_{att}$ will be used as a part of the loss function to supervise the training process for the attention coefficients between two nodes.

4.2.2 Sample Strategy

In total, there are $L*K*N*N$ attention coefficients in the calculation of $\mathcal{L}_{att}$ . To decrease the computation cost, especially in a graph with a large number of nodes, we thus proposed a random sampling strategy, $sample(r)$ , with which a subset of node pairs is sampled with sample ratio $r$ .

The number of node pairs with $hv_{i,j}>=max_{hv}$ is greater than that with less than $max_{hv}$ . For example, in the Citeseer dataset, there are approximately 12,000 node pairs with less than 2 hops, whereas there is approximately 11,000,000 pairs with more hops. For balancing the distribution between node pairs with hop values of greater or less than $max_{hv}$ , $sample(r)$ only samples from the node pairs with hop values greater than $max_{hv}$ . Furthermore, we sample each batch differently to guarantee the diversity of the training data. $\mathcal{L}_{att}$ is calculated as follows:

\mathcal{L}_{att}=\frac{\sum_{(i,j)\in sample(r)}\sum_{l=1}^{L}\sum_{k=1}^{K}{(e_{ij}^{GT}-e_{ij}^{kl})^{2}}}{L*K*count(sample(r))}

(12)

4.3 Learning Strategy

In addition to the loss function $\mathcal{L}_{att}$ used to supervise the attention coefficients, we also include the general node classification loss to measure the classification error.

The final objective for optimization is the linear combination of these two terms.

\mathcal{L}=(1-\gamma)\mathcal{L}_{cls}+\gamma\mathcal{L}_{att}

(13)

where $\gamma$ is used to find a balance between node classification and attention supervision losses.

Inspired by the analysis that attention coefficients are more likely to be imprecise at the beginning of training [23], more powerful supervision of the attention should be imposed at the early stage of the training. From another perspective, $\mathcal{L}_{cls}$ is a “strong” label and is closely related to the final task goal, i.e., node classification, while $\mathcal{L}_{att}$ is auxiliary. Thus, a consideration of balancing them with time is necessary. A simulated annealing procedure is adopted to help the model find the best combination of these two parts of the total loss function.

We first define the transformation of the temperature along the training time:

temp^{t}=\left\{\begin{aligned} temp_{ini},&&t=0\\ temp^{t-1}*\epsilon,&&temp^{t-1}*\epsilon>=temp_{fin}\ and\ t>0\\ temp^{t-1},&&temp^{t-1}*\epsilon<temp_{fin}\ and\ t>0\\ \end{aligned}\right.

(14)

where $temp_{ini}$ and $temp_{fin}$ are the initial and final temperatures, respectively; $temp^{t}$ indicates the temperature at the $t$ th time step; $\epsilon$ is the decay rate; and $temp_{ini}$ , $temp_{fin}$ and $\epsilon$ are all pre-defined hyperparameters.

Then, $\mathcal{L}_{cls}$ and $\mathcal{L}_{att}$ are biased through $\gamma$ along the training time as follows:

\gamma=\left\{\begin{aligned} min(exp(-\frac{\frac{1}{\mathcal{L}_{att}}}{temp^{t}}),\gamma_{str}),&temp^{t-1}*\epsilon<temp_{fin}\\ exp(-\frac{\frac{1}{\mathcal{L}_{att}}}{temp^{t}}),&otherwise\end{aligned}\right.

(15)

where $min$ is the minimum function. $\gamma_{str}$ is a hyperparameter designed to prevent a sharp increase of $\mathcal{L}_{att}$ at the tail of the training process. The learning procedure will be explained in greater detail in the experiment section.

$\mathcal{L}_{cls}$ measures the classification error and is applied to all the labeled nodes during the training procedure. If the labeled nodes in the graph network are insufficient, the trained model will not be sufficiently accurate. In this study, the hop value is introduced as an effective supplementary to the insufficient labels to train the model but without increasing the labeling cost.

5 Experiments

In this section, we evaluate the performance of the proposed hop-aware attention supervision model to address the node classification task on a sparsely labeled graph network. We also investigate the effectiveness of the supervised attention coefficients and our learning strategy.

5.1 Dataset

This experiment includes two types of tasks: inductive learning and transductive learning. For the semi-supervised tasks, if the unlabeled test nodes do not participate in the training procedure, we call this task inductive learning, whereas if the unlabeled test data are all observed and utilized during the training phase, the task is transductive learning.

Cora, Citeseer and PubMed are chosen as our benchmark datasets for transductive tasks [22]. In all of these datasets, nodes denote documents, and edges correspond to citation relations. Node features are elements of a bag-of-words representation of a document. The task is to predict the unique document class among multiple documents. The protein-protein interaction (PPI) dataset is used to evaluate inductive tasks as in [35]. Each node is a protein. Positional and motif gene sets and immunological signatures are used to represent a protein. It is a multi-label task that simultaneously predicts multiple protein functions (labels). We used the preprocessed data from [36] in our experiments.

We evaluate how the proposed algorithm works on the insufficiently labeled graph network. Therefore, we change the proportions of the labeled nodes in the training set. We then reorganize all the datasets.

	#Nodes	#G	#F	# C	#VN	#TN	# Training Nodes with Label Rate (%)
							20	40	60	80	100
Cora	2708	1	1433	7	500	1000	242	484	725	967	1208
Citeseer	3327	1	3703	6	500	1000	363	725	1008	1450	1812
PubMed	19717	1	500	3	500	1000	3644	7287	10931	14574	18217
PPI	56944	24	50	121	6514	5524	8982	17963	26944	35925	44906

Table 1: Number of randomly sampled nodes. Column #G, #F, #C, #VN and #TN denote the number of graphs, features, classes, validation nodes and test nodes, respectively.

The validation and test sets completely follow the experimental setup of [13]. In all transductive tasks, 500/1000 nodes serve for the validation/test sets; 6514/5524 nodes (in 2/2 graphs) are used for the validation/test sets in the inductive tasks. The remaining nodes are placed into the training set. Then, we randomly sample nodes at rates of 20%, 40%, 60%, 80%, and 100% without replacement from the training set for each dataset. We reserve the labels of the sampled nodes but mask the labels of other nodes. Subsequently, we obtained 5 variants for each dataset. The details of the number of nodes, classes, graphs and features in each dataset are listed in Table 1.

5.2 Experimental Setup

We compared the proposed HopGAT against state-of-the-art methods. GATs ¹¹1https://github.com/PetarV-/GAT, GCNs ²²2https://github.com/tkipf/gcn [19] and ConfGCN ³³3https://github.com/malllabiisc/ConfGCN [28] were chosen for the transductive tasks. GraphSAGEs⁴⁴4https://github.com/williamleif/GraphSAGE [36] and GATs were chosen for the inductive tasks. Furthermore, we selected different variants of GraphSAGEs, i.e., GraphSAGE with the mean-based aggregator (Mean), LSTM-based aggregator (Seq), max-pooling aggregator (Maxpool), mean-pooling aggregator (Meanpool) and GCN-based aggregator (GCNagg). For each task, we run the same experiment five times and record the average performances and the standard deviations. For the inductive tasks, we used the metric of the micro-averaged $F_{1}$ score and the accuracy for the transductive tasks instead.

	$dp_{1}$	$dp_{2}$	$dp_{3}$	$L_{2}$	Attention	# L	# Heads	# Features	BS	LR
Cora	0.2	0.0	0.2	0.0001	Addition	2	[8,1]	[8,7]	1	0.005
Citeseer	0.6	0.2	0.6	0.0	Addition	2	[8,1]	[8,6]	1	0.005
PubMed	0.0	0.0	0.0	0.0	Addition	2	[8,8]	[8,3]	1	0.01
PPI	0.0	0.0	0.0	0.0	Product	3	[4,4,6]	[256,256,121]	2	0.005

Table 2: Common Hyperparameters of GATs and HopGAT on different tasks. Column #L, #BS and #LR denote the number of layers, batch size and learning rate, respectively.

Table 3: Hyperparameters of GCNs and ConfGCNs in transductive tasks.

	GCN			ConfGCN
	Cora	Citeseer	PubMed	Cora	Citeseer	Pubmed
dp	0.1	0.4	0.0	0.8	0.3	0.0
$L_{2}$	1e-4	0.0	0.0	0.01	0.05	0.0

We applied dropout [37], skip connections [38] and $L_{2}$ regularization techniques to alleviate over-fitting. For each layer in the HopGAT model, we applied three dropout units: when receiving the updated node representation from the previous layer, archiving the normalized attention coefficients in Equation 2, and obtaining the transformed node representation in Equation 2. We denote these dropout rates as $dp_{1}$ , $dp_{2}$ and $dp_{3}$ , respectively. The skip connections were employed across the intermediate layers when the number of layers was greater than two. An exponential linear unit (ELU) [39] instead of a $\sigma$ function in Equation 2 is implemented when the computation does not occur in the last layer. We applied a single-layer feedforward neural network when the dimension of the inputs is not equal to the number of features in the last layer. We fixed the number of output units in the single-layer feedforward neural networks to 1, i.e., $a_{c}$ , $a_{n}$ and $a_{he}$ in Equation 7. The common parameters of GATs and HopGATs are presented in detail in Table 2. We mainly adjusted the dropout rate and $L_{2}$ regularization for GCN and ConfGCN, which are listed in Table 3. The coefficient of $L_{smooth}$ in ConfGCN’s objective function was set to 1.0 in the PubMed dataset. The other hyperparameters not mentioned here were all derived from their original publications, i.e., GCN [19], ConfGCN [28] and GraphSAGEs [36].

We used the Adam SGD optimizer [40]. An early stopping strategy is applied to the loss for the node classification and also the accuracy (or micro-F1) on the validation nodes with a patience of 100 epochs in all experiments. For the learning strategy, the initial temperature is set to 100, and the lowest temperature is 1. The decay rate for the Cora dataset is 0.95, and it is 0.85 for the other datasets. The saturation gamma is 0.25. For the sampling strategy, we used different percentages to sample node pairs - 0.0003, 0.0005 and 0.0001 for Cora, Citeseer and PubMed, respectively.

5.3 Results

5.3.1 Inductive Task

	GCNagg	Mean	Meanpool	Maxpool	Seq	GAT	Hop-GAT	Imp
PPI(100%)	51.6 $\pm$ 0.5	58.0 $\pm$ 0.9	59.0 $\pm$ 0.5	60.2 $\pm$ 0.7	61.2 $\pm$ 0.5	97.3 $\pm$ 0.2	98.5 $\pm$ 0.1	+1.2
PPI(80%)	45.3 $\pm$ 0.7	55.6 $\pm$ 0.2	52.8 $\pm$ 1.0	52.6 $\pm$ 0.7	56.2 $\pm$ 0.6	96.8 $\pm$ 0.2	97.8 $\pm$ 0.3	+1.0
PPI(60%)	46.9 $\pm$ 0.4	54.4 $\pm$ 0.7	52.5 $\pm$ 1.4	51.0 $\pm$ 2.3	54.0 $\pm$ 0.5	95.2 $\pm$ 0.4	96.7 $\pm$ 0.6	+1.5
PPI(40%)	48.8 $\pm$ 0.8	52.9 $\pm$ 0.2	49.8 $\pm$ 0.4	48.8 $\pm$ 0.2	52.2 $\pm$ 0.9	91.5 $\pm$ 0.4	94.6 $\pm$ 1.8	+3.1
PPI(20%)	44.1 $\pm$ 0.3	50.5 $\pm$ 0.4	43.2 $\pm$ 1.1	43.6 $\pm$ 1.5	48.4 $\pm$ 0.7	83.3 $\pm$ 0.3	88.6 $\pm$ 1.2	+5.3

Table 4: F1 score of inductive tasks. Symbols Imp denotes the improvement.

The results for the inductive tasks are listed in Table 4. The improvement column records the improved values of HopGAT against GATs, which shows better performance compared to other baseline methods.

Compared with GATs, the results demonstrate that HopGATs obtain a significant gain across all datasets with different proportions of labeled nodes. Specifically, the smallest improvement is 1.0% for the 80% label rate, and the highest improvement is 5.3% for the 20% label rate. It is observed that given fewer labeled nodes, a higher performance gain is achieved, from 1.2% to 5.3% - except for a small fluctuation at the 80% label rate. The results show that the proposed model is effective in addressing the inductive task on the sparsely labeled graph network.

Another important observation from the table is that we cannot achieve a proportional performance gain by labeling additional nodes. Labeling more nodes, e.g., from 20% to 40%, the performance gains is nearly 6%. With a 40% labeled graph, the performance loss is only 3.9%, from 98.5% to 94.6%, compared to the fully labeled graph. More labeled nodes results in smaller performance gains. Therefore, if necessary, we should balance the cost of labeling more nodes and performance gain.

5.3.2 Transductive Tasks

	GCN	Conf-GCN	GAT	Hop-GAT	Improvement
Cora (100%)	87.2 $\pm$ 0.4	87.5 $\pm$ 0.4	88.1 $\pm$ 0.3	88.1 $\pm$ 0.4	+ 0.0
Cora (80%)	86.8 $\pm$ 0.2	87.3 $\pm$ 0.2	86.8 $\pm$ 0.2	87.3 $\pm$ 0.4	+ 0.5
Cora (60%)	85.8 $\pm$ 0.2	86.5 $\pm$ 0.4	86.5 $\pm$ 0.2	87.1 $\pm$ 0.1	+ 0.6
Cora (40%)	84.5 $\pm$ 0.3	86.1 $\pm$ 0.1	86.0 $\pm$ 0.4	86.5 $\pm$ 0.3	+ 0.5
Cora (20%)	82.4 $\pm$ 0.3	83.0 $\pm$ 0.2	83.1 $\pm$ 0.3	83.6 $\pm$ 0.2	+ 0.5
Citeseer (100%)	78.8 $\pm$ 0.2	77.5 $\pm$ 0.2	78.4 $\pm$ 0.9	79.5 $\pm$ 0.3	+ 1.1
Citeseer (80%)	78.0 $\pm$ 0.3	76.8 $\pm$ 0.4	77.4 $\pm$ 0.9	78.1 $\pm$ 0.4	+ 0.7
Citeseer (60%)	76.9 $\pm$ 0.5	76.1 $\pm$ 0.7	77.0 $\pm$ 0.3	77.8 $\pm$ 0.5	+ 0.8
Citeseer (40%)	75.3 $\pm$ 0.3	74.9 $\pm$ 0.5	75.5 $\pm$ 0.7	76.2 $\pm$ 0.4	+ 0.7
Citeseer (20%)	72.8 $\pm$ 0.2	74.3 $\pm$ 0.4	73.9 $\pm$ 0.3	74.3 $\pm$ 0.1	+ 0.4
PubMed (100%)	87.3 $\pm$ 0.1	85.9 $\pm$ 0.3	87.5 $\pm$ 0.7	88.9 $\pm$ 0.2	+ 2.4
PubMed (80%)	87.5 $\pm$ 0.2	86.1 $\pm$ 0.4	87.5 $\pm$ 0.3	88.7 $\pm$ 0.6	+ 1.2
PubMed (60%)	86.9 $\pm$ 0.2	86.0 $\pm$ 0.4	86.9 $\pm$ 0.2	88.3 $\pm$ 0.2	+ 1.4
PubMed (40%)	86.5 $\pm$ 0.1	86.0 $\pm$ 0.4	86.0 $\pm$ 0.4	87.6 $\pm$ 0.2	+ 1.6
PubMed (20%)	86.5 $\pm$ 0.2	85.1 $\pm$ 0.3	86.2 $\pm$ 0.2	87.3 $\pm$ 0.4	+ 1.1

Table 5: Accuracy of transductive tasks.

The results of all transductive tasks are shown in Table 5. Compared with GATs, the maximum improvement for Cora, Citeseer and PubMed are 0.6%, 1.1% and 2.4% respectively. The minimum improvements are 0.0%, 0.4% and 1.1% respectively. This proves the effectiveness of our model on transductive tasks.

We also investigate the improved average performance on individual datasets. The improved average accuracy is 0.42%, 0.74% and 1.54% on the Cora, Citeseer and PubMed datasets, respectively. These datasets include 2708, 3327, and 19717 nodes, respectively. This shows that the larger the graphs are, the more benefit that the proposed model can obtain.

We can observe that the greater the number of labeled nodes, the higher the performance gain obtained on Citeseer and PubMed. This is different from the inductive task, in which fewer labeled nodes results in higher performance gains. This could be caused by the large difference between the mechanisms of inductive and transductive tasks. In the inductive tasks, validation and test nodes are completely unseen in the entire training process, whereas they directly participate in the training process in the transductive tasks. This means that with the HopGAT model, the training nodes could learn more correlations from the validation/test nodes in the transductive tasks, even when the test nodes are not labeled. In addition, the smaller the dataset is, the stronger the randomness during the sampling, which can cause a fluctuation of the performance gains in certain cases.

5.3.3 Effectiveness of Supervised Attention Coefficients

We investigate the effectiveness of the supervised attention coefficients. We thus generate a visualization of the attention coefficients produced by the proposed HopGAT model. We trained on the training set of PubMed. We record the distribution of the produced attention coefficients during the different training epochs. Figure 5 (a) comes from a head of the 1st layer, and Figure 5 (b) is from the 2nd layer. Similar to Figure 3, the horizontal axis shows the value of the attention coefficients, and the vertical axis records the occurrence number of the coefficients. Compared to Figure 3, there are three clusters that correspond to the attention coefficients for 0, 1 and greater than 1 hop value from left to right.

Comparing Figure 5 (a) to Figure 5 (b), we noticed that the attention coefficients inside the 2nd layer have more clear boundaries among different hop values than those in the 1st layer. This is consistent with the expectation that the signal received from the deeper layer is much stronger than the first layers due to back propagation.

5.3.4 Effectiveness of Learning Strategy

In this section, we evaluate the proposed learning strategy. We train the HopGAT model on PPI with a 100% label rate and the hyperparameters mentioned above. Then, we record the entire training process as a learning curve of the loss for the node classification, attention coefficients, and micro-averaged $F_{1}$ score. We also visualize the changes of $\gamma$ with increasing training epochs.

The value of $\gamma$ is used to balance the node classification and attention supervision loss functions. As shown in Figure 6 (a), at the early stage of training, i.e., the first 100 steps, more powerful supervision on the attention was imposed due to a larger $\gamma$ . Once $\mathcal{L}_{att}$ is relatively stable and when reasonable attention coefficients are established, the major energy of the learning strategy turns into $\mathcal{L}_{cls}$ alternatively, as shown in Figure 6 (b) and 6 (c). This is defined as mid-term - from 100 to 1,700 steps.

At the tail of the training process, from epoch 1,700 to 2600, we observed that the learning curve was reasonably smooth and that $\gamma$ was stable and small. $1-\gamma$ is a relatively large value, above 0.8 (one subtracts the value in Figure 6 (a)), and the node classification loss thus draws greater attention from the optimizer. The optimizer adjusts the gradient almost based on the $\mathcal{L}_{cls}$ and therefore may neglect the subsequent impact on $\mathcal{L}_{att}$ , which could result in a sharp increase in $\mathcal{L}_{att}$ . As shown in Figure 6 (c), we captured these fluctuations four times. We therefore introduce the saturation parameter $\gamma_{str}$ to resist the sudden increase in $\gamma$ . In this way, the fluctuations are handled appropriately, and better performance is achieved. Importantly, this type of fluctuation only occurs at the end of the training process, and it is not a necessary appearance for each training procedure.

In Figure 5, another interesting point is that during the first 40 epochs, $\gamma$ is relatively large, and the HopGAT thus pays more attention to the adjustment of $\mathcal{L}_{att}$ and impacts the distribution of the attention coefficients. We can observe the emergence of the three separate clusters in this phase. After the 40 epochs, $\gamma$ starts to be restricted, and the boundaries of the clusters become more clear. This provides evidence that once reasonable supervision of the attention coefficients in the early phase is applied, the subsequent learning of the node classification can be jointly performed.

6 Conclusion and Future Work

This paper proposes a hop-aware attention supervision mechanism for the node classification task. Different from the previous works, we consider the influence of the hop values between a center and its neighbor nodes. Furthermore we jointly supervise the hop-aware attention coefficients and node classification error in the loss function, by which the loss function could be trained from more information of the context nodes. This method achieves state-of-the-art classification performance. In particular, it seems more effective for the inductive task in a graph with very few labeled nodes.

In the future, there are two interesting works can be done: (1) exploring a general hop-aware model which not only performs on node classification but also on link prediction task. (2) exploring the application of hop value and attention supervision mechanism on the heterogeneous graph network.

References

[1] J. Atwood, D. Towsley, Diffusion-convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
[2] X. Yue, Z. Wang, J. Huang, S. Parthasarathy, S. Moosavinasab, Y. Huang, S. M. Lin, W. Zhang, P. Zhang, H. Sun, Graph embedding on biomedical networks: methods, applications and evaluations, BioinformaticsBtz718 (10 2019). doi:10.1093/bioinformatics/btz718.
[3] Z. Zhang, D. Chen, Z. Wang, H. Li, L. Bai, E. R. Hancock, Depth-based subgraph convolutional auto-encoder for network representation learning, Pattern Recognition 90 (2019) 363–376.
[4] Z. Zhang, D. Chen, J. Wang, L. Bai, E. R. Hancock, Quantum-based subgraph convolutional neural networks, Pattern Recognition 88 (2019) 38–49.
[5] Y. Li, L. Guo, Z. Zhou, Towards safe weakly supervised learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019) 1–1doi:10.1109/TPAMI.2019.2922396.
[6] T.-P. Nguyen, T.-B. Ho, Detecting disease genes based on semi-supervised learning and protein–protein interaction networks, Artificial Intelligence in Medicine 54 (1) (2012) 63–71.
[7] F. Dornaika, L. Weng, Sparse graphs with smoothness constraints: application to dimensionality reduction and semi-supervised classification, Pattern Recognition 95 (2019) 285–295.
[8] Z. Yang, W. Cohen, R. Salakhudinov, Revisiting semi-supervised learning with graph embeddings, in: Proceedings of the International Conference on Machine Learning, 2016, pp. 40–48.
[9] M. Kim, D. gi Lee, H. Shin, Semi-supervised learning for hierarchically structured networks, Pattern Recognition 95 (2019) 191 – 200.
[10] Q. Zhang, J. Chang, G. Meng, S. Xu, S. Xiang, C. Pan, Learning graph structure via graph convolutional networks, Pattern Recognition 95 (2019) 308–318.
[11] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[12] X. Li, X. Yan, Q. Gu, H. Zhou, D. Wu, J. Xu, Deepchemstable: Chemical stability prediction with an attention-based graph convolution network, Journal of Chemical Information and Modeling 59 (3) (2019) 1044–1049.
[13] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, in: Proceedings of the International Conference on Learning Representations, 2018.
[14] X. Fan, M. Gong, Y. Xie, F. Jiang, H. Li, Structured self-attention architecture for graph-level representation learning, Pattern Recognition 100 (2020) 107084.
[15] Z. Liu, W. Liu, P.-Y. Chen, C. Zhuang, C. Song, hpgat: High-order proximity informed graph attention network, IEEE Access 7 (2019) 123002–123012.
[16] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, E. Koh, Attention models in graphs: a survey, ACM Transactions on Knowledge Discovery from Data 13 (6) (2019) 62.
[17] S. Zhang, H. Tong, J. Xu, R. Maciejewski, Graph convolutional networks: algorithms, applications and open challenges, in: Proceedings of the International Conference on Computational Social Networks, Springer, 2018, pp. 79–91.
[18] B. Wu, Y. Liu, B. Lang, L. Huang, Dgcnn: Disordered graph convolutional neural network based on the gaussian mixture model, Neurocomputing 321 (2018) 346–356.
[19] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceedings of the International Conference on Learning Representations, 2017.
[20] J. Bruna, W. Zaremba, A. Szlam, Y. Lecun, Spectral networks and locally connected networks on graphs, in: Proceedings of the International Conference on Learning Representations, 2014.
[21] Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang, M. Zheng, Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism, Journal of Medicinal ChemistryPMID: 31408336 (2019). doi:10.1021/acs.jmedchem.9b00959.
[22] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.
[23] B. Knyazev, G. W. Taylor, M. Amer, Understanding attention and generalization in graph neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2019, pp. 4204–4214.
[24] M.-A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon, Multiple instance learning: a survey of problem characteristics and applications, Pattern Recognition 77 (2018) 329–353.
[25] B. Frenay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Transactions on Neural Networks and Learning Systems (2014) 845–869doi:10.1109/TNNLS.2013.2292894.
[26] Y. Shi, M. Lei, H. Yang, L. Niu, Diffusion network embedding, Pattern Recognition 88 (2019) 518–531.
[27] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Proceedings of the International Conference on Learning Representations, 2019.
[28] S. Vashishth, P. Yadav, M. Bhandari, P. Talukdar, Confidence-based graph convolutional networks for semi-supervised learning, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019, pp. 1792–1801.
[29] J. B. Lee, R. Rossi, X. Kong, Graph classification using structural attention, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2018, pp. 1666–1674.
[30] Y. Wang, Y. Yao, H. Tong, F. Xu, J. Lu, A brief review of network embedding, Big Data Mining and Analytics 2 (1) (2018) 35–47.
[31] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 464–468.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] U. S. Shanthamallu, J. J. Thiagarajan, H. Song, A. Spanias, Gramme: Semisupervised learning using multilayered graph attention models, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–12doi:10.1109/TNNLS.2019.2948797.
[34] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, J. Sun, Gram: graph-based attention model for healthcare representation learning, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 787–795.
[35] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (3) (2008) 93–93.
[36] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
[38] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[39] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), in: Proceedings of the International Conference on Learning Representations, 2016.
[40] D. P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2015.