
Exploring High-Order Structure for Robust Graph Structure Learning

Guangqian Yang1, Yibing Zhan3, Jinlong Li1, Baosheng Yu2, Liu Liu2, Fengxiang He3
1University of Science and Technology of China
2The University of Sydney
3JD Explore Academy
   JD.com
Abstract

Recent studies show that Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, i.e., an imperceptible structure perturbation can fool GNNs into making wrong predictions. Some works exploit specific properties of clean graphs, such as feature smoothness, to defend against such attacks, but the analysis of these properties remains limited. In this paper, we analyze adversarial attacks on graphs from the perspective of feature smoothness, which further leads to an efficient new adversarial defense algorithm for GNNs. We observe that the high-order graph structure acts as a smoother filter for processing graph structures. Intuitively, the high-order graph structure counts the number of paths between nodes, where a larger number indicates a closer connection, so it naturally helps defend against adversarial perturbations. We further propose a novel algorithm that incorporates this high-order structural information into graph structure learning. We perform experiments on three popular benchmark datasets, Cora, Citeseer, and Polblogs. Extensive experiments demonstrate the effectiveness of our method for defending against graph adversarial attacks.

1 Introduction

Graph Neural Networks (GNNs) [16, 13, 28] play an important role in deep learning-based graph representation learning. By extending the convolution operation to graph-structured data, GNNs show excellent performance in many applications, such as node classification [31, 16], link prediction [12, 22], and graph classification [20, 10].

Despite the success of GNNs in graph structure learning, recent studies have shown that GNNs are vulnerable to adversarial attacks, i.e., a small perturbation of the graph can lead to a drastic degradation of performance [27, 11]. Specifically, by injecting imperceptible perturbations into node features or the graph structure [5, 36, 37, 30], an attacker can easily manipulate the predictions of GNNs. Therefore, the robustness of GNNs has received increasing attention from the community. Existing methods for improving the robustness of GNNs can be divided into the following categories: adversarial training, robustness certification, and structure learning. This paper focuses on structure learning.

The low-order structure of the graph is vulnerable to adversarial attacks, and structure learning-based methods aim to mitigate the impact of such attacks and help GNNs learn the true distribution of graph structures [33, 15, 32]. Compared to the initial structure, the high-order graph structure, which is reflected in the powers of the adjacency matrix, is naturally more robust: although adversarial attacks may perturb the low-order graph structure, they can hardly affect the high-order graph structure, since the attacks change only a small fraction of the paths within the high-order graphs. As a result, the high-order structural information can naturally act as a denoising filter that makes the low-order graph features smoother.

Inspired by this, we propose a novel method that incorporates high-order structural information into robust graph structure learning. We first analyze graph adversarial attacks from the perspective of feature smoothness, which is defined via the distance between connected nodes, and show both theoretically and empirically that adversarial structure perturbations essentially increase the local feature smoothness. We then devise a novel method that exploits high-order structural information for graph structure learning. Intuitively, the high-order adjacency matrix reflects the number of common neighbors between two nodes, which can guide structure learning. Although an adversarial attack may perturb some edges, it is unlikely to change the overall distribution of the high-order graph much, and thus the high-order structure can be used to alleviate the influence of adversarial perturbations. Furthermore, we theoretically show that the high-order graph is effective at smoothing the graph and eliminating the influence of adversarial perturbations. Our main contributions are as follows:

  • We analyze the graph adversarial attack from the perspective of feature smoothness, and show that the high-order graph structure is a smoother filter whose overall distribution is less influenced by the adversarial attack.

  • We explore the high-order graph structure to alleviate the influence of adversarial attacks, formulated as a normalized adjacency matrix regularization that guides graph structure learning.

  • We conduct extensive experiments on several popular datasets using different types of attacks, and analyze the performance and sensitivity under different attack settings. Experimental results demonstrate that our method is a universal defense against graph adversarial attacks.

2 Related Work

In this section, we briefly review recent works on graph neural networks and on adversarial attacks and defenses for graph neural networks.

2.1 Graph Neural Networks.

Graph Neural Networks are deep learning-based methods for processing graph-structured data and have shown excellent performance in many realistic tasks [31, 12, 20]. Generally, GNNs can be classified into spectral and spatial methods. Spectral methods learn graph filters based on graph spectral theory. The work in [9] first generalizes Convolutional Neural Networks to graph signals based on hierarchical clustering and the graph Laplacian. ChebNet [7] utilizes Chebyshev polynomials as fast localized spectral filters for computational efficiency. The Graph Convolution Network [16] exploits a localized first-order approximation of spectral filters to further simplify the filtering operation. Spatial methods directly propagate information via message passing in the spatial domain. GraphSAGE [13] proposes an inductive learning framework that generates embeddings by sampling and aggregating neighbor information. The Graph Attention Network [28] applies an attention mechanism on graphs to learn different aggregation weights for neighbors based on their dependencies. Many more state-of-the-art methods have been proposed recently.

2.2 Adversarial Attack and Defense.

Deep learning models have been shown to be vulnerable to adversarial perturbations [27, 11], and Graph Neural Networks are no exception. A large amount of research has recently focused on adversarial attacks and defenses [34, 24, 14]. Generally, attack models can be categorized into black-box and white-box attacks according to the information the attacker can access about the model. Typical attack methods are as follows: Nettack [36] derives an incremental attack method based on approximation; RL-S2V [5] uses Q-learning to sequentially add or drop edges from the graph; NIPA [25] proposes a more practical node injection attack; GF-Attack [4] aims to attack the graph filter of given models; Metattack [37] treats the graph structure matrix as a hyperparameter to learn; IG-JSMA [29] introduces methods based on integrated gradients [26]; and PGD [30] utilizes projected gradient descent from a first-order optimization perspective.

On the defense side, methods aim to eliminate the influence of adversarial attacks as much as possible. Based on their design principles, defense models can generally be divided into certificate methods, adversarial training, and structure learning. The general idea of certificate methods is to ensure that the prediction of the model does not change within a certified perturbation radius; typical methods include attribute-oriented [38], structure-oriented [2], sparsity-aware [3], and collective certificates [23]. Adversarial training directly improves the robustness of the model by training with adversarial examples; typical methods include RAWEN [8] and DWNS [6]. Structure learning methods aim to learn the graph structure from the perturbed graph; typical methods include NeuralSparse [33], which learns to remove potentially task-irrelevant edges from input graphs, Pro-GNN [15], which exploits the sparsity, low-rank, and feature smoothness properties of graphs, and GNNGuard [32], which exploits the relationship between the graph structure and node features to mitigate the negative effects of attacks.

Figure 1: The workflow of our method. We use the initial (perturbed) input graph to learn an estimated graph and to obtain the high-order graph. The high-order graph can be viewed as a weighted graph, where nodes with larger connectivity have stronger links, and it is further used to guide the graph structure learning process.

3 Methods

In this section, we first introduce the notations used in this paper. We then analyze how the high-order structure helps defend against adversarial attacks on graphs. Lastly, we introduce the proposed high-order structure learning.

3.1 Preliminaries

We introduce the notations used in this paper as follows. We use bold upper-case letters for matrices, bold lower-case letters for vectors, and regular letters for scalars. Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges, let $N=|\mathcal{V}|$ denote the number of nodes. Each node $v_i\in\mathcal{V}$ has an attribute feature vector $\mathbf{x}_i\in\mathbb{R}^{F}$, and all nodes together form an attribute feature matrix $\mathbf{X}\in\mathbb{R}^{N\times F}$, where $F$ denotes the attribute dimension. The adjacency matrix of $\mathcal{G}$ is $\mathbf{A}\in\mathbb{R}^{N\times N}$, where $\mathbf{A}_{ij}=1$ indicates an edge between nodes $v_i$ and $v_j$, and $\mathbf{A}_{ij}=0$ indicates no edge. For the node classification task, a GNN can be seen as a function with parameters $\theta$ that maps the nodes into different classes, using the adjacency matrix $\mathbf{A}$ and the feature matrix $\mathbf{X}$ as inputs. A general two-layer graph convolutional network can be formulated as $f_{\theta}(\mathbf{A},\mathbf{X})=\mathrm{softmax}(\mathbf{\hat{A}}\,\sigma(\mathbf{\hat{A}}\mathbf{X}\mathbf{W}_{1})\mathbf{W}_{2})$, where $\mathbf{\hat{A}}=\mathbf{\tilde{D}}^{-1/2}\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-1/2}$, $\mathbf{\tilde{A}}=\mathbf{A}+\mathbf{I}$, $\mathbf{\tilde{D}}$ is the diagonal degree matrix with $\mathbf{\tilde{D}}_{ii}=\sum_{j}\mathbf{\tilde{A}}_{ij}$, $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are trainable weight matrices, and $\sigma$ is an activation function such as ReLU [19].

Given an adjacency matrix $\mathbf{A}$ poisoned by an adversarial attack, our goal is to learn an estimated adjacency matrix $\mathbf{S}\in\mathbb{R}^{N\times N}$ that alleviates the influence of the attack, together with the GNN parameters $\theta$ for downstream tasks. Consistent with the previous notation, we use $\mathbf{\hat{S}}=\mathbf{\tilde{D}}^{-1/2}\mathbf{\tilde{S}}\mathbf{\tilde{D}}^{-1/2}$ to denote the normalized estimated adjacency matrix. The Laplacian of the graph is defined as $\mathbf{L}=\mathbf{\tilde{D}}-\mathbf{\tilde{A}}$.
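To make the preliminaries concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the normalized adjacency matrix and the two-layer GCN forward pass defined above; the graph, features, and weight matrices are random placeholders.

import numpy as np

def normalize_adj(A):
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A, X, W1, W2):
    """f(A, X) = softmax(A_hat ReLU(A_hat X W1) W2), applied row-wise."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W1, 0.0)          # ReLU activation
    logits = A_hat @ H @ W2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, F, H_dim, C = 6, 8, 4, 3                      # placeholder sizes
A = rng.integers(0, 2, size=(N, N))
A = np.triu(A, 1); A = A + A.T                   # symmetric adjacency, no self-loops
X = rng.normal(size=(N, F))
W1 = rng.normal(scale=0.1, size=(F, H_dim))
W2 = rng.normal(scale=0.1, size=(H_dim, C))
print(gcn_forward(A, X, W1, W2).shape)           # (6, 3) class probabilities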

3.2 Adversarial Attacks Increase Smoothness

As illustrated in previous work [35], the attack loss increases only when a homophilous edge is removed from, or a heterophilous edge is added to, a node $v$. In this subsection, we analyze the adversarial attack problem from the perspective of feature smoothness, and show both theoretically and empirically that the above-mentioned attacks actually increase the feature smoothness of the graph.

Similar to [15], we define the average graph feature smoothness as follows:

s=\frac{1}{2\|\mathbf{\tilde{A}}\|_{1}}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbf{\tilde{A}}_{ij}\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=\frac{1}{\|\mathbf{\tilde{A}}\|_{1}}\mathrm{tr}(\mathbf{X}^{T}\mathbf{L}\mathbf{X}), (1)

where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are the features of nodes $v_i$ and $v_j$, respectively. This formula uses the differences between connected node features to measure the overall graph smoothness, where a larger value indicates a larger smoothness value, i.e., larger feature distances between connected nodes.
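As a quick sanity check of Eq. (1), the following NumPy snippet (with a random placeholder graph and features, not the paper's data) verifies that the pairwise form and the trace form give the same value.

import numpy as np

rng = np.random.default_rng(0)
N, F = 6, 4
A = rng.integers(0, 2, size=(N, N))
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops
A_tilde = A + np.eye(N)
D_tilde = np.diag(A_tilde.sum(axis=1))
L = D_tilde - A_tilde                          # graph Laplacian (the self-loops cancel)
X = rng.normal(size=(N, F))

norm1 = np.abs(A_tilde).sum()                  # entry-wise 1-norm of A~
pairwise = sum(A_tilde[i, j] * np.sum((X[i] - X[j]) ** 2)
               for i in range(N) for j in range(N)) / (2 * norm1)
trace = np.trace(X.T @ L @ X) / norm1
print(np.isclose(pairwise, trace))             # True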

Proposition 1

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ be a graph with adjacency matrix $\mathbf{A}$ and feature matrix $\mathbf{X}$. The feature of node $v_i$ is $\mathbf{x}_{i}=p\cdot \mathrm{onehot}(y_{i})+\frac{1-p}{|\mathcal{Y}|}\cdot\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector, $|\mathcal{Y}|$ is the number of classes, and $p$ controls the strength of the class signal (and $1-p$ that of the uniform noise). We further assume that for each node a fraction $h$ of its neighbors belong to the same class (homophilous edges), where $h>\frac{1}{|\mathcal{Y}|}$ holds. The edge set $\mathcal{E}$ can be divided into two sets $\mathcal{E}_{homo}$ and $\mathcal{E}_{hetero}$, with $\mathcal{E}_{homo}\cup\mathcal{E}_{hetero}=\mathcal{E}$ and $\mathcal{E}_{homo}\cap\mathcal{E}_{hetero}=\emptyset$. Then the adversarial attack increases the feature smoothness.

Proof 3.1.

We consider a targeted attack on node $v_i$, and define the local smoothness as

s_{i}=\frac{1}{\|\mathbf{\tilde{A}}_{i}\|_{1}}\sum_{j=1}^{N}\mathbf{\tilde{A}}_{ij}\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=\frac{1}{\tilde{d}_{i}}\left(\tilde{d}_{i}\cdot h\cdot s_{1}+\tilde{d}_{i}\cdot(1-h)\cdot s_{2}\right), (2)

where $\tilde{d}_{i}=d_{i}+1$ is the degree of node $v_i$ including the self-loop, $s_{1}=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=0$ for $e_{ij}\in\mathcal{E}_{homo}$, and $s_{2}=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=2p^{2}$ for $e_{ij}\in\mathcal{E}_{hetero}$. We now add an adversarial perturbation to the graph. Without loss of generality, we modify a single edge and examine how the smoothness changes.

  • Add a heterophilous edge. The local feature smoothness of node $v_i$ becomes:

    s^{\prime}_{i}=\frac{1}{\tilde{d}_{i}+1}\left(\tilde{d}_{i}\cdot h\cdot s_{1}+(\tilde{d}_{i}\cdot(1-h)+1)\cdot s_{2}\right)=\left(1-\frac{\tilde{d}_{i}}{\tilde{d}_{i}+1}\cdot h\right)\cdot 2p^{2}>\frac{1}{\tilde{d}_{i}}\left(\tilde{d}_{i}\cdot h\cdot s_{1}+\tilde{d}_{i}\cdot(1-h)\cdot s_{2}\right)=s_{i}.

  • Delete a homophilous edge. The local feature smoothness of node $v_i$ becomes:

    s^{\prime}_{i}=\frac{1}{\tilde{d}_{i}-1}\left((\tilde{d}_{i}\cdot h-1)\cdot s_{1}+\tilde{d}_{i}\cdot(1-h)\cdot s_{2}\right)=\frac{\tilde{d}_{i}}{\tilde{d}_{i}-1}\cdot(1-h)\cdot 2p^{2}>\frac{1}{\tilde{d}_{i}}\left(\tilde{d}_{i}\cdot h\cdot s_{1}+\tilde{d}_{i}\cdot(1-h)\cdot s_{2}\right)=s_{i}.

Therefore, either adding a heterophilous edge or deleting a homophilous edge increases the local feature smoothness.
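The following toy NumPy check (with hypothetical values of $p$, $h$, and the node degree, not taken from the paper) illustrates the proposition: both perturbations increase the local smoothness of the attacked node.

import numpy as np

def feature(label, num_classes, p):
    x = np.full(num_classes, (1 - p) / num_classes)
    x[label] += p
    return x

num_classes, p, h, d = 4, 0.8, 0.75, 8            # hypothetical setting
xi = feature(0, num_classes, p)
n_homo = int(d * h)                               # neighbors sharing label 0
neighbors = [feature(0, num_classes, p)] * n_homo + \
            [feature(1, num_classes, p)] * (d - n_homo)

def local_smoothness(xi, neighbors):
    # s_i = (1 / d~_i) * sum_j A~_ij ||x_i - x_j||^2, with the self-loop term (= 0) included
    diffs = [np.sum((xi - xj) ** 2) for xj in neighbors] + [0.0]
    return sum(diffs) / (len(neighbors) + 1)

s_before = local_smoothness(xi, neighbors)
s_add = local_smoothness(xi, neighbors + [feature(2, num_classes, p)])  # add a heterophilous edge
s_del = local_smoothness(xi, neighbors[1:])                             # delete a homophilous edge
print(s_before, s_add, s_del)   # both perturbed values exceed s_before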

Figure 2: The influence of different perturbation rates on the smoothness.

3.2.1 Empirical Evidence.

Intuitively, adding a heterophilous edge causes the feature of node $v_i$ to be aggregated with features from a different class, which moves its representation towards that of another label and causes misclassification, while deleting a homophilous edge removes same-class information from the aggregation. To validate our analysis, we conduct experiments and compute the graph smoothness as $\frac{1}{\|\mathbf{A}\|_{1}}\mathrm{tr}(\mathbf{X}^{T}\mathbf{L}\mathbf{X})$, attacking several datasets with different perturbation rates. As shown in Figure 2, the feature smoothness increases as the perturbation rate increases, which is consistent with our proposition. This phenomenon also indicates that the adversarial attack essentially increases the local smoothness of the graph to fool the model into making wrong predictions.
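The following NumPy simulation sketches the same trend on a synthetic labeled graph (random placeholder data, not the datasets used in the paper): injecting heterophilous edges at increasing rates raises the smoothness value of Eq. (1).

import numpy as np

rng = np.random.default_rng(0)
N, num_classes, p = 100, 4, 0.8
labels = rng.integers(0, num_classes, size=N)
X = np.full((N, num_classes), (1 - p) / num_classes)
X[np.arange(N), labels] += p

# Start from a purely homophilous random graph (only same-label edges).
A = np.zeros((N, N))
for i in range(N):
    same = np.where(labels == labels[i])[0]
    for j in rng.choice(same, size=3, replace=False):
        A[i, j] = A[j, i] = 1
np.fill_diagonal(A, 0)

def smoothness(A, X):
    A_tilde = A + np.eye(len(A))
    L = np.diag(A_tilde.sum(1)) - A_tilde
    return np.trace(X.T @ L @ X) / np.abs(A_tilde).sum()

n_edges = int(A.sum() / 2)
for rate in [0.0, 0.1, 0.2, 0.3]:
    A_ptb = A.copy()
    added = 0
    while added < int(rate * n_edges):
        i, j = rng.integers(0, N, size=2)
        if labels[i] != labels[j] and A_ptb[i, j] == 0:   # heterophilous edge
            A_ptb[i, j] = A_ptb[j, i] = 1
            added += 1
    print(rate, round(smoothness(A_ptb, X), 4))           # the value grows with the rate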

3.3 High-Order Graph is a Smooth Filter

The high-order graph corresponds to the powers of the normalized adjacency matrix. Intuitively, the perturbation of the high-order structure is much smaller than that of the initial graph structure. The normalized graph Laplacian is

\mathbf{\hat{L}}=\mathbf{\tilde{D}}^{-1/2}(\mathbf{\tilde{D}}-\mathbf{\tilde{A}})\mathbf{\tilde{D}}^{-1/2}=\mathbf{I}-\mathbf{\hat{A}}, (3)

which is symmetric and positive semi-definite.

Theorem 3.2.

Let $\mathbf{\hat{A}}=\mathbf{\tilde{D}}^{-1/2}\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-1/2}$ be the normalized adjacency matrix with self-loops, and denote its eigenvalues as $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{n}$. Then every eigenvalue satisfies $|\lambda|\leq 1$, and hence $|\lambda^{k}|\leq|\lambda|$ for any $k\geq 1$.

Proof 3.3.

Let $\mathbf{u}$ be any real vector of unit norm. Then

\mathbf{u}^{T}\mathbf{\hat{L}}\mathbf{u}=\frac{1}{2}\sum_{i,j=1}^{N}\mathbf{\tilde{A}}_{ij}\left(\tilde{d}_{i}^{-1/2}u_{i}-\tilde{d}_{j}^{-1/2}u_{j}\right)^{2}\leq\sum_{i,j=1}^{N}\mathbf{\tilde{A}}_{ij}\left(\tilde{d}_{i}^{-1}u_{i}^{2}+\tilde{d}_{j}^{-1}u_{j}^{2}\right)=2\sum_{i=1}^{N}u_{i}^{2}\,\tilde{d}_{i}^{-1}\sum_{j=1}^{N}\mathbf{\tilde{A}}_{ij}=2\sum_{i=1}^{N}u_{i}^{2}=2.

Therefore, the largest eigenvalue of $\mathbf{\hat{L}}$ is upper bounded by $2$; since $\mathbf{\hat{L}}$ is also positive semi-definite, its eigenvalues lie in $[0,2]$, so the eigenvalues of $\mathbf{\hat{A}}=\mathbf{I}-\mathbf{\hat{L}}$ satisfy $|\lambda|\leq 1$. Note that $0$ is the smallest eigenvalue of $\mathbf{\hat{L}}$, so $1$ is the largest eigenvalue of $\mathbf{\hat{A}}$, and for all eigenvalues we have $|\lambda^{k}|\leq|\lambda|$ for any $k\geq 1$.

From Theorem 3.2, we know that when taking powers $K>1$, all eigenvalues except the largest one shrink in magnitude, so $\mathbf{\hat{A}}^{K}$ acts as a low-pass filter: it attenuates the high-frequency components of the graph signal, i.e., the local graph features become smoother.
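The following NumPy check (random placeholder graph) illustrates Theorem 3.2: the eigenvalues of the normalized adjacency matrix lie in $[-1, 1]$, so raising it to the $k$-th power shrinks every eigenvalue except the leading one, i.e., the filter becomes increasingly low-pass.

import numpy as np

rng = np.random.default_rng(0)
N = 20
A = rng.integers(0, 2, size=(N, N))
A = np.triu(A, 1); A = A + A.T
A_tilde = A + np.eye(N)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))        # D~^{-1/2} A~ D~^{-1/2}

eigvals = np.linalg.eigvalsh(A_hat)              # symmetric, so the spectrum is real
print(eigvals.min() >= -1 - 1e-9, eigvals.max() <= 1 + 1e-9)   # True True
for k in [1, 2, 3]:
    # magnitudes of the spectrum of A_hat^k; all non-leading eigenvalues decay with k
    print(k, np.sort(np.abs(eigvals ** k))[::-1][:5].round(3))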

3.4 High-Order Structure Learning

In this subsection, we explore the high-order normalized adjacency matrix for graph structure learning, which naturally preserves high-order structural similarity and smooths the graph. The high-order structure learning is formulated as:

\mathcal{L}_{s}=\eta_{1}\|\mathbf{\hat{S}}-\mathbf{\hat{A}}\|_{F}^{2}+\sum_{k=2}^{K}\eta_{k}\|\mathbf{\hat{S}}-\mathbf{\hat{A}}^{k}\|_{F}^{2}, (4)

where the first term $\eta_{1}\|\mathbf{\hat{S}}-\mathbf{\hat{A}}\|_{F}^{2}$ minimizes the difference between the learned adjacency matrix and the perturbed adjacency matrix, and the high-order terms further regularize the structure learning. One notable difference between the second- and third-order terms is the sign of the eigenvalues: the second-order filter has only non-negative eigenvalues. Although the filter becomes smoother as $k$ grows, a large $k$ also leads to the over-smoothing problem [17], where representations become too smooth to be distinguishable; in practice, we use $K=2$ or $K=3$. From a geometric view, the entry $\mathbf{\hat{A}}^{k}_{ij}$ of the high-order normalized adjacency matrix reflects the $k$-hop connectivity between nodes $v_i$ and $v_j$, so node pairs with larger connectivity are assigned larger weights. Intuitively, although an adversarial attack may add or delete an edge between $v_i$ and $v_j$, it has much less influence on the distribution of the high-order adjacency powers. Therefore, exploring the high-order adjacency matrix helps alleviate the influence of adversarial attacks.
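A minimal NumPy sketch of the regularizer in Eq. (4) is given below; the perturbed graph, the candidate estimate $\mathbf{S}$, and the weights $\eta_{k}$ are hypothetical placeholders rather than values used in the paper.

import numpy as np

def normalize(M):
    M_tilde = M + np.eye(len(M))
    d = M_tilde.sum(axis=1)
    return M_tilde / np.sqrt(np.outer(d, d))

def high_order_loss(S, A, etas):
    """L_s = sum_k eta_k * ||S_hat - A_hat^k||_F^2, for k = 1..K."""
    S_hat, A_hat = normalize(S), normalize(A)
    loss, A_pow = 0.0, np.eye(len(A))
    for eta in etas:                       # etas[k-1] weights the k-th power
        A_pow = A_pow @ A_hat
        loss += eta * np.linalg.norm(S_hat - A_pow, 'fro') ** 2
    return loss

rng = np.random.default_rng(0)
N = 10
A = rng.integers(0, 2, size=(N, N)); A = np.triu(A, 1); A = A + A.T
S = A.astype(float)                        # initialize the estimate with A (Algorithm 1, line 1)
print(high_order_loss(S, A, etas=[1.0, 0.5, 0.1]))   # K = 3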

Corollary 3.4.

Let $\mathbf{L}=\mathbf{\tilde{D}}-\mathbf{\tilde{A}}$ be the graph Laplacian and $\mathbf{L}_{2}=\mathbf{\tilde{D}}-\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-1}\mathbf{\tilde{A}}$ be the second-order graph Laplacian. For an identity feature matrix $\mathbf{X}$, we have $\mathrm{tr}(\mathbf{X}^{T}(\mathbf{L}-\mathbf{L}_{2})\mathbf{X})>0$.

Proof 3.5.

According to the properties of the Laplacian matrix, we have:

\mathrm{tr}(\mathbf{L})-\mathrm{tr}(\mathbf{L}_{2})=\mathrm{tr}(\mathbf{\tilde{D}}-\mathbf{\tilde{A}})-\mathrm{tr}(\mathbf{\tilde{D}}-\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-1}\mathbf{\tilde{A}})=\mathrm{tr}(\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-1}\mathbf{\tilde{A}})-\mathrm{tr}(\mathbf{\tilde{A}})=\mathrm{tr}(\mathbf{\tilde{D}}^{-1}\mathbf{\tilde{A}}^{2})-\mathrm{tr}(\mathbf{\tilde{A}})=\sum_{i=1}^{N}\frac{d_{i}+1}{d_{i}}-N=\sum_{i=1}^{N}\frac{1}{d_{i}}>0.

Here the second-to-last equality holds because the $i$-th diagonal element of $\mathbf{\tilde{A}}^{2}$ counts the length-two paths from node $v_i$ to itself, of which there are $d_{i}+1$ in a graph with self-loops, where $d_{i}$ is the degree of node $v_i$. For an identity feature matrix $\mathbf{X}$, the matrix $\mathbf{X}^{T}(\mathbf{L}-\mathbf{L}_{2})\mathbf{X}$ is similar to $\mathbf{L}-\mathbf{L}_{2}$, so $\mathrm{tr}(\mathbf{X}^{T}(\mathbf{L}-\mathbf{L}_{2})\mathbf{X})=\mathrm{tr}(\mathbf{L}-\mathbf{L}_{2})>0$.

From the above corollary, we see that the trace of $\mathbf{X}^{T}\mathbf{L}\mathbf{X}$ is larger than that of $\mathbf{X}^{T}\mathbf{L}_{2}\mathbf{X}$, which indicates that the second-order filter smooths the local structure.

3.4.1 Overall Loss Function

Since natural graphs often exhibit sparsity and low-rank properties [15], we also add sparsity and low-rank regularization terms to the structure learning, together with a feature smoothness term when node features are available. The overall loss function is

\mathcal{L}=\alpha\|\mathbf{S}\|_{1}+\beta\|\mathbf{S}\|_{*}+\sum_{k=1}^{K}\eta_{k}\|\mathbf{\hat{S}}-\mathbf{\hat{A}}^{k}\|_{F}^{2}+\lambda\,\mathrm{tr}(\mathbf{X}^{T}\mathbf{\hat{L}}\mathbf{X})+\mathcal{L}_{GNN},

where $\mathcal{L}_{GNN}$ is the classification loss of the GNN (e.g., cross entropy) and $\|\cdot\|_{*}$ is the nuclear norm. We iteratively optimize the GNN parameters and the learned adjacency matrix $\mathbf{S}$. The optimization procedure is shown in Algorithm 1, where we use proximal operators for the $\ell_{1}$ norm and the nuclear norm [1, 21], and $P(\mathbf{S})$ is a projection function that sets entries $\mathbf{S}_{ij}<0$ to $0$ and entries $\mathbf{S}_{ij}>1$ to $1$.

Algorithm 1 Optimization Algorithm.

Input: Adjacency matrix $\mathbf{A}$, feature matrix $\mathbf{X}$, labels $\mathbf{Y}_{L}$, hyper-parameters $\alpha$, $\beta$, $\lambda$, $\tau$, $\eta_{1},\dots,\eta_{K}$, learning rates $\mu$, $\mu^{\prime}$
Output: Learned adjacency matrix $\mathbf{S}$, GNN parameters $\theta$.

1:  Initialize $\mathbf{S}\leftarrow\mathbf{A}$
2:  Randomly initialize $\theta$
3:  while stopping condition is not met do
4:     $\mathbf{S}\leftarrow\mathbf{S}-\mu\nabla_{\mathbf{S}}\left(\sum_{k=1}^{K}\eta_{k}\|\mathbf{\hat{S}}-\mathbf{\hat{A}}^{k}\|_{F}^{2}+\lambda\,\mathrm{tr}(\mathbf{X}^{T}\mathbf{\hat{L}}\mathbf{X})+\mathcal{L}_{GNN}\right)$
5:     $\mathbf{S}\leftarrow\mathbf{U}\,\mathrm{diag}\left((\sigma_{i}-\beta)_{+}\right)\mathbf{V}^{T}$, where $\mathbf{U}\,\mathrm{diag}(\sigma_{i})\,\mathbf{V}^{T}$ is the SVD of $\mathbf{S}$
6:     $\mathbf{S}\leftarrow\mathrm{sgn}(\mathbf{S})\odot(|\mathbf{S}|-\alpha)_{+}$
7:     $\mathbf{S}\leftarrow P(\mathbf{S})$
8:     for $i=1$ to $\tau$ do
9:        $\theta\leftarrow\theta-\mu^{\prime}\nabla_{\theta}\mathcal{L}_{GNN}$
10:     end for
11:  end while
12:  return $\mathbf{S}$, $\theta$
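The following NumPy sketch illustrates the proximal and projection steps of Algorithm 1 (lines 5-7). The gradient step on $\mathbf{S}$ and the inner GNN update depend on the chosen GNN and an automatic differentiation framework, so they are only indicated by placeholder comments, and all step sizes and hyper-parameter values here are hypothetical.

import numpy as np

def prox_nuclear(S, beta):
    """Line 5: singular-value soft-thresholding, the proximal operator of beta*||S||_*."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - beta, 0.0)) @ Vt

def prox_l1(S, alpha):
    """Line 6: entry-wise soft-thresholding, the proximal operator of alpha*||S||_1."""
    return np.sign(S) * np.maximum(np.abs(S) - alpha, 0.0)

def project(S):
    """Line 7: project the entries of S onto [0, 1]."""
    return np.clip(S, 0.0, 1.0)

def learn_structure(A, n_iter=50, alpha=0.01, beta=0.1, mu=0.05):
    S = A.astype(float).copy()                  # line 1: initialize S with A
    for _ in range(n_iter):
        # line 4: gradient step on the differentiable terms (high-order Frobenius
        # terms, feature smoothness, GNN loss); here only a placeholder gradient
        # pulling S back towards A is used.
        grad = 2.0 * (S - A)
        S = S - mu * grad
        S = prox_nuclear(S, beta)               # line 5
        S = prox_l1(S, alpha)                   # line 6
        S = project(S)                          # line 7
        # lines 8-10: update the GNN parameters theta on the current S
        # (omitted; requires a GNN implementation).
    return S

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(learn_structure(A).round(2))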

4 Experiments

In this section, we evaluate the effectiveness of our proposed method against different types of adversarial attacks. We compare our method with several state-of-the-art methods, and analyze the parameter sensitivity of our method.

4.1 Experimental Settings

4.1.1 Datasets

We evaluate our method on three benchmark datasets, i.e., Cora, Citeseer, and Polblogs, as shown in Table 1. Cora and Citeseer are citation networks with node features, while Polblogs is a web network dataset without node features.

Dataset  | Nodes | Edges  | Features | Classes
Cora     | 2,708 | 5,429  | 1,433    | 7
Citeseer | 3,327 | 4,732  | 3,703    | 6
Polblogs | 1,222 | 16,714 | -        | 2
Table 1: Description of datasets used for node classification.

4.1.2 Implementation Details

To validate the effectiveness of our method, we compare it with several state-of-the-art defense methods. Our experiments are conducted with the adversarial attack repository DeepRobust [18]. Following the standard semi-supervised classification setting, we randomly choose 10% of the nodes for training, 10% for validation, and 80% for testing on each graph. For each experiment, we run 10 trials and report the mean performance and variance of each method. We adopt the default parameter settings for the baseline methods, and we use K=3 for all datasets.

Dataset  | Ptb Rate | GCN      | GAT      | RGCN     | Jaccard  | ProGNN   | ElasticGNN | Ours
Cora     | 0.00     | 83.5±0.4 | 84.0±0.7 | 83.1±0.4 | 82.1±0.5 | 83.4±0.5 | *85.8±0.4  | 83.4±0.5
         | 0.05     | 76.6±0.8 | 80.4±0.7 | 77.4±0.4 | 79.1±0.6 | 82.8±0.4 | 82.3±1.1   | *82.8±0.4
         | 0.10     | 70.4±1.3 | 75.6±0.6 | 72.2±0.4 | 75.2±0.8 | 79.0±0.6 | 78.8±1.7   | *79.3±1.3
         | 0.15     | 65.1±0.7 | 69.8±1.3 | 66.8±0.4 | 71.0±0.6 | 76.4±1.3 | 77.2±1.6   | *77.9±1.4
         | 0.20     | 59.6±2.7 | 59.9±0.9 | 59.3±0.4 | 65.7±0.9 | 73.3±1.6 | 70.4±1.3   | *73.3±1.4
         | 0.25     | 47.5±2.0 | 54.8±0.7 | 50.5±0.8 | 60.8±1.1 | 69.7±1.7 | -          | *71.5±1.5
Citeseer | 0.00     | 72.0±0.6 | 73.3±0.8 | 71.2±0.8 | 72.1±0.6 | 73.3±0.7 | *73.8±0.6  | 73.2±0.7
         | 0.05     | 70.9±0.6 | 72.9±0.8 | 70.5±0.4 | 70.5±1.0 | 73.1±0.3 | 73.3±0.6   | *73.3±0.7
         | 0.10     | 67.6±0.9 | 70.6±0.5 | 67.7±0.3 | 69.5±0.6 | 72.5±0.8 | 72.4±0.9   | *72.6±0.3
         | 0.15     | 64.5±0.1 | 69.0±1.1 | 65.7±0.4 | 66.0±0.9 | 72.0±1.1 | 71.3±1.5   | *72.2±0.6
         | 0.20     | 62.0±3.5 | 61.0±1.5 | 62.5±1.2 | 59.3±1.4 | 70.0±2.3 | 64.9±1.0   | *71.5±0.7
         | 0.25     | 56.9±2.1 | 61.9±1.1 | 55.4±0.7 | 59.9±1.5 | 69.0±2.8 | -          | *70.8±1.6
Polblogs | 0.00     | 95.7±0.4 | 95.4±0.2 | 95.2±0.1 | -        | 93.2±0.6 | *95.8±0.3  | 94.2±1.3
         | 0.05     | 73.1±0.8 | 83.7±1.5 | 74.3±0.2 | -        | 93.3±0.2 | 83.0±0.3   | *93.9±2.8
         | 0.10     | 70.7±1.1 | 76.3±0.9 | 71.0±0.3 | -        | 89.4±1.1 | 81.6±0.3   | *90.0±3.6
         | 0.15     | 65.0±1.9 | 68.8±1.1 | 67.3±0.4 | -        | 86.0±2.2 | 78.7±0.5   | *88.2±5.2
         | 0.20     | 51.3±1.2 | 51.5±1.6 | 59.9±0.3 | -        | 79.6±5.7 | 77.5±0.2   | *86.8±6.2
         | 0.25     | 49.2±1.4 | 51.2±1.5 | 56.0±0.6 | -        | 63.2±4.4 | -          | *83.0±5.0
Table 2: Node classification accuracy (%) against Metattack; * marks the best result under each perturbation rate.

4.2 Defense Performance

We conduct experiments on three attack settings, including non-targeted attack, targeted attack and random attack.

  • Non-Targeted Attack. It considers the whole graph and aims to degrade the overall performance. We adopt a recent state-of-the-art attack, Metattack [37].

  • Targeted Attack. It aims to attack specific nodes. We adopt a recent state-of-the-art attack, Nettack [36].

  • Random Attack. It randomly perturbs the graph structure, i.e., it injects random noise into the graph.

Figure 3: The classification accuracy against Nettack.
Figure 4: The classification accuracy against random attack.
Figure 5: The classification accuracy of different hyper-parameters η as the perturbation rate increases.
Figure 6: The parameter sensitivity of each hyper-parameter η on three datasets.

4.2.1 Non-targeted Attack

We first evaluate the performance of each method against the non-targeted attack. In this experiment, we adopt Metattack and keep all the parameter settings from the original paper. To test the performance under different degrees of perturbation, we vary the perturbation rate from 0 to 25% with a step size of 5%. For each experiment, we run 10 trials with different seeds and report the mean accuracy and variance of each method in Table 2, where the best performance under each perturbation rate is marked. From this table, we make the following observations:

  • Our method consistently outperforms the other methods under different attack strengths. On each dataset, our method improves over ProGNN to varying degrees, and the improvement is more pronounced at larger perturbation rates. Compared to vanilla GCN, our method improves accuracy by 24.0%, 18.9%, and 33.8% on Cora, Citeseer, and Polblogs, respectively.

  • On datasets without node features, our method improves more over the baselines. For example, on Polblogs the improvement over the baseline methods is especially large: under the 25% perturbation rate, our method improves by 19.8% over the best baseline.

4.2.2 Targeted Attack

We then evaluate the performance of each method against the targeted attack. In this experiment, we adopt Nettack with its default parameter settings. We vary the number of perturbations on every targeted node from 1 to 5 with a step size of 1, and nodes with degree larger than 10 are chosen as targets. As shown in Figure 3, our method outperforms the baseline methods under different degrees of attack. Moreover, compared to the baselines, the performance of our method declines more slowly as the number of perturbations increases.

4.2.3 Random Attack

We also evaluate the performance of each method against the random attack. In this experiment, we vary the random noise rate from 0 to 100% with a step size of 20%. As shown in Figure 4, our method performs well under the random attack as well, and again performs better under larger perturbation rates, which indicates that our method can better defend against random noise.

4.3 Parameter Analysis

In this part, we explore the sensitivity of the hyper-parameters $\eta_{k}$ of each order. We first fix the $\eta_{k}$ and examine the performance as the perturbation rate changes. We then fix the perturbation rate and check the sensitivity of each hyper-parameter.

4.3.1 Parameter Effect

In this experiment, we use only one order of the regularization term at a time and observe how it affects the performance of our method. As shown in Figure 5, when the perturbation rate is small, the low-order hyper-parameter $\eta_{1}$ contributes more to the classification performance. As the perturbation rate increases, the effect of $\eta_{1}$ drops sharply, and the high-order hyper-parameters $\eta_{2}$ and $\eta_{3}$ gradually perform better than the low-order one. Between the high-order parameters, $\eta_{3}$ performs better when the perturbation is small, while $\eta_{2}$ performs better when the perturbation is relatively large.

4.3.2 Parameter Sensitivity

We then analyze the sensitivity of the different hyper-parameters. Here, we fix the perturbation rate at 0.25 for all datasets and tune each hyper-parameter to observe the performance. The results are shown in Figure 6.

As shown in Figure 6, the accuracy can be boosted by choosing appropriate values for all the hyper-parameters. Across the three datasets, $\eta_{2}$ always achieves the best performance, $\eta_{3}$ is more sensitive to its value, and $\eta_{1}$ contributes less at this high perturbation level than the other two parameters. In particular, $\eta_{1}$ helps very little on the Citeseer dataset at a high perturbation level, while $\eta_{2}$ and $\eta_{3}$ have similar effects.

5 Conclusion

Graph Neural Networks are vulnerable to adversarial attacks, where a small structure perturbation can fool GNNs into making wrong predictions. To improve the robustness of GNNs, we analyzed the adversarial attack problem from the perspective of feature smoothness and proved that the high-order graph is a smooth filter, which can be used to defend against adversarial attacks on graphs. Based on this, we proposed a novel structure learning method that exploits the high-order structure of the graph to guide the learning process. Extensive experiments on defending against graph adversarial attacks show that our method effectively improves the robustness of GNNs and outperforms recent state-of-the-art methods by a clear margin.

References

  • [1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [2] A. Bojchevski and S. Günnemann. Certifiable robustness to graph perturbations. arXiv preprint arXiv:1910.14356, 2019.
  • [3] A. Bojchevski, J. Klicpera, and S. Günnemann. Efficient robustness certificates for discrete data: Sparsity-aware randomized smoothing for graphs, images and more. In International Conference on Machine Learning, pages 1003–1013. PMLR, 2020.
  • [4] H. Chang, Y. Rong, T. Xu, W. Huang, H. Zhang, P. Cui, W. Zhu, and J. Huang. A restricted black-box adversarial framework towards attacking graph embedding models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3389–3396, 2020.
  • [5] H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song. Adversarial attack on graph structured data. In International conference on machine learning, pages 1115–1124. PMLR, 2018.
  • [6] Q. Dai, X. Shen, L. Zhang, Q. Li, and D. Wang. Adversarial training methods for network embedding. In The World Wide Web Conference, pages 329–339, 2019.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3844–3852, 2016.
  • [8] D. Ding, M. Zhang, X. Pan, M. Yang, and X. He. Improving the robustness of wasserstein embedding by adversarial pac-bayesian learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3791–3800, 2020.
  • [9] J. B. Estrach, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and deep locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR, volume 2014, 2014.
  • [10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [12] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
  • [13] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035, 2017.
  • [14] F. He, S. Fu, B. Wang, and D. Tao. Robustness, privacy, and generalization of adversarial training. arXiv preprint arXiv:2012.13573, 2020.
  • [15] W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 66–74, 2020.
  • [16] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [17] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI conference on artificial intelligence, 2018.
  • [18] Y. Li, W. Jin, H. Xu, and J. Tang. Deeprobust: A pytorch library for adversarial attacks and defenses. arXiv preprint arXiv:2005.06149, 2020.
  • [19] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3. Citeseer, 2013.
  • [20] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023. PMLR, 2016.
  • [21] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 51–58, 2012.
  • [22] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018.
  • [23] J. Schuchardt, A. Bojchevski, J. Klicpera, and S. Günnemann. Collective robustness certificates: Exploiting interdependence in graph neural networks. In International Conference on Learning Representations, 2020.
  • [24] L. Sun, Y. Dou, C. Yang, J. Wang, P. S. Yu, L. He, and B. Li. Adversarial attack and defense on graph data: A survey. arXiv preprint arXiv:1812.10528, 2018.
  • [25] Y. Sun, S. Wang, X. Tang, T.-Y. Hsieh, and V. Honavar. Non-target-specific node injection attacks on graph neural networks: A hierarchical reinforcement learning approach. In Proc. WWW, volume 3, 2020.
  • [26] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
  • [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.
  • [29] H. Wu, C. Wang, Y. Tyshetskiy, A. Docherty, K. Lu, and L. Zhu. Adversarial examples on graph data: Deep insights into attack and defense. arXiv preprint arXiv:1903.01610, 2019.
  • [30] K. Xu, H. Chen, S. Liu, P. Y. Chen, T. W. Weng, M. Hong, and X. Lin. Topology attack and defense for graph neural networks: An optimization perspective. In 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 3961–3967. International Joint Conferences on Artificial Intelligence, 2019.
  • [31] Z. Yang, W. Cohen, and R. Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pages 40–48. PMLR, 2016.
  • [32] X. Zhang and M. Zitnik. Gnnguard: Defending graph neural networks against adversarial attacks. Advances in Neural Information Processing Systems, 33, 2020.
  • [33] C. Zheng, B. Zong, W. Cheng, D. Song, J. Ni, W. Yu, H. Chen, and W. Wang. Robust graph representation learning via neural sparsification. In International Conference on Machine Learning, pages 11458–11468. PMLR, 2020.
  • [34] Q. Zheng, X. Zou, Y. Dong, Y. Cen, D. Yin, and J. Tang. Graph robustness benchmark: Rethinking and benchmarking adversarial robustness of graph neural networks. 2021.
  • [35] J. Zhu, J. Jin, M. T. Schaub, and D. Koutra. Improving robustness of graph neural networks with heterophily-inspired designs. arXiv preprint arXiv:2106.07767, 2021.
  • [36] D. Zügner, A. Akbarnejad, and S. Günnemann. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2847–2856, 2018.
  • [37] D. Zügner and S. Günnemann. Adversarial attacks on graph neural networks via meta learning. In International Conference on Learning Representations, 2018.
  • [38] D. Zügner and S. Günnemann. Certifiable robustness and robust training for graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 246–256, 2019.