This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Scalable Adversarial Attack on Graph Neural Networks
with Alternating Direction Method of Multipliers

Boyuan Feng, Yuke Wang, Xu Li, and Yufei Ding
Abstract

Graph neural networks (GNNs) have achieved high performance in analyzing graph-structured data and have been widely deployed in safety-critical areas, such as finance and autonomous driving. However, only a few works have explored GNNs’ robustness to adversarial attacks, and their designs are usually limited by the scale of input datasets (i.e., focusing on small graphs with only thousands of nodes). In this work, we propose, SAG, the first scalable adversarial attack method with Alternating Direction Method of Multipliers (ADMM). We first decouple the large-scale graph into several smaller graph partitions and cast the original problem into several subproblems. Then, we propose to solve these subproblems using projected gradient descent on both the graph topology and the node features that lead to considerably lower memory consumption compared to the conventional attack methods. Rigorous experiments further demonstrate that SAG can significantly reduce the computation and memory overhead compared with the state-of-the-art approach, making SAG applicable towards graphs with large size of nodes and edges.

Introduction

Graph neural networks (GNNs) play an essential role in many emerging applications with graph data. On these applications, GNNs show their strength in extracting valuable information from both the features (i.e., information from individual nodes) and the topology (i.e., the relationship between nodes). For example, GNNs can effectively analyze financial data to decide loan-related policies. Another example is the social network with billions of users where GNNs can do friendship recommendation.

Such wide deployment of GNNs motivates the investigation of their robustness and reliability. One of the key aspects is to effectively generate adversarial examples on graph data, so that we can better understand the “weakness” of GNNs and secure them more wisely afterward. Our initial exploration and studies identify several key properties to be considered in the GNN attack. First, the adversarial example needs to consider topology and feature information to comprehensively attack the GNNs on all perspectives. Second, the attack method needs to be efficient in both memory and computation for catering to the huge number of nodes in graph data.

However, existing work inevitably falls in short at least one of the above aspects, as summarized in Table 1. Specifically, FGSM (Goodfellow, Shlens, and Szegedy 2015) is crafted for attacking traditional deep neural networks (DNNs). Even though it can attack node embeddings with good computation and memory efficiency, it can not well support the graph topology, which distinguishes GNNs from DNNs. PGD (Xu et al. 2019a) is one of the state-of-the-art work designated for the GNN attack. However, it does not support the attack on node features, and it suffers a quadratic memory overhead due to maintaining the large size of edge gradients in a dense matrix format. For example, a graph with NN nodes leads to a dense N×NN\times N gradient matrix, consuming more than 1010GB memory for a moderate-size graph with only 50,00050,000 nodes. Another work, Nettack (Zügner, Akbarnejad, and Günnemann 2018), while performing well in three aspects, finds its shortcoming in computation efficiency due to a fine-grained edge manipulation in a very inefficient trial-and-error manner.

Table 1: Comparison with the Existing Attack Methods.
Method Topology Feature Comp. Effi. Mem. Effi.
  FGSM
PGD
Nettack
SAG

To overcome these challenges, we propose SAG, the first ADMM-attack on GNNs with the alternating direction method of multipliers (ADMM), which iteratively maximizes the training loss through modifying the graph topology and the node features. ADMM has been shown to be effective in dividing-and-conquering a large non-convex optimization problem into multiple smaller ones for achieving both efficiency and scalability (Fan et al. 2017; Leng et al. 2018; Liu et al. 2020; Ren et al. 2019). As shown in the last line of Table 1, SAG can attack both the graph topology and the node features, while maintaining computation and memory efficiency to a great extent, making it a viable and promising solution for large-scale graphs.

In summary, our major contributions are:

  • We identify and analyze the key properties of the effective adversarial attacks on GNNs, where none of the existing methods could address all of them systematically and comprehensively.

  • We propose SAG to effectively generate adversarial examples on graph neural networks based on both topology attack and feature attack. We formulate SAG with the ADMM optimization framework and achieve both high memory and computation efficiency.

  • Evaluation shows our proposed method can launch more effective attacks compared with the state-of-the-art approaches while reducing the computation and memory overhead, making it applicable towards large graph settings.

Related Work

Graph Adversarial Attacks. Graph adversarial attacks aim to maximize the accuracy drop on GNN models by introducing the perturbations, such as the modifications of the graph topology and the node representations (feature embeddings). Existing GNN attacks can be broadly classified into two major categories, poisoning attack (Zügner, Akbarnejad, and Günnemann 2018; Zügner and Günnemann 2019) and evasion (Dai et al. 2018) attack, depending on the time they happen. The former (poisoning attack) happens during the training time of the GNNs through modifying training data, while the latter (evasion attack) takes place during the GNN inference time by changing test data samples. Our proposed SAG is a comprehensive attack method that can cover both types of attack meanwhile offering significant computation and memory efficiency compared with the existing attack approaches.

ADMM Framework. ADMM (Boyd et al. 2011) is an optimization framework that is effective in decomposing and solving optimization problems under constraints. The theoretical results of ADMM have been explored in (Gómez, Eftekhari, and Cevher 2019; Poon and Liang 2019; Yu 2019; Nishihara et al. 2015) for various convex problems under diverse constraints and are shown to have linear convergence. Formally, ADMM can effectively solve an optimization problem under linear constraints

min𝐱,𝐳f(𝐱)+g(𝐳)subject toA𝐱+B𝐳=𝐜\begin{split}\min_{\mathbf{x},\mathbf{z}}\;\;\;\;&f(\mathbf{x})+g(\mathbf{z})\\ \text{subject to}\;\;\;\;&A\mathbf{x}+B\mathbf{z}=\mathbf{c}\end{split} (1)

where f(𝐱)f(\mathbf{x}) and g(𝐳)g(\mathbf{z}) could be either differentiable or non-differentiable but has some exploitable structure properties. Then, by introducing the augmented Lagrangian function, ADMM can break the problem into two subproblems in 𝐱\mathbf{x} and 𝐳\mathbf{z} and iteratively solve each subproblem at one time. While popular stochastic gradient descent (SGD) method can usually solve optimization problems, SGD cannot effectively process the case with diverse constraints (e.g., equality constraints between variables) and usually require ad-hoc modifications on gradients.

Although ADMM is originally developed to solve convex problems, recent studies successfully exploit ADMM to solve NP-hard non-convex problems under constraints in CNN pruning (Fan et al. 2017; Zhang et al. 2018), compressive sensing (Yang et al. 2016), Auto-ML (Liu et al. 2020), Top-K feature selection (Fan et al. 2017), and hardware design (Ren et al. 2019). In this paper, we focus on exploring the benefit of exploiting ADMM framework in the context of graph-based adversarial attacks.

Table 2: Notations and Definitions used in our paper.
GG A single large graph
NN The number of nodes
DD The length of node features
EE The number of edges in the original input graph
AA The original input graph of shape N×NN\times N
XX The original input feature of shape N×DN\times D
WW Fixed GNN weights
SS A perturbation matrix of shape N×NN\times N
A~\tilde{A} Perturbed adjacency matrix of shape N×NN\times N
X~\tilde{X} Perturbed features of shape N×DN\times D
ϵA\epsilon_{A} The maximum number of allowed edge perturbations
ϵX\epsilon_{X} The maximum L2L_{2} norm of allowed feature perturbations
(,)\ell(\cdot,\cdot) The loss function to measure the difference between
the current output of the model and the targeted labels
MM The number of subproblems to split
SiS_{i} The ith{1,2,,K}i^{th}\in\{1,2,...,K\} graph partition of shape NM×N\frac{N}{M}\times N
X~i\tilde{X}_{i} The ith{1,2,,K}i^{th}\in\{1,2,...,K\} feature copy of shape N×DN\times D
ΠC[g]\Pi_{C}[g] The projection of input gg towards set CC
KK The number of epochs in ADMM
VGTV_{GT} Set of nodes with ground truth label

Methodology

Problem Formulation of Scalable Graph Attack

We first define the notation in this paper, as summarized in Table 2. We consider an input graph G=(A,X)G=(A,X), where A{0,1}N×NA\in\{0,1\}^{N\times N} is the adjacency matrix on the edge connection between nodes, XN×DX\in\mathcal{R}^{N\times D} is the associated node features, NN is the number of nodes in the graph, and DD is the feature dimension of each node. Here, only a portion of nodes VGTV_{GT} are labeled and the goal of node classification is to predict labels for remaining nodes. Following the common practice in the field (Xu et al. 2019a; Zügner, Akbarnejad, and Günnemann 2018), we focus on the well-established work that utilizes graph convolutional layers (Kipf and Welling 2017) for node classification. Formally, the ithi^{th} layer is defined as

H(k+1)=A~H(k)W(k)H^{(k+1)}=\tilde{A}H^{(k)}W^{(k)} (2)

where A~=D~12(A+IN)D~12\tilde{A}=\tilde{D}^{-\frac{1}{2}}(A+I_{N})\tilde{D}^{-\frac{1}{2}} to achieve numerical stability, DD is a diagonal matrix with Dii=j(A+IN)ijD_{ii}=\sum_{j}(A+I_{N})_{ij}, and W(k)W^{(k)} is the weights for the kthk^{th} layer. Since the memory and the computation complexity generally increase as the number of layers increases, we focus on a single layer GCN that is tractable and still captures the idea of graph convolutions:

Z=softmax(A~XW)Z=softmax(\tilde{A}XW) (3)

Here, the output Z{1,2,,c}NZ\in\{1,2,...,c\}^{N} is the predicted labels for individual nodes. The parameter WW is learned by minimizing the cross-entropy on the output of labeled nodes.

(A,X;W,YGT)=vVGTlnZv,Yv\ell(A,X;W,Y_{GT})=-\sum_{v\in V_{GT}}ln\;\;Z_{v,Y_{v}} (4)

where YvY_{v} is the ground truth label for node vv. Our experiments show that adversarial examples on this single layer GCN model transfer well to GCNs with various layers and other GNN models such as GAT. Besides, recent studies (Zügner, Akbarnejad, and Günnemann 2018) show that a L-layer GCN can also be treated as σ(ALXWL)\sigma(A^{L}XW^{L}) to generate adversarial attacks with tractable computation.

In this paper, SAG aims to generate a perturbed adjacency matrix A~{0,1}N×N\tilde{A}\in\{0,1\}^{N\times N} and a perturbed feature matrix X~N×D\tilde{X}\in\mathcal{R}^{N\times D} satisfying a pre-defined perturbation budget. Similar to (Xu et al. 2019a), we use a Boolean matrix S{0,1}N×NS\in\{0,1\}^{N\times N} to record the edge perturbations that Sij=1S_{ij}=1 indicates the perturbation of edge between node ii and node jj. Given the original adjacency matrix AA and its supplement matrix A¯\bar{A} (i.e., A¯i,j=¬Ai,j\bar{A}_{i,j}=\neg A_{i,j}), we can generate the perturbed adjacency matrix as A~=A+(A¯A)S\tilde{A}=A+(\bar{A}-A)\circ S, where \circ is the Hadamard product. Formally, given the edge perturbation budget ϵA\epsilon_{A} and the feature perturbation budget ϵX\epsilon_{X}, SAG aims to solve the following optimization problem

minS,X~(S,X~)s.t.X~X22ϵX1TSϵS,S{0,1}N×N\begin{split}\min_{S,\tilde{X}}\;\;\;\;\;&-\ell(S,\tilde{X})\\ \text{s.t.}\;\;\;\;\;&||\tilde{X}-X||_{2}^{2}\leq\epsilon_{X}\\ &1^{T}S\leq\epsilon_{S},S\in\{0,1\}^{N\times N}\end{split} (5)

For notation simplicity, we use (S,X~)\ell(S,\tilde{X}) to represent the cross-entropy loss (S,X~;W,YGT)\ell(S,\tilde{X};W,Y_{GT}) when the context is clear. Following the common practice (Zügner, Akbarnejad, and Günnemann 2018; Xu et al. 2019a) in the field, we consider two threat models – the evasive attack and the poisoning attack. The evasive attack assumes that the GNN model is fixed and targets the test data. The poisoning attack targets the training data and performs the model training phase on the perturbed data after the attack.

There are several challenges in solving this optimization problem. First, the popular stochastic gradient descent (SGD) methods cannot effectively solve optimization problems under constraints and usually require ad-hoc modification on the gradients to satisfying such constraints. Second, the discrete modification on the edge perturbations makes it a combinatorial optimization problem. (Zügner, Akbarnejad, and Günnemann 2018) attacks the graph topology by changing one edge at one time and selecting the edges that achieve the maximum loss. However, this approach needs to trial-and-error on all edges, leading to prohibitive time cost. (Xu et al. 2019a) takes gradient on the SS matrix by releasing the requirement S{0,1}N×NS\in\{0,1\}^{N\times N} to be S[0,1]N×NS\in[0,1]^{N\times N}. However, this approach takes N×NN\times N memory to store the gradient matrix, which is prohibitive for large graphs with tens of thousands of nodes. Besides, (Xu et al. 2019a) only supports topology attack and cannot conduct joint attack on the node features. We detail ADMM-based SAG on solving the optimization problems under constraints in later sections.

Refer to caption
Figure 1: Overview of SAG.

Scalable Graph Attack Framework using ADMM

SAG achieves a scalable attack on graph data by splitting the matrix SS into multiple partitions and consider one partition at each time, as illustrated in Figure 1. Supposing we split the graph into MM partitions, we only need to take gradient on a small matrix with shape NM×N\frac{N}{M}\times N that can easily fit into the memory. Similar to (Xu et al. 2019a) and the probabilistic view of node classification problem, we first release the discrete constraints S{0,1}N×NS\in\{0,1\}^{N\times N} to continuous constraints S[0,1]N×NS\in[0,1]^{N\times N}, where SijS_{ij} indicates the probability that an edge needs to be perturbed for attack. Since the impact between node ii and node jj may be asymmetry (i.e., the impact of node ii on node jj is higher than the reverse one), we do not apply symmetry constraint on SS. Then, we split S[0,1]N×NS\in[0,1]^{N\times N} into MM sub matrix such that Si[0,1](N/M)×NS_{i}\in[0,1]^{(N/M)\times N}, where SiS_{i} considers the nodes with index between [iNM,(i+1)NM][\left\lfloor i*\frac{N}{M}\right\rfloor,\left\lfloor(i+1)*\frac{N}{M}\right\rfloor]. Due to the cumulative property in cross-entropy loss, we can attack by solving the following optimization problem

minSi,X~i=1M(Si,X~)s.t.X~X22ϵX1TSiϵi,Si[0,1](N/M)×N\begin{split}\min_{S_{i},\tilde{X}}\;\;\;\;\;&-\sum_{i=1}^{M}\ell(S_{i},\tilde{X})\\ \text{s.t.}\;\;\;\;\;&||\tilde{X}-X||_{2}^{2}\leq\epsilon_{X}\\ &1^{T}S_{i}\leq\epsilon_{i},S_{i}\in[0,1]^{(N/M)\times N}\end{split} (6)

Here, ϵi\epsilon_{i} represents the allowed number of edges to change in each graph partition SiS_{i} and we set it to be ϵAM\frac{\epsilon_{A}}{M} for simplicity.

Ideally, we can split the problem 6 into MM sub-problems and solve them independently for memory efficiency. However, problem 6 still has interaction between MM sub-problems on the X~\tilde{X} term. To this end, we further reformulate the problem 6 into the following problem by substitute X~N×D\tilde{X}\in\mathcal{R}^{N\times D} with a duplicated feature matrix X~iN×D\tilde{X}_{i}\in\mathcal{R}^{N\times D}

minSi,X~ii=1M(Si,X~i)+i=1MICSi(Si)+i=1MICXi(X~i)s.t.Xi~=X~i+1,i{1,2,,M}\begin{split}\min_{S_{i},\tilde{X}_{i}}\;&-\sum_{i=1}^{M}\ell(S_{i},\tilde{X}_{i})+\sum_{i=1}^{M}I_{C_{Si}}(S_{i})+\sum_{i=1}^{M}I_{C_{Xi}}(\tilde{X}_{i})\\ \text{s.t.}\;\;\;\;&\tilde{X_{i}}=\tilde{X}_{i+1},\;\;\;\;i\in\{1,2,...,M\}\\ \end{split} (7)

For notation simplicity, we use X~M+1\tilde{X}_{M+1} to represent X~1\tilde{X}_{1}. IC(x)I_{C}(x) is an indicator function such that IC(X)=0I_{C}(X)=0 if XCX\in C, otherwise IC(X)=I_{C}(X)=\infty. CSiC_{Si} and CXiC_{Xi} are feasible sets

CSi={Si|Si[0,1]NM×N,  1TSiϵi}CXi={Xi~|Xi~X22ϵX}\begin{split}C_{Si}&=\{S_{i}\;|\;S_{i}\in[0,1]^{\frac{N}{M}\times N},\;\;1^{T}S_{i}\leq\epsilon_{i}\}\\ C_{Xi}&=\{\tilde{X_{i}}\;|\;||\tilde{X_{i}}-X||_{2}^{2}\leq\epsilon_{X}\}\\ \end{split} (8)

Here, the popular SGD cannot be easily applied to the optimization problem 7 under constraints, especially the equality ones. To this end, we can adopt the ADMM framework to systematically solve the problem. First, we have the Lagragian function Lρ(X~i,Si,μi)=i=1M(Si,X~i)+i=1MICXi(X~i)+i=1MICSi(Si)+ρ2i=1MX~iX~i+12+i=1MμiT(X~iX~i+1)\begin{split}L_{\rho}(\tilde{X}_{i},S_{i},\mu_{i})=&-\sum_{i=1}^{M}\ell(S_{i},\tilde{X}_{i})+\sum_{i=1}^{M}I_{C_{Xi}}(\tilde{X}_{i})+\sum_{i=1}^{M}I_{C_{Si}}(S_{i})\\ &+\frac{\rho}{2}\sum_{i=1}^{M}||\tilde{X}_{i}-\tilde{X}_{i+1}||^{2}+\sum_{i=1}^{M}\mu_{i}^{T}(\tilde{X}_{i}-\tilde{X}_{i+1})\end{split} (9) where ρ>0\rho>0 is a hyper-parameter and μiN×D\mu_{i}\in\mathcal{R}^{N\times D} is the dual variable.

Following the ADMM framework, we can solve the problem 7 by repeating for k{1,2,,K}k\in\{1,2,...,K\} iterations and, in each iteration, solving each SiS_{i} and X~i\tilde{X}_{i} individually

X~i(k+1)=argminX~iLρ(X~i,Si(k),μi(k))Si(k+1)=argminSiLρ(X~i(k+1),Si,μi(k))μi(k+1)=μi(k)+ρ(X~i(k+1)X~i+1(k+1))\begin{split}\tilde{X}_{i}^{(k+1)}&=\operatornamewithlimits{argmin}_{\tilde{X}_{i}}L_{\rho}(\tilde{X}_{i},S_{i}^{(k)},\mu_{i}^{(k)})\\ S_{i}^{(k+1)}&=\operatornamewithlimits{argmin}_{S_{i}}L_{\rho}(\tilde{X}_{i}^{(k+1)},S_{i},\mu_{i}^{(k)})\\ \mu_{i}^{(k+1)}&=\mu_{i}^{(k)}+\rho(\tilde{X}_{i}^{(k+1)}-\tilde{X}_{i+1}^{(k+1)})\end{split} (10)

which is respectively the feature update, topology update, and dual update. We stress that we only need to solve the minimization problem for a single graph partition SiNM×NS_{i}\in\mathcal{R}^{\frac{N}{M}\times N}, leading to much reduced memory consumption compared to (Xu et al. 2019a), which requires solving the whole graph SN×NS\in\mathcal{R}^{N\times N} at the same time. Here, the main memory overhead comes from the duplicated feature X~iN×D\tilde{X}_{i}\in\mathcal{R}^{N\times D}. However, the feature dimension DD is usually a fixed number around 10001000, which is much smaller than the number of nodes NN that may reach tens of thousands, or even millions.

Algorithm Subroutines for SAG Optimization

In this section, we present how to efficiently solve the above problem and derive closed form formula for individual minimization problems.

Feature Update. In feature update, we aim to find the feature X~i\tilde{X}_{i} that minimizes

(Si(k),X~i)+ICXi(X~i)+ρ2X~iX~i+1(k)2+μi(k)T(X~iX~i+1(k))-\ell(S_{i}^{(k)},\tilde{X}_{i})+I_{C_{Xi}}(\tilde{X}_{i})+\frac{\rho}{2}||\tilde{X}_{i}-\tilde{X}_{i+1}^{(k)}||^{2}+\mu_{i}^{(k)\;T}(\tilde{X}_{i}-\tilde{X}_{i+1}^{(k)}) (11)

Here, (,)\ell(\cdot,\cdot) is the cross-entropy loss on the GNN predictions, the last two terms are differentiable functions, but the ICXi()I_{C_{Xi}}(\cdot) function cannot be easily solved with gradient descent method due to the \infty values for X~iCXi\tilde{X}_{i}\notin C_{Xi}. To this end, we adopt a two-step strategy for the feature update. For the first three terms that are differentiable, we use the gradient descent method to access the gradient gXi(k)g_{Xi}^{(k)} on X~i\tilde{X}_{i} and get a pseudo-perturbed feature X~i(k+1)\tilde{X}_{i}^{(k+1)^{\prime}}

X~i(k+1)=X~i(k)ηkgXi(k)\tilde{X}_{i}^{(k+1)^{\prime}}=\tilde{X}_{i}^{(k)}-\eta_{k}\cdot g_{X_{i}}^{(k)} (12)

where ηk\eta_{k} is the learning rate at iteration kk. Note that the update on feature X~i\tilde{X}_{i} depends only on the graph partition Si(k)S_{i}^{(k)} and does not involve with remaining graph partitions. For the term ICXi()I_{C_{Xi}}(\cdot), we refer to the projection method by projecting the pseudo-perturbed feature onto the feasible set CXiC_{Xi}:

X~i(k+1)=ΠCXi[X~i(k)ηkgXi(k)]gXi(k)=Xi(Si(k),Xi(k))+ρ(Xi(k)Xi+1(k))+μ(k)\begin{split}\tilde{X}_{i}^{(k+1)}&=\Pi_{C_{Xi}}[\tilde{X}_{i}^{(k)}-\eta_{k}\cdot g_{X_{i}}^{(k)}]\\ g_{X_{i}}^{(k)}&=-\frac{\partial}{\partial X_{i}}\ell(S_{i}^{(k)},X_{i}^{(k)})+\rho(X_{i}^{(k)}-X_{i+1}^{(k)})+\mu^{(k)}\end{split} (13)

While computing the projection is a difficult task in general and usually requires an iterative procedure to solve, we exploit the special structure of CXiC_{Xi} and derive a closed-form formula to analytically solve it.

Proposition 1. Given CX={a|aX22ϵX}C_{X}=\{a\;|\;||a-X||_{2}^{2}\leq\epsilon_{X}\}, the projection of aa to CXC_{X} is

ΠCX(a)={a+uX1+uif u>0 and u=aX22ϵX1aif aX22ϵX\Pi_{C_{X}}(a)=\begin{cases}\frac{a+uX}{1+u}\;\;&\text{if $u>0$ and $u=\sqrt{\frac{||a-X||_{2}^{2}}{\epsilon_{X}}}-1$}\\ a\;\;&\text{if $||a-X||_{2}^{2}\leq\epsilon_{X}$}\end{cases} (14)

Proof: ΠCX(a)\Pi_{C_{X}}(a) can be viewed as an optimization problem

minRRa22s.t.(RX)T(RX)ϵX\begin{split}\min_{R}\;\;\;\;&||R-a||_{2}^{2}\\ \text{s.t.}\;\;\;\;&(R-X)^{T}(R-X)\leq\epsilon_{X}\end{split} (15)

We can derive its Lagrangian function as

L(R,u)=Ra22+u[(RX)T(RX)ϵX]L(R,u)=||R-a||_{2}^{2}+u[(R-X)^{T}(R-X)-\epsilon_{X}] (16)

where u0u\geq 0. Using the KKT condition we have the stationary condition that

LR=2(Ra)+2u(RX)=0\frac{\partial L}{\partial R}=2(R-a)+2u(R-X)=0 (17)

and get that R=a+uX1+uR=\frac{a+uX}{1+u}. We can also get the complementary slackness that

u[(RX)T(RX)ϵX]=0u[(R-X)^{T}(R-X)-\epsilon_{X}]=0 (18)

If u>0u>0, we need to have R=a+uX1+uR=\frac{a+uX}{1+u} and (RX)T(RX)=ϵX(R-X)^{T}(R-X)=\epsilon_{X}. By reformulating, we can have u=aX22ϵX1u=\sqrt{\frac{||a-X||_{2}^{2}}{\epsilon_{X}}}-1. If u=0u=0 and aX22ϵX||a-X||_{2}^{2}\leq\epsilon_{X}, we have R=aR=a.

Topology Update. In topology update, we aim to minimize the following function by finding SiS_{i}

(Si,X~i(k+1))+ICSi(Si)-\ell(S_{i},\tilde{X}_{i}^{(k+1)})+I_{C_{Si}}(S_{i}) (19)

Similar to feature update, we can first use the gradient descent method to access the gradient gSi(k)g_{S_{i}}^{(k)} on SiS_{i} and then use the projection method to generate the perturbed topology in the feasible set CSiC_{S_{i}}

Si(k+1)=ΠCSi[Si(k)ηtgSi(k)]gSi(k)=Si(Si,X~i(k+1))\begin{split}S_{i}^{(k+1)}&=\Pi_{C_{S_{i}}}[S_{i}^{(k)}-\eta_{t}\cdot g_{S_{i}}^{(k)}]\\ g_{S_{i}}^{(k)}&=-\frac{\partial}{\partial S_{i}}\ell(S_{i},\tilde{X}_{i}^{(k+1)})\\ \end{split} (20)

where ηt\eta_{t} is the learning rate. With the Langrangian function and KKT condition, we can derive the closed-form formula to analytically project gSikg_{S_{i}}^{k} to the feasible set CSiC_{S_{i}}. Due to the similarity with proof for Proposition 1 and page limits, we leave the detailed proof to appendix.

Proposition 2. Given CSi={Si|Si[0,1]NM×N,1TSiϵi}C_{S_{i}}=\{S_{i}\;|\;S_{i}\in[0,1]^{\frac{N}{M}\times N},1^{T}S_{i}\leq\epsilon_{i}\}, the projection of aa to CSiC_{S_{i}} is

ΠCSi(a)={P[0,1](au1)if u>0 and 1TP[0,1](au1)=ϵiP[0,1](a)if 1TP[0,1](a)ϵi\Pi_{C_{S_{i}}}(a)=\begin{cases}P_{[0,1]}(a-u1)\;\;\text{if $u>0$ and $1^{T}P_{[0,1]}(a-u1)=\epsilon_{i}$}\\ P_{[0,1]}(a)\;\;\text{if $1^{T}P_{[0,1]}(a)\leq\epsilon_{i}$}\end{cases} (21)

where P[0,1](x)=x,ifx[0,1];0,ifx<0;1,ifx>1.P_{[0,1]}(x)=x,if\;x\in[0,1];0,if\;x<0;1,if\;x>1.

1 Input: Given A, X, fixed GNN weights WW, learning rate ηt\eta_{t}, epoch number KK, and partition number MM;
2Initialize: Si=AiS_{i}=A_{i}, Xi=XX_{i}=X
3for k=1,2,,Kk=1,2,...,K do
4       for i=1,2,,Mi=1,2,...,M do
5             Feature Update on X~i\tilde{X}_{i}:
6            X~ik+1=ΠCXi[X~ikηkgXik]\tilde{X}_{i}^{k+1}=\Pi_{C_{Xi}}[\tilde{X}_{i}^{k}-\eta_{k}\cdot g_{X_{i}}^{k}] with Eq 13.
7            Topology Update on SiS_{i}:
8            Sik+1=ΠCSi[SikηtgSik]S_{i}^{k+1}=\Pi_{C_{S_{i}}}[S_{i}^{k}-\eta_{t}\cdot g_{S_{i}}^{k}] with Eq 20.
9            Dual Update on μi\mu_{i}:
10            μik+1=μIk+ρ(X~ik+1X~i+1k+1)\mu_{i}^{k+1}=\mu_{I}^{k}+\rho\cdot(\tilde{X}_{i}^{k+1}-\tilde{X}_{i+1}^{k+1}) with Eq 10.
11       end for
12      
13 end for
Sample and generate the final perturbed matrix.
Algorithm 1 SAG to solve Problem 7.

Optimization and Complexity Analysis. We summarize our SAG in Algorithm 1. There are two optimization loops. In the inner loop, we iterate through MM graph partitions and update the corresponding partitioned graph perturbation SiS_{i} and feature perturbation X~i\tilde{X}_{i}. At each iteration, only a single graph partition SiS_{i} needs to be considered and large memory is saved for not considering the other M1M-1 partitions. In the outer loop, we repeat the optimization for KK (=200 by default) iterations for the algorithm to converge.

After KK iterations, X~i\tilde{X}_{i} will be identical to each other and can be directly used as the feature perturbation X~\tilde{X}. Recalling that SiS_{i} is a probability whether an edge needs to be perturbed for attack, we use Bernoulli distribution to sample a 0-10\text{-}1-valued edge between each pair of nodes. We repeat this sample procedure for 2020 times and select the one with minimal loss for the final perturbed topology SN×NS\in\mathcal{R}^{N\times N}.

The memory complexity of SAG is

O(NMN+ND)O(\frac{N}{M}\cdot N+N\cdot D) (22)

for storing graph partition SiS_{i} and feature perturbation X~i\tilde{X}_{i}, respectively, since we only need to optimize one graph partition at each iteration. We store only the current graph partition in the limited GPU memory (around 10 GB) and offload the remaining graph partitions to the large host memory (more than 50 GB). Note that the same implementation strategy cannot be applied to (Xu et al. 2019a) which requires attacking the whole graph SN×NS\in\mathcal{R}^{N\times N} at each iteration.

Evaluation

In this section, we evaluate SAG on five datasets and compare with three attack algorithms to show its effectiveness.

Datasets. In this experiment, we select various datasets to cover the vast majority of the GNN inputs, including typical datasets (Citeseer, Cora, Pubmed, and Amazon-Computer/Photo) used by many GNN papers (Kipf and Welling 2017; Xu et al. 2019b; Hamilton, Ying, and Leskovec 2017). Details of these datasets are listed in Table 3.

Baselines. To evaluate the effectiveness of SAG, we compare it with the state-of-the-art attack methods by using the adversarial attack repository DeepRobust 111https://github.com/DSE-MSU/DeepRobust.git.

  • FGSM (Goodfellow, Shlens, and Szegedy 2015) is a gradient-based adversial attack that generates adversarial examples by perturbing the input features. While it is originally designed for attacking CNNs, it can be easily adapted to attack GNNs by perturbing node features.

  • PGD (Xu et al. 2019a) is another gradient-based attack method tailored for discrete graph data. The reason to select PGD for comparison is that it provides fast attack on discrete graph data by leveraging an optimization-based approach.

  • Nettack (Zügner, Akbarnejad, and Günnemann 2018) is the most popular attack algorithm on graph data by incorporating both edge and feature attacks. We select Nettack for comparison because it serves as a strong baseline for SAG on the joint attack.

Models. Graph Convolutional Network (GCN(Kipf and Welling 2017) is one of the most popular GNN architectures. It has been widely adopted in node classification, graph classification, and link prediction tasks. Besides, it is also the key backbone network for many other GNNs, such as GraphSage (Hamilton, Ying, and Leskovec 2017), and differentiable pooling (Diffpool) (Ying et al. 2018). We use the setting of hidden dimension size = 16 for each layer of GCN. Graph Attention Network (GAT(Xu et al. 2019b), another typical category of GNN, aims to distinguish the graph-structure that cannot be identified by GCN. GAT differs from GCN in its aggregation function, which assigns different weights for different nodes during the aggregation. We use the setting of 8 hidden dimension and 8 attention heads for each layer of GAT.

Platforms. We implement SAG based on PyTorch Geometric (Fey and Lenssen 2019). We evaluate SAG on Dell T7910 (Ubuntu 18.04) with Intel Xeon CPU E5-2603, 64 GB host memory, and an NVIDIA 1080Ti GPU with 12 GB memory.

Table 3: Datasets for Evaluation.
  Dataset #Vertex #Edge #Dim #Class
  Cora 2,708 10,858 1,433 7
Citeseer 3,327 9,464 3,703 6
Amazon-Photo 7,487 119,043 745 8
Amazon-Computer 13,381 245,778 767 10
Pubmed 19,717 88,676 500 3
 
Table 4: Evaluation of SAG with existing adversarial attacks.
Dataset Method Time (min) Mem. (GB) Evasive Acc. (%) Poisoning Acc. (%)
Cora Clean 0 0 79.63 79.63
FGSM 0.05 0.64 70.52 78.87
PGD 0.25 1.03 75.70 70.77
Nettack 14.75 0.68 69.62 71.35
SAG 0.48 0.70 68.06 67.15
Citeseer Clean 0 0 71.80 71.80
FGSM 0.03 0.41 67.00 71.70
PGD 0.18 0.97 69.08 67.30
Nettack 12.63 0.60 62.91 67.54
SAG 0.34 0.82 62.62 63.68
Amazon Photo Clean 0 0 93.11 93.11
FGSM 0.05 1.17 89.13 91.37
PGD 4.21 3.69 83.02 82.77
Nettack 1560 1.84 87.95 89.25
SAG 6.13 1.24 80.52 80.12
Amazon Computer Clean 0 0 89.29 89.29
FGSM 3.28 2.67 85.62 88.03
PGD 17.32 10.59 77.22 77.93
Nettack 2578 4.35 82.41 85.33
SAG 18.78 2.91 74.56 76.31
Pubmed Clean 0 0 77.53 77.53
FGSM 9.83 3.54 72.93 74.25
PGD - OOM - -
Nettack 3108 8.17 67.72 76.25
SAG 47.62 3.63 62.71 36.27
  • 1

    Note that “-” means such parameter is not applicable to the given setting.

  • 2

    OOM refers to “Out of Memory”.

Metrics. We evaluate SAG with six metrics – evasive accuracy, poisoning accuracy, topology ratio, feature ratio, memory consumption, and running time. Following the common setting (Zügner, Akbarnejad, and Günnemann 2018; Xu et al. 2019a), we report the evasive accuracy by assuming the GNN model is fixed and targeting the test data. We report the poisoning accuracy by targeting the training data and perform the model training phase after the attack. The topology ratio (%) is computed as the number of attacked edges over the number of existing edges in the clean graph dataset. The feature ratio (%) is reported as the L2L_{2} norm of perturbed features over the L2L_{2} norm of the original features in the clean graph dataset. To measure the memory, we utilize NVProf to query the runtime GPU memory consumption at a pre-selected frequency of 100100 ms. To measure the time, we leverage the software timer from python timetime library. For a fair comparison, all gradient-based approaches are conducted for 200200 iterations.

Overall Performance

Table  4 shows the overall performance comparison between SAG and existing adversarial attacks under the same setting of topology ratio and feature ratio. Following the most common setting used by many previous papers (Zügner, Akbarnejad, and Günnemann 2018; Xu et al. 2019a), we select the same topology attack ratio of 5%5\% and feature attack ratio of 2%2\%, and leave the study on diverse ratios to the ablation study. On PGD and FGSM, we only attack the topology and features, respectively, due to their limits in attacking capability. In SAG, we stick to M=2 and will exhibit the impact of M in the ablation study. We observe that SAG consistently outperforms the state-of-the-art attack approach, such as FGSM, PGD, and Nettack on evasive attack (up to 14.82%14.82\% accuracy drop) and poisoning attack (up to 41.26%41.26\% accuracy drop) across different datasets. On Pubmed dataset, SAG achieves 14.82%14.82\% accuracy drop in evasive attack and 41.26%41.26\% in poisoning attack. The major reason for such success is that SAG enables the gradient-based joint optimization on both the features and topology while incorporating global reasoning on the interaction between attacking different nodes. By contrast, FGSM and PGD attack only the feature or topology, and Nettack considers only one edge at each time, failing to reason the global interaction across edges and nodes.

Refer to caption
(a) Attack Loss
Refer to caption
(b) X~i(k)X~i+1(k)2||\tilde{X}_{i}^{(k)}-\tilde{X}_{i+1}^{(k)}||_{2}
Refer to caption
(c) X~i(k+1)X~i(k)2||\tilde{X}_{i}^{(k+1)}-\tilde{X}_{i}^{(k)}||_{2}
Figure 2: Convergence Behavior of SAG on Cora and Pubmed.

Across different datasets and settings, we notice that Nettack always comes with the highest time cost. The reason is that, at each iteration, it selects only one edge or feature to attack by examining all edges and node features, and repeats the procedure until reaching the topology ratio and the feature ratio. Moreover, we observe that PGD usually has the highest memory consumption, since it requires a floating-point N×NN\times N matrix to store the edge gradients between each pair of nodes, where NN is the number of nodes. For Pubmed, a single matrix of this shape requires at least 1.51.5 GB to store and PGD processes the whole matrix at each iteration, instead of processing only a small partition as the case in SAG. Besides, in the time and memory comparison among these implementations, we notice the significant strength of SAG, which achieves up to 254×254\times speedup and 3.6×3.6\times memory reduction. This is largely due to our effective algorithmic and implementation optimizations that can reduce the runtime complexity meanwhile amortizing the memory overhead to great extent.

Ablation Studies

In this ablation study, we will focus on two representative datasets – Cora and Pubmed – for the performance of SAG on a small graph dataset and a large graph dataset.

ADMM Convergence Behavior We show the ADMM convergence behavior in Figure  2. Here, we adopt the same number of epochs as T=200T=200 and show only the first 7070 epochs in Figure 2(b) and Figure 2(c) since SAG converges fast on X~i(k)X~i+1(k)2||\tilde{X}_{i}^{(k)}-\tilde{X}_{i+1}^{(k)}||_{2} and X~i(k+1)X~i(k)2||\tilde{X}_{i}^{(k+1)}-\tilde{X}_{i}^{(k)}||_{2}. We only present the result for i=1i=1 since various ii’s show similar results. Overall, SAG converges gracefully as epoch increases, demonstrating the effectiveness of our method. In Figure 2(a), with the increase of the epoch number, SAG gradually increases the attack loss by perturbing the features and the topology. In Figure 2(b), X~i(k)X~i+1(k)2||\tilde{X}_{i}^{(k)}-\tilde{X}_{i+1}^{(k)}||_{2} starts from 0 since we initialize both X~i\tilde{X}_{i} and X~i+1\tilde{X}_{i+1} to the node features on clean graph dataset. Compared across different datasets, Pubmed converges faster than Cora since Pubmed contains much smaller feature dimension than Cora.

Topology and Feature Perturbation For poisoning attack on Cora dataset (Table 5), the increase of feature perturbation and topology perturbation would both lead to the accuracy drop compared with the original clean data. Besides, we observe that, at the same level of topology perturbation, 8%8\% feature perturbation can lead to 3.6%3.6\% extra accuracy drop on average. On Pubmed dataset, we observe that 8%8\% feature perturbation can lead to 9.4%9.4\% extra accuracy drop, averaged over various topology perturbation ratio. These results show the benefit of attacking both topology and features.

Table 5: Accuracy (%) of SAG under the diverse ratio of perturbed edges and feature attacks on Cora and Pubmed.
  Cora Feature Perturbation (%)
Topology Perturbation (%) 0 1 2 4 8
0 79.63 79.25 78.91 78.84 78.73
5 70.47 69.11 68.71 66.52 65.59
10 58.63 57.29 56.34 55.23 53.77
15 50.15 49.60 47.53 47.04 46.73
20 45.37 43.46 43.06 42.71 41.65
  Pubmed Feature Perturbation (%)
Topology Perturbation (%) 0 1 2 4 8
0 77.53 75.34 73.91 72.55 68.65
5 49.19 39.14 36.27 31.65 28.53
10 33.24 28.57 27.81 26.35 25.30
15 28.04 26.05 23.59 23.04 22.34
20 23.95 21.62 21.09 20.71 20.38
Table 6: Impact of M for Poisoning Attack on Pubmed.
  M Time Memory Accuracy Drop
(min) (GB) (%)
1 33 5.72 35.37
2 42 3.63 36.27
4 62 2.75 36.18
8 95 2.21 35.97
 
Table 7: SAG Transferability for Poisoning Attack.
  Cora (%) Pubmed (%)
1-layer GCN 0.67 (0.80) 0.35 (0.77)
2-layer GCN 0.78 (0.83) 0.80 (0.85)
4-layer GCN 0.74 (0.81) 0.76 (0.84)
1-layer GAT 0.74 (0.82) 0.37 (0.79)
2-layer GAT 0.77 (0.83) 0.78 (0.85)
4-layer GAT 0.75 (0.80) 0.73 (0.82)
 
  • 1

    Data Format: attacked data acc. (clean data acc.).

M-value Impact We also evaluate SAG for poisoning attack on Pubmed to show the impact of the hyperparameter M (i.e., the number of graph partitions) on memory saving. As shown in Table 6, with the increase of the M value, the memory size reduction becomes significant, since splitting the graph into M partitions and attacking individual partitions at each time essentially reduce the memory requirement. We also observe similar accuracy drop under different M since SAG converges gracefully and hits similar optimal points for diverse M. Meanwhile, we also observe that the increase of value M also brings the runtime overhead in terms of time cost, for example, M=8 setting is 33 minutes slower than M=4 setting. This slowdown happens since we need to attack individual split graphs at each time, leading to a small portion of system overhead on memory access. This also leads to a tradeoff among these factors when selecting the value of M. We also observe that the memory consumption does not decrease linearly as K increases. The main reason is that, as M reduces, memory consumption from other sources (e.g., loading PyTorch framework and the features) becomes the dominant component.

Transferability To demonstrate the transferability of SAG, we further evaluate our attacked graphs (Cora and Pubmed) on GCNs (with 1, 2, and 4 layers) and GATs (with 1, 2, and 4 layers), respectively. We generate adversarial examples on a 1-layer GCN model and conduct poisoning attack on other models by targeting the training data and training these models on the perturbed data. As shown in Table 7, SAG can effectively maximize the accuracy drop on Cora (up to 13%) and Pubmed (up to 42%). The major reason for such success in launching the poisoning attack is that the adversarial attack on 1-layer GCN effectively captures the intrinsic property on the graph data that is agnostic to the models. We want to stress that, even on models with different layers (i.e., 2 and 4), the poisoned graph data can still achieve 8%8\% and 9%9\% accuracy drop on Cora and Pubmed, respectively. These results demonstrate the transferability of SAG towards models with diverse architectures and the number of layers.

Conclusion

This work focuses on GNN robustness by giving an in-depth understanding of GNN’s “weakness”. We propose SAG, the first scalable adversarial attack method with Alternating Direction Method of Multipliers (ADMM), which can successfully overcome the limitations of the previous solutions. Extensive experiments further highlight SAG’s advantage of reducing the computation and memory overhead over the existing approaches.

References

  • Boyd et al. (2011) Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3(1): 1–122. ISSN 1935-8237. doi:10.1561/2200000016. URL https://doi.org/10.1561/2200000016.
  • Dai et al. (2018) Dai, H.; Li, H.; Tian, T.; Huang, X.; Wang, L.; Zhu, J.; and Song, L. 2018. Adversarial attack on graph structured data. arXiv preprint arXiv:1806.02371 .
  • Fan et al. (2017) Fan, M.; Chang, X.; Zhang, X.; Wang, D.; and Du, L. 2017. Top-k Supervise Feature Selection via ADMM for Integer Programming. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 1646–1653. doi:10.24963/ijcai.2017/228. URL https://doi.org/10.24963/ijcai.2017/228.
  • Fey and Lenssen (2019) Fey, M.; and Lenssen, J. E. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (ICLR).
  • Gómez, Eftekhari, and Cevher (2019) Gómez, F. L.; Eftekhari, A.; and Cevher, V. 2019. Fast and Provable ADMM for Learning with Generative Priors. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 12004–12016. URL http://papers.nips.cc/paper/9371-fast-and-provable-admm-for-learning-with-generative-priors.
  • Goodfellow, Shlens, and Szegedy (2015) Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICML). URL http://arxiv.org/abs/1412.6572.
  • Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems (NeurIPS), 1024–1034.
  • Kipf and Welling (2017) Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR) .
  • Leng et al. (2018) Leng, C.; Dou, Z.; Li, H.; Zhu, S.; and Jin, R. 2018. Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM. In McIlraith, S. A.; and Weinberger, K. Q., eds., Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 3466–3473. AAAI Press. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16767.
  • Liu et al. (2020) Liu, S.; Ram, P.; Vijaykeerthy, D.; Bouneffouf, D.; Bramble, G.; Samulowitz, H.; Wang, D.; Conn, A.; and Gray, A. G. 2020. An ADMM Based Framework for AutoML Pipeline Configuration. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 4892–4899. AAAI Press. URL https://aaai.org/ojs/index.php/AAAI/article/view/5926.
  • Nishihara et al. (2015) Nishihara, R.; Lessard, L.; Recht, B.; Packard, A.; and Jordan, M. 2015. A General Analysis of the Convergence of ADMM. volume 37 of Proceedings of Machine Learning Research, 343–352. Lille, France: PMLR. URL http://proceedings.mlr.press/v37/nishihara15.html.
  • Poon and Liang (2019) Poon, C.; and Liang, J. 2019. Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 7355–7363. URL http://papers.nips.cc/paper/8955-trajectory-of-alternating-direction-method-of-multipliers-and-adaptive-acceleration.
  • Ren et al. (2019) Ren, A.; Zhang, T.; Ye, S.; Li, J.; Xu, W.; Qian, X.; Lin, X.; and Wang, Y. 2019. ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, 925–938. New York, NY, USA: Association for Computing Machinery. ISBN 9781450362405. doi:10.1145/3297858.3304076. URL https://doi.org/10.1145/3297858.3304076.
  • Xu et al. (2019a) Xu, K.; Chen, H.; Liu, S.; Chen, P.-Y.; Weng, T.-W.; Hong, M.; and Lin, X. 2019a. Topology attack and defense for graph neural networks: An optimization perspective. arXiv preprint arXiv:1906.04214 .
  • Xu et al. (2019b) Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019b. How Powerful are Graph Neural Networks? In International Conference on Learning Representations (ICLR). URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Yang et al. (2016) Yang, Y.; Sun, J.; Li, H.; and Xu, Z. 2016. Deep ADMM-Net for Compressive Sensing MRI. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 10–18. Curran Associates, Inc. URL http://papers.nips.cc/paper/6406-deep-admm-net-for-compressive-sensing-mri.pdf.
  • Ying et al. (2018) Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. Hierarchical Graph Representation Learning with Differentiable Pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), 4805–4815. Red Hook, NY, USA.
  • Yu (2019) Yu, H. 2019. A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 8622–8631. URL http://papers.nips.cc/paper/9068-a-communication-efficient-stochastic-multi-block-alternating-direction-method-of-multipliers.
  • Zhang et al. (2018) Zhang, T.; Ye, S.; Zhang, K.; Tang, J.; Wen, W.; Fardad, M.; and Wang, Y. 2018. A Systematic DNN Weight Pruning Framework Using Alternating Direction Method of Multipliers. In Ferrari, V.; Hebert, M.; Sminchisescu, C.; and Weiss, Y., eds., Computer Vision – ECCV 2018, 191–207. Cham: Springer International Publishing. ISBN 978-3-030-01237-3.
  • Zügner, Akbarnejad, and Günnemann (2018) Zügner, D.; Akbarnejad, A.; and Günnemann, S. 2018. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2847–2856.
  • Zügner and Günnemann (2019) Zügner, D.; and Günnemann, S. 2019. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412 .

Proof of Proposition 2

In this section, we provide the proof for Proposition 2. Similar to the proof in Proposition1 and existing works on projection (Leng et al. 2018; Zhang et al. 2018; Xu et al. 2019a), we utilize the Langrangian function and the KKT contition to derive a closed-form formula to project a given input towards the feasible set CSiC_{S_{i}}.

Proposition 2. Given CSi={Si|Si[0,1]NM×N,1TSiϵi}C_{S_{i}}=\{S_{i}\;|\;S_{i}\in[0,1]^{\frac{N}{M}\times N},1^{T}S_{i}\leq\epsilon_{i}\}, the projection of aa to CSiC_{S_{i}} is

ΠCSi(a)={P[0,1](au1)if u>0 and 1TP[0,1](au1)=ϵiP[0,1](a)if 1TP[0,1](a)ϵi\Pi_{C_{S_{i}}}(a)=\begin{cases}P_{[0,1]}(a-u1)\;\;\text{if $u>0$ and $1^{T}P_{[0,1]}(a-u1)=\epsilon_{i}$}\\ P_{[0,1]}(a)\;\;\text{if $1^{T}P_{[0,1]}(a)\leq\epsilon_{i}$}\end{cases} (23)

where P[0,1](x)=x,ifx[0,1];0,ifx<0;1,ifx>1.P_{[0,1]}(x)=x,if\;x\in[0,1];0,if\;x<0;1,if\;x>1.

Proof: We first transform the projection problem ΠCSi\Pi_{C_{S_{i}}} into an optimization problem

minR12Ra22s.t.R[0,1]NM×N1tRϵi\begin{split}\min_{R}\;\;\;\;&\frac{1}{2}||R-a||_{2}^{2}\\ \text{s.t.}\;\;\;\;&R\in[0,1]^{\frac{N}{M}\times N}\\ &1^{t}R\leq\epsilon_{i}\end{split} (24)

Then, we can derive its Langrangian function as

L(R,u)=12Ra22+I[0,1](R)+u(1TRϵi)L(R,u)=\frac{1}{2}||R-a||_{2}^{2}+I_{[0,1]}(R)+u(1^{T}R-\epsilon_{i}) (25)

where u0u\leq 0 is the dual variable. Here, I[0,1](R)=0 if R[0,1]NM×N;= otherwise.I_{[0,1]}(R)=0\text{ if }R\in[0,1]^{\frac{N}{M}\times N};=\infty\text{ otherwise.} Using the KKT condition, we have the stationary condition that

LR=(Ra)+u1+RI[0,1](R)=0\frac{\partial L}{\partial R}=(R-a)+u1+\frac{\partial}{\partial R}I_{[0,1]}(R)=0 (26)

Here, RI[0,1](R)=0 if R[0,1]NM×N;= otherwise\frac{\partial}{\partial R}I_{[0,1]}(R)=0\text{ if }R\in[0,1]^{\frac{N}{M}\times N};=\infty\text{ otherwise} We have R=P[0,1](au1)R=P_{[0,1]}(a-u1), where P[0,1](x)=x,ifx[0,1];=0,ifx<0;=1,ifx>1P_{[0,1]}(x)=x,if\;x\in[0,1];=0,if\;x<0;=1,if\;x>1, and P[0,1](x)P_{[0,1]}(x) is element-wisely applied on (au1)(a-u1). Using the KKT condition, we also have the complementary slackness

u(1TRϵi)=0u(1^{T}R-\epsilon_{i})=0 (27)

If u=0u=0, we need to have R=P[0,1](a)R=P_{[0,1]}(a). If u>0u>0, we need to have R=P[0,1](au1)R=P_{[0,1]}(a-u1) and 1TRϵi=01^{T}R-\epsilon_{i}=0. In other words, we have 1TP[0,1](au1)=ϵi1^{T}P_{[0,1]}(a-u1)=\epsilon_{i}, where uu is a scalar variable and can be solved with the bisection method (Boyd et al. 2011).