
DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks

Shiwen Ni, Jiawen Li & Hung-Yu Kao
Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan
{P78083033, P78073012}@gs.ncku.edu.tw, [email protected]
Corresponding author
Abstract

Adversarial training has been proven to be a powerful regularization method for improving the generalization of models. However, current adversarial training methods only attack the original input samples or the embedding vectors, so their attacks lack coverage and diversity. To further enhance the breadth and depth of the attack, we propose a novel masked weight adversarial training method, DropAttack, which enhances the generalization of models by adding intentionally worst-case adversarial perturbations to both the input and hidden layers in different dimensions and minimizing the adversarial risks generated by each layer. DropAttack is a general technique and can be adopted by a wide variety of neural networks with different architectures. To validate the effectiveness of the proposed method, we used five public datasets in the fields of natural language processing (NLP) and computer vision (CV) for experimental evaluation. We compare the proposed method with other adversarial training methods and regularization methods, and our method achieves state-of-the-art results on all datasets. In addition, DropAttack can achieve the same performance as standard training while using only half of the training data. Theoretical analysis reveals that DropAttack performs gradient regularization at random on some of the input and weight parameters of the model. Further visualization experiments show that DropAttack can push the minimum risk of the model to a lower and flatter loss landscape. Our source code is publicly available at https://github.com/nishiwen1214/DropAttack.

1 Introduction

Deep neural networks (DNNs) (LeCun et al., 2015) have achieved state-of-the-art performance in many artificial intelligence applications, such as natural language processing and computer vision. Regularization methods such as L1 (Tibshirani, 1996) and L2 (Tikhonov, 1943) regularization, early stopping (Morgan & Bourlard, 1989) and Dropout (Srivastava et al., 2014) play an important role in the impressive performance of deep networks by controlling model complexity, thus preventing overfitting and improving generalization. Adversarial training (Goodfellow et al., 2015) was originally proposed as a method to improve the security of machine learning systems, training neural networks to be robust to attack samples. Adversarial training is the process of training a model to minimize the maximal risk for label-preserving input perturbations. It improves not only robustness to adversarial examples, but also generalization performance on original examples. Goodfellow et al. (2015) demonstrated that adversarial training can provide regularization; even further regularization than dropout.

In this work, we mainly focus on improving the generalization performance of the model and preventing overfitting, rather than enhancing the robustness of the model to attack samples. Miyato et al. (2017) applied adversarial training to text classification tasks and found that adversarial training can effectively improve the generalization of text or RNN models on the test set. Most recent adversarial training methods attack only the input of the model. Moreover, these existing methods add perturbations to every element of the input tensor during the attack. (It should be noted that in NLP the input is the embeddings of the text, and in CV it is the value of each pixel of the image; in this paper both are uniformly called the input for convenience of expression.) To increase the breadth of the attack, we expand the attack target from the input to the weight parameters of other layers; that is, during adversarial training, the weight parameters of other layers are attacked at the same time as the input. In each iteration of the attack, we randomly mask the attack on a certain proportion of elements instead of attacking all elements in the input or weight tensor. In this way, exponentially many different attack combinations can be obtained, and the internal adversarial loss of the model can be maximized.

In this paper, we show the impact of DropAttack and various other well-known regularization methods on the generalization performance of the model. We experiment with different neural network models on five public datasets to demonstrate the effectiveness of the proposed method. We visually analyze the training and validation accuracy of models with different architectures as training progresses. We also analyze the impact of the hyperparameters and of multi-step forward-backward propagation on DropAttack. Finally, we provide a theoretical analysis of the proposed method from another perspective to further establish the effectiveness of DropAttack.

2 Related work

Adversarial training can be traced back to Goodfellow et al. (2015), who improved the robustness and generalization of a model by generating adversarial examples and injecting them into the training data. The effectiveness of adversarial training largely depends on the direction of the attack, so it is necessary to find the perturbation that maximizes the adversarial loss. Due to their linear characteristics, neural networks are easily attacked by linear perturbations. Therefore, Goodfellow et al. (2015) proposed the Fast Gradient Sign Method (FGSM) to calculate the perturbation of the input sample. They linearized the cost function around the current value of the parameters, obtaining an optimal max-norm constrained perturbation of:

\bm{r_{adv}}=\epsilon\cdot{\rm sgn}(\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)) \qquad (1)

where $\bm{\theta}$ denotes the model parameters, $\bm{x}$ is the input of the model, $y$ is the label corresponding to the input, $L$ is the cost function used to train the neural network, ${\rm sgn}$ is the sign function, and $\epsilon$ is the perturbation coefficient. In order to find a better perturbation, Miyato et al. (2017) proposed the Fast Gradient Method (FGM), which makes a simple modification to the calculation of the perturbation in FGSM. FGM was also the first application of adversarial training to text classification tasks. The formula is as follows:

\bm{r_{adv}}=\epsilon\cdot\bm{g}/\|\bm{g}\|_{2}\quad{\rm where}\quad\bm{g}=\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y). \qquad (2)
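As a concrete illustration, the FGM perturbation of Equation (2) can be computed in a few lines of PyTorch. This is a minimal sketch, assuming a text model whose attacked input is the embedding weight matrix; the attribute name `model.embedding` and the value of `epsilon` are illustrative assumptions, not taken from any released code.

```python
import torch

def fgm_perturbation(model, loss, epsilon=1.0):
    # r_adv = epsilon * g / ||g||_2, computed w.r.t. the embedding weights
    emb = model.embedding.weight
    grad = torch.autograd.grad(loss, emb, retain_graph=True)[0]
    return epsilon * grad / (grad.norm(p=2) + 1e-12)
```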

Athalye et al. (2018) proposed an adversarial training method called Projected Gradient Descent (PGD), which obtains the final perturbation through multiple forward and backward propagation iterations; the perturbation obtained in each iteration is restricted to a set range, and if it exceeds this range it is projected back onto the "sphere" of that range. To put it simply: "walk in small steps, and take a few more steps." The formula is as follows:

x_{t+1}=\prod_{x+S}\big(x_{t}+\alpha\cdot\bm{g}(\bm{x_{t}})/\|\bm{g}(\bm{x_{t}})\|_{2}\big)\quad{\rm where}\quad\bm{g}(\bm{x_{t}})=\nabla_{\bm{x}}L(\bm{\theta},\bm{x_{t}},y). \qquad (3)

where $S=\{\bm{r}\in\mathbb{R}^{d}:\|\bm{r}\|_{2}\leq\epsilon\}$ is the constraint space of the perturbation, and $\alpha$ is the step size of each "small step".
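The ascent-then-project step of Equation (3) can be sketched as follows. This is a hedged illustration in PyTorch, where `x`, `x_adv`, `grad`, `alpha`, and `epsilon` are assumed names and values rather than any particular implementation.

```python
import torch

def pgd_step(x, x_adv, grad, alpha=0.3, epsilon=1.0):
    # one "small step" along the normalized gradient
    x_adv = x_adv + alpha * grad / (grad.norm(p=2) + 1e-12)
    # project the accumulated perturbation back onto the L2 ball of radius epsilon
    delta = x_adv - x
    if delta.norm(p=2) > epsilon:
        delta = epsilon * delta / delta.norm(p=2)
    return x + delta
```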

Although many defenses were subsequently broken by Athalye et al. (2018), PGD-based adversarial training is one of the few that can withstand powerful attacks. Athalye et al. (2018) showed that PGD can largely avoid the problem of obfuscated gradients, but it still induces highly convoluted and non-linear loss surfaces, and when K is very small it can be easily broken by powerful adversaries. To be effective, PGD-based adversarial training must compute the gradient iteratively many times, which consumes a large amount of computing resources. Shafahi et al. (2019) proposed a "free" adversarial training algorithm that eliminates the overhead of computing adversarial perturbations by recycling the gradient information computed when updating model parameters. Zhang et al. (2019) effectively reduce the total number of full forward and backward propagations by restricting most of the forward and backward propagation to the first layer of the network during adversary updates. Miyato et al. (2019) proposed virtual adversarial training as a regularization method for semi-supervised learning in the text domain. Zhu et al. (2020) proposed FreeLB for improving the generalization of language models, which performs multiple PGD iterations to attack the embeddings and simultaneously accumulates the "free" parameter gradients in each iteration.

In this work, we are the first to propose an adversarial training method that simultaneously attacks the input of the model and the weight parameters of other layers to improve the generalization of the model. Our method, DropAttack, is also the first to use random masking of some elements to increase the diversity of adversarial attack combinations.

3 The Proposed DropAttack Adversarial Training Method

Figure 1: Neural network model (a) with standard adversarial training (b) and with DropAttack (c). (a) A standard neural network with 3 inputs and 12 weight parameters. (b) A new neural network produced by applying standard adversarial training to the network on the left, which attacks all inputs. (c) A new neural network produced by applying DropAttack to the network on the left. Assume that each input vector and each weight parameter have a 2/3 and 1/3 probability of being attacked, respectively.

The proposed adversarial training method is inspired by Dropout, so we name it DropAttack. Standard adversarial training seeks the optimal parameters that minimize the maximal risk of adversarial attacks. The min-max formulation is as follows:

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\Big[\mathop{{\rm max}}\limits_{r_{adv}\in S}L(\bm{\theta},\bm{x}+r_{adv},y)\Big] \qquad (4)

where $D$ is the data distribution, $y$ is the label, $L$ is the loss function, $r_{adv}$ is the perturbation that maximizes the internal risk, and $S$ is the perturbation constraint space. Here we propose a new adversarial training method, DropAttack, which simultaneously attacks the input $\bm{x}$ of the model and the weight parameters of other layers, and randomly masks some of the attacks. The overall procedure is shown in Algorithm 1. The min-max formulation of DropAttack can be expressed as:

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\Big[\mathop{{\rm max}}\limits_{r_{x}\in S}L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r_{x}},y)+\mathop{{\rm max}}\limits_{r_{\theta}\in S}L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r_{\theta}},\bm{x},y)\Big] \qquad (5)

where $\bm{r_{x}}$ and $\bm{r_{\theta}}$ are the perturbations of the input $\bm{x}$ and the parameters $\bm{\theta}$ that maximize the internal risk. We approximate these values by linearizing $L(\bm{\theta},\bm{x},y)$ around $\bm{x}$ and $\bm{\theta}$, respectively. Using the linear approximation in Equation (6) and the L2 norm constraint, the resulting adversarial perturbations are

\bm{r_{x}}\leftarrow\epsilon_{x}\cdot\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)\|_{2};\quad\bm{r_{\theta}}\leftarrow\epsilon_{\theta}\cdot\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)\|_{2} \qquad (6)

These perturbations can be easily calculated via backpropagation in a neural network.

$\bm{M_{x}}$ and $\bm{M_{\theta}}$ are the random attack masks of $r_{x}$ and $r_{\theta}$, respectively. Each attack mask is a matrix of independent Bernoulli random variables with the same dimensions as the corresponding perturbation, where each entry equals 1 with probability $p_{x}$ (or $p_{\theta}$), the attack probability. Element-wise multiplication of the perturbation matrix with the attack mask matrix randomly masks a portion of the element values in the perturbation matrix.
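The random attack masks can be drawn directly with PyTorch's Bernoulli sampler. The short sketch below is illustrative and assumes the perturbation tensors `r_x` and `r_theta` have already been computed as in Equation (6).

```python
import torch

def attack_mask(r, p):
    # each entry is 1 (attack this element) with probability p, else 0 (mask it)
    return torch.bernoulli(torch.full_like(r, p))

# masked perturbations, as used in Equation (5):
#   x_adv     = x     + attack_mask(r_x, p_x)         * r_x
#   theta_adv = theta + attack_mask(r_theta, p_theta) * r_theta
```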

Input: Training samples $\mathcal{X}$, model parameters $\bm{\theta}$, perturbation coefficients $\epsilon_{x}$ and $\epsilon_{\theta}$, attack probabilities $p_{x}$ and $p_{\theta}$, learning rate $\tau$
for epoch $=1\ldots N_{ep}$ do
      for $(\bm{x},y)\in\mathcal{X}$ do
             Compute gradients $\bm{g}$ with respect to $\bm{x}$ and $\bm{\theta}$:
                    $\bm{g_{x}}\leftarrow\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)$;  $\bm{g_{\theta}}\leftarrow\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)$
             Compute perturbations $\bm{r_{x}}$ and $\bm{r_{\theta}}$:
                    $\bm{r_{x}}\leftarrow\epsilon_{x}\cdot\bm{g_{x}}/\|\bm{g_{x}}\|_{2}$;  $\bm{r_{\theta}}\leftarrow\epsilon_{\theta}\cdot\bm{g_{\theta}}/\|\bm{g_{\theta}}\|_{2}$
             Generate random attack masks $\bm{M_{x}}$ and $\bm{M_{\theta}}$:
                    $\bm{M_{x_{ij}}}\sim{\rm Bernoulli}(p_{x})$;  $\bm{M_{\theta_{ij}}}\sim{\rm Bernoulli}(p_{\theta})$
             Compute adversarial gradient $\bm{g}_{adv}$:
                    $\bm{g}_{adv}\leftarrow\nabla_{\bm{\theta}}[L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r_{x}},y)+L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r_{\theta}},\bm{x},y)]$
             Update parameters $\bm{\theta}$:
                    $\bm{\theta}\leftarrow\bm{\theta}-\tau(\bm{g}+\bm{g}_{adv})$
       end for
end for
Output: $\bm{\theta}$
Algorithm 1: DropAttack Adversarial Training

The attack on the input in a text model is essentially an attack on the weight parameters of the embedding layer. Figure 1 (a) is a standard neural network with 12 weight parameters. Figure 1 (b) is a new neural network produced by applying standard adversarial training to the network on the left, which attacks all inputs. Figure 1 (c) is a new neural network after DropAttack is applied, where some of the weights are perturbed by the value calculated from the corresponding gradient. Compared with standard adversarial training, DropAttack maximizes the internal risk through a wider range of attacks (not limited to the input layer). Randomly masking the perturbation in some dimensions also generates a more robust and diversified embedding space.
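To make the procedure concrete, the sketch below implements one DropAttack training step (Algorithm 1) in PyTorch for a text classifier. It is a hedged sketch, not the authors' released code: the attacked tensors (`model.embedding.weight` as the input-side target and `model.lstm.weight_ih_l0` as a hidden-layer weight) and all hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dropattack_step(model, optimizer, x, y,
                    eps_x=5.0, eps_t=5.0, p_x=0.7, p_t=0.7):
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()        # clean gradient g, kept in .grad

    emb = model.embedding.weight                   # input-side target (embeddings)
    w = model.lstm.weight_ih_l0                    # hidden-layer weight target
    # r = eps * g / ||g||_2, masked element-wise by Bernoulli(p)
    r_x = eps_x * emb.grad / (emb.grad.norm() + 1e-12)
    r_t = eps_t * w.grad / (w.grad.norm() + 1e-12)
    r_x = r_x * torch.bernoulli(torch.full_like(r_x, p_x))
    r_t = r_t * torch.bernoulli(torch.full_like(r_t, p_t))

    # adversarial pass 1: perturb the embeddings only, accumulate grad into .grad
    with torch.no_grad():
        emb.add_(r_x)
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        emb.sub_(r_x)

    # adversarial pass 2: perturb the hidden-layer weight only
    with torch.no_grad():
        w.add_(r_t)
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        w.sub_(r_t)

    optimizer.step()                               # update with g + g_adv
```

In this sketch all three backward passes accumulate into `.grad`, so the final optimizer step uses $\bm{g}+\bm{g}_{adv}$ exactly as in the last line of Algorithm 1.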

3.1 DropAttack with multiple internal ascent steps

Most of the latest adversarial training methods are PGD-based. PGD-based methods are a family of adversarial training algorithms (Kurakin et al., 2017) for solving the min-max problem on the cross-entropy loss, which can be reliably achieved by taking multiple projected gradient ascent steps followed by an SGD (stochastic gradient descent) step. Multiple PGD iterations can yield a more optimized perturbation. Naturally, DropAttack can also update the perturbation with multiple forward and backward propagations. In practice, however, if a new mask were drawn at every step, each ascent step would be optimizing the perturbation for a differently masked network; since $\bm{r_{t}}$ is derived iteratively from $\bm{r_{t-1}}$, this would prevent the optimal perturbation from being obtained. Therefore, we use the same mask matrix for every perturbation update step. The overall procedure is shown in Algorithm 2. We first compute the initial gradients $\bm{g^{(0)}_{x}}=\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)$ and $\bm{g^{(0)}_{\theta}}=\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)$ with respect to $\bm{x}$ and $\bm{\theta}$, and the initial perturbations $\bm{r^{(0)}_{x}}=\epsilon_{x}\cdot\bm{g^{(0)}_{x}}/\|\bm{g^{(0)}_{x}}\|_{2}$ and $\bm{r^{(0)}_{\theta}}=\epsilon_{\theta}\cdot\bm{g^{(0)}_{\theta}}/\|\bm{g^{(0)}_{\theta}}\|_{2}$. We then generate the random attack masks $\bm{M_{x}}$ and $\bm{M_{\theta}}$, which are reused in the forward and backward propagation of every perturbation update step; that is, the mask matrices are fixed after the first step. The perturbations $\bm{r^{(t)}_{x}}=\epsilon_{x}\cdot\bm{g^{(t)}_{x}}/\|\bm{g^{(t)}_{x}}\|_{2}$ and $\bm{r^{(t)}_{\theta}}=\epsilon_{\theta}\cdot\bm{g^{(t)}_{\theta}}/\|\bm{g^{(t)}_{\theta}}\|_{2}$ are updated by gradient ascent in each iteration, where $\bm{g^{(t)}_{x}}=\bm{g^{(t-1)}_{x}}+\frac{1}{K}\nabla_{\bm{x}}L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r^{(t-1)}_{x}},y)$ and $\bm{g^{(t)}_{\theta}}=\bm{g^{(t-1)}_{\theta}}+\frac{1}{K}\nabla_{\bm{\theta}}L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(t-1)}_{\theta}},\bm{x},y)$. Finally, the model parameters $\bm{\theta}$ are updated once with the gradient accumulated over all adversarial iterations; DropAttack-K with K iterations can be expressed as:

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\Big\{\dfrac{1}{K}\sum^{K-1}_{t=0}\Big[\mathop{{\rm max}}\limits_{r^{(t)}_{x}\in S}L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r^{(t)}_{x}},y)+\mathop{{\rm max}}\limits_{r^{(t)}_{\theta}\in S}L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(t)}_{\theta}},\bm{x},y)\Big]\Big\}. \qquad (7)
Input: Training samples $\mathcal{X}$, model parameters $\bm{\theta}$, perturbation coefficients $\epsilon_{x}$ and $\epsilon_{\theta}$, attack probabilities $p_{x}$ and $p_{\theta}$, number of forward-backward propagations $K$, learning rate $\tau$
for epoch $=1\ldots N_{ep}$ do
      for $(\bm{x},y)\in\mathcal{X}$ do
            for $t=1,2\ldots K$ do
                   Compute initial gradients $\bm{g}$ with respect to $\bm{x}$ and $\bm{\theta}$:
                          $\bm{g^{(0)}_{x}}\leftarrow\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)$;  $\bm{g^{(0)}_{\theta}}\leftarrow\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)$
                   Compute initial perturbations $\bm{r^{(0)}_{x}}$ and $\bm{r^{(0)}_{\theta}}$:
                          $\bm{r^{(0)}_{x}}\leftarrow\epsilon_{x}\cdot\bm{g^{(0)}_{x}}/\|\bm{g^{(0)}_{x}}\|_{2}$;  $\bm{r^{(0)}_{\theta}}\leftarrow\epsilon_{\theta}\cdot\bm{g^{(0)}_{\theta}}/\|\bm{g^{(0)}_{\theta}}\|_{2}$
                   if $t=1$ then
                         Generate random attack masks $\bm{M_{x}}$ and $\bm{M_{\theta}}$:
                                $\bm{M_{x_{ij}}}\sim{\rm Bernoulli}(p_{x})$;  $\bm{M_{\theta_{ij}}}\sim{\rm Bernoulli}(p_{\theta})$
                    end if
                   Update the perturbations $\bm{r}$ via gradient ascent:
                          $\bm{g^{(t)}_{x}}\leftarrow\bm{g^{(t-1)}_{x}}+\frac{1}{K}\nabla_{\bm{x}}L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r^{(t-1)}_{x}},y)$
                          $\bm{g^{(t)}_{\theta}}\leftarrow\bm{g^{(t-1)}_{\theta}}+\frac{1}{K}\nabla_{\bm{\theta}}L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(t-1)}_{\theta}},\bm{x},y)$
                          $\bm{r^{(t)}_{x}}\leftarrow\epsilon_{x}\cdot\bm{g^{(t)}_{x}}/\|\bm{g^{(t)}_{x}}\|_{2}$;  $\bm{r^{(t)}_{\theta}}\leftarrow\epsilon_{\theta}\cdot\bm{g^{(t)}_{\theta}}/\|\bm{g^{(t)}_{\theta}}\|_{2}$
              end for
             Update parameters $\bm{\theta}$:
                    $\bm{\theta}\leftarrow\bm{\theta}-\tau(\bm{g^{(K)}_{x}}+\bm{g^{(K)}_{\theta}})$
       end for
end for
Output: $\bm{\theta}$
Algorithm 2: PGD-based DropAttack-K Adversarial Training

The training process is equivalent to replacing the original batch with a virtual batch K times larger, consisting of samples whose embeddings are $\bm{x}+\bm{M_{x}}\cdot\bm{r^{(0)}_{x}},\bm{x}+\bm{M_{x}}\cdot\bm{r^{(1)}_{x}},\ldots,\bm{x}+\bm{M_{x}}\cdot\bm{r^{(K-1)}_{x}}$. Similarly, multiple virtual neural networks with different weight parameters are trained, whose weight parameters are $\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(0)}_{\theta}},\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(1)}_{\theta}},\ldots,\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r^{(K-1)}_{\theta}}$, respectively. It is worth noting that our perturbation constraint $S$ does not take an additional fixed value; we only use the L2 norm to normalize the gradient, because we want diversity in the perturbation at each step rather than forcibly constraining it within a fixed spherical space. In fact, DropAttack-K also inherits the "free" ability, using the gradient average computed over the backpropagations for the external minimization.
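A condensed sketch of the K-step inner loop of Algorithm 2 is given below, showing only the input-side (embedding) perturbation for brevity; the attacked weight tensor follows the same pattern. The fixed mask and the 1/K gradient accumulation mirror the description above, while the tensor name `emb` and the hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def dropattack_k_inner(model, x, y, emb, K=3, eps_x=5.0, p_x=0.7):
    # g^(0) and r^(0)
    g = torch.autograd.grad(F.cross_entropy(model(x), y), emb)[0]
    r = eps_x * g / (g.norm() + 1e-12)
    mask = torch.bernoulli(torch.full_like(r, p_x))      # fixed for all K steps
    for _ in range(K):
        with torch.no_grad():
            emb.add_(mask * r)                            # virtual adversarial sample
        g_step = torch.autograd.grad(F.cross_entropy(model(x), y), emb)[0]
        with torch.no_grad():
            emb.sub_(mask * r)
        g = g + g_step / K                                # accumulate the "free" gradient
        r = eps_x * g / (g.norm() + 1e-12)                # r^(t)
    return g   # accumulated gradient used in the external minimization step
```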

Intuitively, compared to previous adversarial training methods, DropAttack can generate richer adversarial samples within the spherical neighborhood of the original sample, which can prevent the model from overfitting the adversarial samples to a certain extent. Empirically, there is a gap between the features of the training dataset and those of the test dataset in the high-dimensional feature space. Improving generalization essentially means narrowing the feature distribution gap between the training dataset and the test dataset. However, this gap is uncertain, so more diverse adversarial samples are needed to fill it. In principle, DropAttack should therefore yield a more significant improvement in the generalization of the model.

4 Experiment

In this section, we test and analyze the effect of DropAttack on three NLP datasets and two CV datasets. In addition, we analyze the ability of DropAttack to prevent overfitting under different sizes of training data. Additional experimental details and results are provided in Appendix A.

4.1 Datasets

Five public datasets, IMDB (Maas et al., 2011), PHEME (Zubiaga et al., 2016), AGnews (Zhang et al., 2015), MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky, 2009), are used to evaluate our DropAttack algorithm. A brief description of the datasets is given in Table 1. IMDB (Maas et al., 2011) is a standard benchmark movie review dataset for sentiment analysis. The PHEME dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The AGnews topic classification dataset was constructed by Xiang Zhang from the original AG news sources. MNIST is a standard and commonly used toy dataset of handwritten digits. The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. We divide each dataset into a training set, a validation set and a test set.

Table 1: Overview of the datasets used in this paper.
Dataset Task Classes Training Validation Test
IMDB Sentiment analysis 2 40000 5000 5000
PHEME Rumor detection 2 5145 643 637
AGnews News classification 4 110000 10000 7600
MNIST Image classification 10 50000 10000 10000
CIFAR-10 Image classification 10 40000 10000 10000

4.2 Experimental Setup

In the experiments, we chose RNN-based models and CNN-based models to handle the NLP tasks and CV tasks, respectively. For the NLP tasks, IMDB uses an LSTM (Hochreiter & Schmidhuber, 1997) (300-300 dim) layer and a fully connected layer (300-2 dim); PHEME uses a BiGRU (Cho et al., 2014) (300-300 dim) layer and a fully connected layer (600-2 dim); AGnews uses two BiLSTM (Schuster & Paliwal, 1997) (300-300 dim) layers and a fully connected layer (600-4 dim). For the CV tasks, MNIST uses the LeNet-5 (LeCun et al., 1998) model, which contains two CNN layers (1-6-16 channels, kernel size = 5) and three fully connected layers (400-120-84-10 dim); CIFAR-10 uses the VGGNet-16 (Simonyan & Zisserman, 2014) model, which contains 13 CNN layers (3-64-64-128-128-256-256-256-512-512-512-512 channels, kernel size = 3) and three fully connected layers (512-4096-4096-10 dim). All models are implemented in PyTorch; the batch size is 128, the optimizer is Adam, and the learning rate is 0.001.
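For reference, a hedged sketch of the IMDB model described above (one LSTM layer followed by a fully connected layer) is given below; the vocabulary size and the use of the last time step for classification are assumptions not specified in the text.

```python
import torch.nn as nn

class LstmClassifier(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, hidden_dim=300, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # attacked in DropAttack-(I)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1])        # classify from the last time step
```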

Methods used for comparison:

L1 (Tibshirani, 1996) and L2 (Tikhonov, 1943) regularization: Regularization terms are added to the original objective function; the L1 penalty corresponds to a Laplace prior and the L2 penalty to a Gaussian prior.

Dropout (Srivastava et al., 2014): A commonly used regularization method that randomly removes some neural units and all their input and output connections during training. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.

FGSM (Goodfellow et al., 2015): The first adversarial training method, which uses the sign function to generate the perturbation along the direction of gradient ascent.

FGM (Miyato et al., 2017): Compared with FGSM, FGM improves the calculation of the perturbation: the perturbation is obtained by dividing the linear approximation of the gradient by its L2 norm.

PGD (Athalye et al., 2018): An adversarial training method that requires K forward-backward passes through the network to calculate the optimal perturbation.

FreeAT (Shafahi et al., 2019): A PGD-based adversarial training method, which accelerates the training process by sharing the gradients of the internal maximization and the external minimization.

FreeLB (Zhu et al., 2020): A PGD-based adversarial training method, which uses the gradient average calculated over K steps to update the model parameters.

4.3 Experimental Results and Discussion

Table 2: Comparing the effects of DropAttack with various well-known regularization methods and other state-of-the-art adversarial training methods. The reported results are calculated from 5 runs with the same hyper-parameters except for the random seeds. The "Original models" used on the five datasets are LSTM, BiGRU, BiLSTM, LeNet and VGGNet, which are common neural-based deep learning models. DropAttack-(I) attacks only the input, while DropAttack-(I&W) attacks the input and weight parameters simultaneously. The best method and the best competitor are highlighted in bold and underlined, respectively.
Methods NLP Datasets CV Datasets
IMDB PHEME* AGnews MNIST CIFAR-10
Original model 88.12 84.08/78.99 91.87 98.95 84.67
Original model + L1 88.02 85.34/79.55 92.29 99.07 84.74
Original model + L2 88.27 85.67/81.29 92.43 99.14 84.63
Original model + Dropout 88.64 85.85/81.08 92.22 99.07 85.39
Original model + FGSM 88.04 85.61/80.40 92.54 99.16 71.86
Original model + FGM 89.26 84.97/78.52 92.53 99.15 85.64
Original model + PGD 89.38 85.28/79.30 92.76 99.10 85.57
Original model + FreeAT 89.17 85.29/79.32 92.45 99.09 85.45
Original model + FreeLB 89.25 85.69/81.18 92.58 99.11 85.47
Original model + DropAttack-(I) 89.76 85.75/81.33 93.35 99.16 86.05
Original model + DropAttack-(I&W) 90.36 87.15/81.31 93.37 99.27 86.09
  • *

    Note that although PHEME is a two-class classification task, the labels are not balanced, so we use both accuracy and F1-score (accuracy/F1) as the evaluation criteria.

From Table 2 we can see that, compared with the original model without DropAttack, the performance of the model after using DropAttack improves on all five datasets, with improvements of 2.24%, 3.07% and 1.5% on the three NLP datasets. Compared with other regularization methods, the overall effect of adversarial training is better. Among them, FGSM is relatively unstable: its performance on IMDB and CIFAR-10 is only 88.04% and 71.86%, respectively, because the naive perturbation value may destroy the distribution of the original data. It should be noted that PGD, FreeAT and FreeLB are all PGD-based adversarial training methods, which require multiple forward and backward propagation iterations to calculate the optimal perturbation value; in these experiments, the number of forward-backward passes K is 3. Our method achieves state-of-the-art performance on the five datasets while calculating the perturbation with only one backpropagation. (We study the influence of the number of forward-backward propagations on DropAttack later.) We can also see that DropAttack-(I&W) outperforms DropAttack-(I) on all five datasets, which shows that the adversarial training on weights is effective. In addition, based on the results in Table 2, our method yields larger improvements on the NLP datasets and smaller ones on the CV datasets, which is consistent with the experimental results in other adversarial training papers (Cheng et al., 2019; Zhao et al., 2018; Zhu et al., 2020). We believe the reason is that for images the perturbation is added directly to the original pixel values, whereas for text the words themselves are not modified and the perturbation is added to the embedding vectors. The pixel values of an image are fixed, so the perturbation may change the distribution of the original sample; the word vectors of a text, however, are not unique and fixed, so the model is more likely to learn better word vectors after perturbation is added.

Figure 2: Training and validation accuracy of different models (TextRNN, TextCNN, RCNN) with and without DropAttack on IMDB dataset.

To show more clearly the effect of DropAttack in preventing neural network overfitting, classification experiments were conducted with several different models while keeping all hyperparameters, including $\epsilon$ and $p$, fixed. Figure 2 shows the training and validation accuracy obtained for these models of different architectures (TextRNN, TextCNN and TextRCNN) as training progresses. The training accuracy under DropAttack training is basically the same as that under standard training, but the validation accuracy is higher, which demonstrates that the DropAttack adversarial training method can alleviate model overfitting. Furthermore, we can see that DropAttack adversarial training may converge more slowly in the early stage, because the objective is more difficult to optimize under attack, but after enough weight updates the validation accuracy of the model is relatively more stable. The key point is that DropAttack gives an obvious improvement across all neural networks of different architectures, without using hyperparameters tuned specifically for each architecture.

In addition, we study the effectiveness of DropAttack under training datasets of different sizes. We divide the IMDB training set into different sizes and use an LSTM model with the same structure as above; the experimental results are shown in Table 3. Compared with standard training, the performance of the model trained with DropAttack improves by more than 2% on all eight training sets of different sizes. Furthermore, we find that DropAttack can reach and even exceed the accuracy of standard training using only half of the training data. For example, DropAttack achieves 83.88% using 2,500 training examples, while standard training with 5,000 examples achieves 82.62%.

Table 3: The ability of our method DropAttack to prevent overfitting under different sizes of training data. The hyperparameters are set to $K=3$, $\epsilon_{x}=\epsilon_{\theta}=5$, $p_{x}=p_{\theta}=0.7$.
Methods Size of the training set
100 500 1000 2500 5000 10000 20000 40000
Standard Training 63.26 74.26 78.30 81.14 82.62 84.92 85.42 88.12
DropAttack-3 Training 65.46 76.34 80.70 83.88 85.22 87.02 88.86 90.42
Improvement \uparrow 2.20 2.08 2.40 2.66 2.60 2.10 3.42 2.30
Table 4: Comparing the performance of DropAttack under different numbers of forward-backward propagations K. The reported results are calculated from 5 runs with the same hyper-parameters.
Methods IMDB PHEME AGnews MNIST CIFAR-10
DropAttack-1 90.36 87.15/82.31 93.37 99.27 86.09
DropAttack-2 90.38 87.27/82.43 93.38 99.24 86.09
DropAttack-3 90.42 87.36/82.78 93.34 99.27 86.07
DropAttack-4 90.43 87.25/82.63 93.41 99.26 86.10
DropAttack-5 90.42 87.26/82.67 93.39 99.25 86.09

PGD-based DropAttack-K. We study the influence of the number of forward-backward propagations on DropAttack; the experimental results are shown in Table 4. We find that multiple iterative calculations can indeed further improve the generalization of the neural network, because multiple iterations are more likely to reach the optimal perturbation value. However, more forward-backward propagations clearly increase the training time substantially. Therefore, a reasonable number of iterations K can be selected based on available time and computing resources.

5 Theoretical Analysis

We provide another theoretical perspective to explain why the adversarial training method DropAttack can act as regularization to improve the generalization of the model and prevent overfitting. According to Section 3, the objective of DropAttack is to minimize the maximal internal adversarial risk, that is, to approximately optimize the following objective:

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\Big[\mathop{{\rm max}}\limits_{r_{x}\in S}L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r_{x}},y)+\mathop{{\rm max}}\limits_{r_{\theta}\in S}L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r_{\theta}},\bm{x},y)\Big] \qquad (8)

For Equation (8), we take the first-order Taylor expansions of $f(x)=L(\bm{\theta},\bm{x}+\bm{M_{x}}\cdot\bm{r_{x}},y)$ and $f(\theta)=L(\bm{\theta}+\bm{M_{\theta}}\cdot\bm{r_{\theta}},\bm{x},y)$ around $\bm{x}$ and $\bm{\theta}$, respectively:

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\Big\{\mathop{{\rm max}}\limits_{r_{x}\in S}\big[L(\bm{\theta},\bm{x},y)+\langle\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y),\bm{M_{x}}\cdot\bm{r_{x}}\rangle\big]+\mathop{{\rm max}}\limits_{r_{\theta}\in S}\big[L(\bm{\theta},\bm{x},y)+\langle\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y),\bm{M_{\theta}}\cdot\bm{r_{\theta}}\rangle\big]\Big\} \qquad (9)

Then, substituting the perturbation values $\bm{r_{x}}=\epsilon_{x}\cdot\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)\|_{2}$ and $\bm{r_{\theta}}=\epsilon_{\theta}\cdot\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)\|_{2}$ that maximize the adversarial loss, we obtain Equation (10):

\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\big[L(\bm{\theta},\bm{x},y)+\langle\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y),\bm{M_{x}}\cdot\epsilon_{x}\cdot\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)\|_{2}\rangle+L(\bm{\theta},\bm{x},y)+\langle\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y),\bm{M_{\theta}}\cdot\epsilon_{\theta}\cdot\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)\|_{2}\rangle\big] \qquad (10)
\Rightarrow\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\big[L(\bm{\theta},\bm{x},y)+\epsilon_{x}\cdot\bm{M_{x}}\langle\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y),\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)\|_{2}\rangle+L(\bm{\theta},\bm{x},y)+\epsilon_{\theta}\cdot\bm{M_{\theta}}\langle\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y),\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)/\|\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)\|_{2}\rangle\big] \qquad (11)
\Rightarrow\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\big[2L(\bm{\theta},\bm{x},y)+\epsilon_{x}\cdot\|\bm{M_{x}}\cdot\nabla_{\bm{x}}L(\bm{\theta},\bm{x},y)\|_{2}+\epsilon_{\theta}\cdot\|\bm{M_{\theta}}\cdot\nabla_{\bm{\theta}}L(\bm{\theta},\bm{x},y)\|_{2}\big] \qquad (12)
\Rightarrow\mathop{{\rm min}}\limits_{\bm{\theta}}\mathbb{E}_{(x,y)\sim D}\big[2{\rm Loss}+\epsilon_{x}\cdot\|\bm{M_{x}}\cdot\bm{g_{x}}\|_{2}+\epsilon_{\theta}\cdot\|\bm{M_{\theta}}\cdot\bm{g_{\theta}}\|_{2}\big] \qquad (13)

From the final optimization objective in Equation (13), we can see that every time the parameters $\bm{\theta}$ are updated, implicit gradient regularization terms $\epsilon_{x}\cdot\|\bm{M_{x}}\cdot\bm{g_{x}}\|_{2}$ and $\epsilon_{\theta}\cdot\|\bm{M_{\theta}}\cdot\bm{g_{\theta}}\|_{2}$ are effectively added to the loss for a random proportion of the input $\bm{x}$ and the parameters $\bm{\theta}$. This gradient penalty pushes the gradients of some parameters and inputs toward zero, so that the model is likely to be optimized to a flatter minimum.
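Equation (13) suggests an explicit counterpart of this implicit regularization: the task loss plus the masked L2 norms of the input and weight gradients. The sketch below is only an illustration of that objective (it uses `create_graph=True` so the penalty itself can be differentiated), not the DropAttack algorithm itself; `emb` and `w` denote the attacked tensors and are assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_penalty_objective(model, emb, w, x, y,
                               eps_x=5.0, eps_t=5.0, p_x=0.7, p_t=0.7):
    loss = F.cross_entropy(model(x), y)
    # keep the gradients in the graph so the penalty can be backpropagated
    g_x, g_t = torch.autograd.grad(loss, (emb, w), create_graph=True)
    m_x = torch.bernoulli(torch.full_like(g_x, p_x))
    m_t = torch.bernoulli(torch.full_like(g_t, p_t))
    # 2*Loss + eps_x*||M_x . g_x||_2 + eps_theta*||M_theta . g_theta||_2  (Eq. 13)
    return 2 * loss + eps_x * (m_x * g_x).norm() + eps_t * (m_t * g_t).norm()
```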

To further analyze the effectiveness of the proposed method visually, we plot the high-dimensional non-convex loss function with the visualization method proposed by Li et al. (2018). We visualize the loss landscapes around the minima of the empirical risk obtained by standard training and by DropAttack; the 2D visualizations are plotted in Figure 3 and the 3D visualizations in Figure 4. Additional loss visualizations are provided in the Appendix. We define two direction vectors, $\alpha$ and $\beta$, with the same dimensions as $\theta$, drawn from a Gaussian distribution with zero mean and a scale of the same order of magnitude as the variance of the layer weights. We then choose a center point $\theta^{*}$ and add a linear combination of $\alpha$ and $\beta$ to obtain the loss as a function of the contributions of the two random direction vectors. Finally, we define a grid of points on which to evaluate the loss, i.e., a range of values for $\delta$ and $\eta$ for which $L(\delta,\eta)$ is evaluated and stored:

L(\delta,\eta)=\mathcal{L}(\theta^{*}+\delta\alpha+\eta\beta)

The results show that the test loss $L(\delta,\eta)$ becomes lower and flatter when training with DropAttack; DropAttack indeed selects flatter loss landscapes via masked adversarial perturbations. Many studies have shown that a flatter loss landscape usually means better generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2019; Ishida et al., 2020).
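A minimal sketch of the random-direction evaluation described above follows (in the spirit of Li et al., 2018); the filter-wise normalization of that paper is omitted, a single fixed batch is used for every grid point, and all names and values are illustrative assumptions.

```python
import torch

def loss_surface(model, loss_fn, batch, grid=torch.linspace(-1.0, 1.0, 21)):
    x, y = batch
    theta_star = [p.detach().clone() for p in model.parameters()]
    # two random directions, roughly scaled to the spread of each layer's weights
    def rand_dir():
        return [torch.randn_like(p) * (p.std() if p.numel() > 1 else 1.0) for p in theta_star]
    alpha, beta = rand_dir(), rand_dir()

    surface = torch.zeros(len(grid), len(grid))
    with torch.no_grad():
        for i, d in enumerate(grid):
            for j, e in enumerate(grid):
                for p, t, a, b in zip(model.parameters(), theta_star, alpha, beta):
                    p.copy_(t + d * a + e * b)        # theta* + delta*alpha + eta*beta
                surface[i, j] = loss_fn(model(x), y)
        for p, t in zip(model.parameters(), theta_star):
            p.copy_(t)                                # restore the original parameters
    return surface
```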

Figure 3: 2D visualization of the minima of the empirical risk generated by standard training (left) and DropAttack (right) on IMDB dataset.
Figure 4: 3D visualization of the minima of the empirical risk generated by standard training (left) and DropAttack (right) on IMDB dataset.

6 Conclusion

In this work, we propose a masked weight adversarial training method, DropAttack, to improve the generalization ability of neural network models and prevent overfitting. The proposed algorithm uses gradient-based attacks on the input and weight parameters according to a given probability, and enhances the generalization of the model by minimizing the resulting adversarial risk. Experimental results show that DropAttack can effectively improve the generalization of models and prevent overfitting, especially in the NLP field. In addition, we theoretically showed that our algorithm regularizes the gradients of the model parameters. Therefore, DropAttack can improve both the robustness and the generalization of the model. Adversarial training still consumes more computing resources and time than standard stochastic gradient descent, so accelerating adversarial training while improving generalization is a valuable direction for future research.

References

  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp. 274–283. PMLR, 2018.
  • Cheng et al. (2019) Yong Cheng, Lu Jiang, and Wolfgang Macherey. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4324–4333, 2019.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Ishida et al. (2020) Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi Sugiyama. Do we need zero training loss after achieving zero training error? In International Conference on Machine Learning, pp. 4604–4614. PMLR, 2020.
  • Keskar et al. (2019) Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, 2019.
  • Krizhevsky (2009) A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • Kurakin et al. (2017) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ICLR, 2017.
  • LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp.  6391–6401, 2018.
  • Maas et al. (2011) Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp.  142–150, 2011.
  • Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. ICLR, 2017.
  • Miyato et al. (2019) Takeru Miyato, Shin-Ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2019.
  • Morgan & Bourlard (1989) Nelson Morgan and Hervé Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. Advances in neural information processing systems, 2:630–637, 1989.
  • Schuster & Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Shafahi et al. (2019) Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.  3358–3369, 2019.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Tikhonov (1943) Andrey Nikolayevich Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pp.  195–198, 1943.
  • Zhang et al. (2019) Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32:227–238, 2019.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28:649–657, 2015.
  • Zhao et al. (2018) Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.
  • Zhu et al. (2020) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In ICLR, 2020.
  • Zubiaga et al. (2016) Arkaitz Zubiaga, Maria Liakata, and Rob Procter. Learning reporting dynamics during breaking news for rumour detection in social media. arXiv preprint arXiv:1610.07363, 2016.

Appendix A Additional Experimental Details and Results

To demonstrate the effectiveness of DropAttack, we conducted extensive experiments on five datasets: IMDB, PHEME, AGnews, MNIST and CIFAR-10. The detailed experimental settings and results are shown in Table 5, Table 6, Table 7, Table 8 and Table 9. The square brackets after DropAttack indicate the objects of perturbation; for example, DropAttack[Embedding, Lstm.ih.w] means adding perturbations to the Embedding and Lstm.ih.w parameters.

Table 5: Experimental details on the IMDB dataset.
Method Accuracy (%)
LSTM 88.12
LSTM + DropAttack[Embedding] ( e = 5, p = 0.5 ) 89.76
LSTM + DropAttack[Embedding, Fc.w] ( e = 5, p = 0.5 ) 88.60
LSTM + DropAttack[Lstm.hh.w] ( e = 5, p = 0.5 ) 86.64
LSTM + DropAttack[Lstm.ih.w] ( e = 5, p = 0.5 ) 89.80
LSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.5 ) 87.90
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.5 ) 90.21
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 3, p = 0.7 ) 90.22
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.7 ) 90.34
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 7, p = 0.7 ) 90.36
LSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.5 ) 89.56
LSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.6 ) 89.57
Table 6: Experimental details on the PHEME dataset.
Method Accuracy/F1 score (%)
BiLSTM 84.08/78.99
BiLSTM + DropAttack[Embedding] ( e = 5, p = 0.5 ) 85.69/79.97
BiLSTM + DropAttack[Lstm.hh.w] ( e = 5, p = 0.5 ) 86.00/80.03
BiLSTM + DropAttack[Lstm.ih.w] ( e = 5, p = 0.5 ) 83.40/78.64
BiLSTM + DropAttack[Fc.w] ( e = 5, p = 0.5 ) 84.60/79.32
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.5 ) 87.14/81.02
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.6 ) 87.14/81.04
BiLSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.5 ) 86.74/80.13
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.7 ) 87.15/81.31
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.8 ) 87.11/81.24
BiLSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.5 ) 85.54/79.57
BiLSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.7 ) 85.35/79.36
Table 7: Experimental details on the AGnews dataset.
Method Accuracy (%)
BiGRU 91.87
BiGRU+ DropAttack[Embedding] ( e = 5, p = 0.5 ) 93.35
BiGRU+ DropAttack[Embedding] ( e = 5, p = 0.7 ) 93.34
BiGRU + DropAttack[Gru.hh.w] ( e = 5, p = 0.5 ) 92.25
BiGRU + DropAttack[Gru.ih.w] ( e = 5, p = 0.5 ) 92.46
BiGRU + DropAttack[Gru.hh.w, Gru.ih.w] ( e = 5, p = 0.5 ) 92.70
BiGRU + DropAttack[Fc.w] ( e = 5, p = 0.5 ) 92.24
BiGRU + DropAttack[Embedding, Gru.ih.w] ( e = 5, p = 0.5 ) 93.12
BiGRU + DropAttack[Embedding, Gru.ih.w] ( e = 5, p = 0.7 ) 93.37
BiGRU + DropAttack[Embedding, Gru.hh.w] ( e = 5, p = 0.5 ) 92.70
BiGRU + DropAttack[Embedding, Gru.hh.w, Gru.ih.w] ( e = 5, p = 0.5 ) 93.12
BiGRU + DropAttack[Embedding, Gru.ih.w, Gru.ih.w.reverse] ( e = 5, p = 0.5 ) 92.88
Table 8: Experimental details on the MNIST dataset.
Method Accuracy (%)
LeNet-5 98.95
LeNet-5 + DropAttack[Input] ( e = 5, p = 0.5 ) 99.16
LeNet-5 + DropAttack[Conv.1.w] ( e = 5, p = 0.5 ) 99.08
LeNet-5 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.5 ) 99.27
LeNet-5 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.7 ) 99.25
LeNet-5 + DropAttack[Input, Conv.2.w] ( e = 5, p = 0.5 ) 99.12
LeNet-5 + DropAttack[Input, Conv.1.b] ( e = 5, p = 0.5 ) 99.10
LeNet-5 + DropAttack[Conv.2.w] ( e = 5, p = 0.5 ) 99.11
LeNet-5 + DropAttack[Conv.1.b, Conv.2.w] ( e = 5, p = 0.5 ) 99.09
LeNet-5 + DropAttack[Conv.1.w, Conv.2.w] ( e = 5, p = 0.5 ) 98.78
LeNet-5 + DropAttack[Conv.1.w, Fc.1] ( e = 5, p = 0.5 ) 98.93
LeNet-5 + DropAttack[Conv.2.w, Fc.1] ( e = 5, p = 0.5 ) 99.10
LeNet-5 + DropAttack[Conv.2.w, Fc.2] ( e = 5, p = 0.5 ) 99.05
LeNet-5 + DropAttack[Conv1.w, Conv2.w, fc1.w2] ( e = 5, p = 0.5 ) 98.48
Table 9: Experimental details on the CIFAR-10 dataset.
Method Accuracy (%)
VGGNet-16 84.67
VGGNet-16 + DropAttack[Input] ( e = 5, p = 0.5 ) 86.02
VGGNet-16 + DropAttack[Conv.1.w] ( e = 5, p = 0.5 ) 83.61
VGGNet-16 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.5 ) 86.09
VGGNet-16 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.7 ) 86.02
VGGNet-16 + DropAttack[Input, Conv.3.w] ( e = 5, p = 0.7 ) 85.13
VGGNet-16 + DropAttack[Input, Conv.5.w] ( e = 5, p = 0.7 ) 85.13
VGGNet-16 + DropAttack[BatchNorm.1.w] ( e = 5, p = 0.5 ) 85.51
VGGNet-16 + DropAttack[Conv.2.w] ( e = 5, p = 0.5 ) 83.16
VGGNet-16 + DropAttack[Conv.6.w] ( e = 5, p = 0.5 ) 85.27
VGGNet-16 + DropAttack[BatchNorm.8.w] ( e = 5, p = 0.5 ) 85.32
VGGNet-16 + DropAttack[Conv.1.w, BatchNorm.1.w] ( e = 5, p = 0.5 ) 84.41
VGGNet-16 + DropAttack[Input, Conv.1.w, BatchNorm.1.w] ( e = 5, p = 0.5 ) 85.01

An important research question is: in addition to perturbing the input layer, which layer's weight parameters are best to perturb? According to our experimental results and experience, for the perturbation of hidden-layer parameters, layers close to the input layer work better than layers close to the output layer. Essentially, perturbing a hidden layer perturbs a higher-dimensional embedding of the input. Due to the highly linear nature of neural networks, small changes in the input vector may change the outputs of multiple layers. Therefore, the deeper weight parameters need to be robust enough to resist overfitting and avoid fitting to overly sensitive input features.

Appendix B Hyperparameter sensitivity analysis

Table 10: The influence of the hyperparameters perturbation coefficient $\epsilon$ and attack probability $p$ on model performance. The tested dataset is IMDB, and the model structure is the same as that in Table 2. The top ten performances are highlighted in bold.
Perturbation coefficient*
ϵ = 0.01 ϵ = 0.1 ϵ = 1 ϵ = 3 ϵ = 5 ϵ = 7 ϵ = 9
Attack probability*
P = 0.0 88.12 88.12 88.12 88.12 88.12 88.12 88.12
P = 0.1 89.60 89.78 89.40 88.74 89.30 88.88 89.26
P = 0.3 89.98 90.12 90.02 90.10 90.04 90.28 89.74
P = 0.5 90.13 90.16 90.01 90.25 90.21 90.30 89.94
P = 0.7 90.17 90.32 90.18 90.20 90.18 90.36 90.14
P = 0.9 89.76 90.22 90.16 89.22 90.12 90.04 90.18
P = 1.0 89.86 89.54 89.74 89.40 90.02 89.90 90.10
  • *

    Note that $\epsilon_{x}$ and $\epsilon_{\theta}$ are uniformly denoted by $\epsilon$; $p_{x}$ and $p_{\theta}$ are uniformly denoted by $p$.

DropAttack has three tunable hyperparameters: the perturbation coefficient $\epsilon$, the attack probability $p$ (the probability of attacking a weight parameter in the network), and the number of forward-backward propagations K. We explore the effect of varying these hyperparameters. First, we fixed K to 1 and let $\epsilon$ take the values [0.01, 0.1, 1, 3, 5, 7, 9] and $p$ the values [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1], giving 7 x 7 = 49 hyperparameter combinations; the experimental results are shown in Table 10. The best performance is 90.36% when $\epsilon=7$ and $p=0.7$. When $p=0$, the model is trained in the standard way without any attack, and its accuracy is the lowest. We also find that the effect is significantly worse when $p$ is less than 0.3 or greater than 0.9. When $p=1$, random masking is not used; the performance still improves, but only slightly, because the attack combinations lack diversity. Based on the experimental results, we suggest setting the hyperparameter $p$ between 0.5 and 0.7.

As shown in Figure 5, we study the impact of different attack probabilities on model performance under different perturbation coefficients. For attack probabilities from 0 to 0.7, the performance of the model increases as the attack probability increases, because the intensity of the attack becomes stronger. However, as the attack probability increases from 0.7 to 1, the performance decreases instead, because an excessively high attack probability reduces the diversity of attack combinations; when $p=1$, the method degenerates into a standard adversarial attack.

Figure 5: The impact of the attack probability ($p\in[0,0.1,0.3,0.5,0.7,0.9,1]$) on model performance under a fixed perturbation coefficient ($\epsilon\in[0.1,1,3,5,7,9]$).

Appendix C Additional Loss Visualization

We visualize the test loss function landscapes of the standard training and DropAttack adversarial training models separately. The 2D and 3D visualization results are shown in Figure 6 and Figure 7, respectively. The structure and parameters of the models are derived from Section 4.2.

Figure 6: 2D visualization of the minima of the empirical risk selected by standard training and DropAttack on IMDB, PHEME, AGnews, MNIST, CIFAR-10 datasets, respectively.
Figure 7: 3D visualization of the minima of the empirical risk selected by standard training and DropAttack on IMDB, PHEME, AGnews, MNIST, CIFAR-10 datasets, respectively.