DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks
Abstract
Adversarial training has been proven to be a powerful regularization method for improving the generalization of models. However, current adversarial training methods only attack the original input samples or the embedding vectors, so their attacks lack coverage and diversity. To further increase the breadth and depth of the attack, we propose a novel masked weight adversarial training method called DropAttack, which enhances generalization by adding intentionally worst-case adversarial perturbations to both the input and hidden layers in different dimensions and by minimizing the adversarial risk generated at each layer. DropAttack is a general technique that can be applied to a wide variety of neural network architectures. To validate the effectiveness of the proposed method, we used five public datasets from natural language processing (NLP) and computer vision (CV) for experimental evaluation. We compare the proposed method with other adversarial training and regularization methods, and our method achieves state-of-the-art results on all datasets. In addition, DropAttack can match the performance of standard training while using only half of the training data. Theoretical analysis reveals that DropAttack performs gradient regularization on a random subset of the input and weight parameters of the model. Further visualization experiments show that DropAttack pushes the minimum risk of the model to a lower and flatter loss landscape. Our source code is publicly available at https://github.com/nishiwen1214/DropAttack.
1 Introduction
Deep neural networks (DNNs) (LeCun et al., 2015) have achieved state-of-the-art performance in many artificial intelligence applications, such as natural language processing and computer vision. Regularization methods such as L1 (Tibshirani, 1996) and L2 (Tikhonov, 1943) regularization, early stopping (Morgan & Bourlard, 1989) and Dropout (Srivastava et al., 2014) play an important role in the impressive performance of deep networks by controlling model complexity, thereby preventing overfitting and improving generalization. Adversarial training (Goodfellow et al., 2015) was originally proposed as a method to improve the security of machine learning systems, in order to train neural networks that are robust to adversarial examples. Adversarial training is the process of training a model to minimize the maximal risk under label-preserving input perturbations. It improves not only robustness to adversarial examples, but also generalization performance on the original examples. Goodfellow et al. (2015) demonstrated that adversarial training can provide regularization, even beyond that of dropout.
In this work, we mainly focus on improving the generalization performance of the model and preventing overfitting, rather than enhancing the robustness of the model to attack samples. Miyato et al. (2017) applied adversarial training to text classification tasks and found that it can effectively improve the generalization of text (RNN) models on the test set. Most recent adversarial training methods attack only the input of the model, and they add a perturbation to every element of the input tensor during the attack. (Note that the input in the NLP field is the embedding of the text, while in the CV field it is the value of each pixel of the image; in this paper we uniformly call it the input for convenience.) To increase the breadth of the attack, we expand the attack target from the input to the weight parameters of other layers; that is, during adversarial training we attack the weight parameters of other layers while attacking the input. In each iteration of the attack, we randomly mask the attack on a certain proportion of elements instead of attacking all elements of the input or weight tensor. In this way, exponentially many different attack combinations can be obtained, and the internal adversarial loss of the model can be maximized.
In this paper, we show the impact of DropAttack and various other well-known regularization methods on the generalization performance of models. We experiment with different neural network models on five public datasets to demonstrate the effectiveness of the proposed method. We visually analyze how the training and validation accuracy of models with different architectures evolves as training progresses, and we analyze the impact of the hyperparameters and of multi-step forward-backward optimization on DropAttack. Finally, we provide a theoretical analysis of the proposed method from another perspective to further support the effectiveness of DropAttack.
2 Related work
Adversarial training can be traced back to Goodfellow et al. (2015), in which the model improves its robustness and generalization by generating adversarial examples and injecting them into the training data. The effectiveness of adversarial training largely depends on the direction of the attack, so it is necessary to find the perturbation that maximizes the adversarial loss. Due to their linear characteristics, neural networks are easily attacked by linear perturbations. Therefore, Goodfellow et al. (2015) proposed the Fast Gradient Sign Method (FGSM) to calculate the perturbation of the input sample. They linearized the cost function around the current value of the parameters, obtaining an optimal max-norm constrained perturbation of:
$$r_{adv} = \epsilon \cdot \mathrm{sgn}\left(\nabla_x L(\theta, x, y)\right) \tag{1}$$
where $\theta$ denotes the model parameters, $x$ is the input of the model, $y$ is the label corresponding to the input, and $L$ is the cost function used to train the neural network; sgn is the sign function and $\epsilon$ is the perturbation coefficient. In order to find a better perturbation, Miyato et al. (2017) proposed the Fast Gradient Method (FGM), which makes a simple modification to the calculation of the perturbation in FGSM; FGM was also the first application of adversarial training to text classification tasks. The formula is as follows:
$$r_{adv} = \epsilon \cdot \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_x L(\theta, x, y) \tag{2}$$
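As an illustration of how these single-step perturbations are obtained in practice, the sketch below computes the FGSM and FGM perturbations of Equations (1) and (2) with PyTorch autograd. It is a minimal sketch, not code from the cited papers: `model`, `x`, `y`, the cross-entropy loss and the default `epsilon` are illustrative assumptions, and `x` is assumed to be a continuous tensor (for text, the same computation would be applied to the embedding vectors).

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, epsilon=0.1):
    """Eq. (1): r_adv = epsilon * sgn(grad_x L(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return epsilon * grad.sign()

def fgm_perturbation(model, x, y, epsilon=0.1):
    """Eq. (2): r_adv = epsilon * g / ||g||_2 with g = grad_x L(theta, x, y)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return epsilon * grad / (grad.norm(p=2) + 1e-12)  # small constant avoids division by zero
```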
Projected Gradient Descent (PGD) (Athalye et al., 2018) is an adversarial training method that obtains the final perturbation through multiple forward and backward propagation iterations; the perturbation obtained in each iteration is limited to a set range, and if it exceeds this range it is projected back onto the "sphere" of the range. To put it simply: "walk in small steps, take a few more steps." The formula is as follows:
$$x_{t+1} = \Pi_{x+S}\left(x_t + \alpha \cdot \mathrm{sgn}\left(\nabla_x L(\theta, x_t, y)\right)\right) \tag{3}$$
where $S$ is the constraint space of the perturbation and $\alpha$ is the step size of each "small step".
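The "small steps" intuition can be made concrete with the short sketch below, a hedged illustration of the K-step PGD attack in Equation (3) rather than any official implementation. It assumes a continuous input, an L-infinity constraint set of radius `epsilon`, and a step size `alpha`; `model`, `x` and `y` are placeholders.

```python
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, epsilon=0.03, alpha=0.01, steps=3):
    delta = torch.zeros_like(x)              # start from the original sample
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # "Walk in small steps": one signed gradient-ascent step on the perturbation,
        delta = delta.detach() + alpha * grad.sign()
        # then project back into the allowed range S around x.
        delta = delta.clamp(-epsilon, epsilon)
    return delta
```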
Although many defenses have since been broken by Athalye et al. (2018), PGD-based adversarial training is one of the few that can withstand powerful attacks. Athalye et al. (2018) showed that PGD can largely avoid the problem of gradient obfuscation, but it still yields highly convoluted and non-linear loss surfaces, and when K is very small it is easily broken by powerful adversaries. Effective PGD-based adversarial training must iteratively calculate the gradient many times, which consumes a lot of computing resources. Shafahi et al. (2019) proposed a "free" adversarial training algorithm that eliminates the overhead of calculating adversarial perturbations by recycling the gradient information computed when updating model parameters. Zhang et al. (2019) effectively reduce the total number of full forward and backward propagations by restricting most of the forward and backward propagation to the first layer of the network during adversary updates. Miyato et al. (2019) proposed virtual adversarial training as a regularization method for semi-supervised learning in the text domain. Zhu et al. (2020) propose FreeLB to improve the generalization of language models; it performs multiple PGD iterations to attack the embeddings and simultaneously accumulates the "free" parameter gradients in each iteration.
In this work, we are the first to propose an adversarial training method that simultaneously attacks both the input of the model and the weight parameters of other layers to improve generalization. Our method, DropAttack, is also the first to randomly mask some elements of the perturbation to increase the diversity of adversarial attack combinations.
3 The Proposed DropAttack Adversarial Training Method

The proposed adversarial training method is inspired by Dropout, so we name it DropAttack. Standard adversarial training seeks the optimal parameters that minimize the maximum risk under adversarial attack. The Min-Max formulation is as follows:
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{r_{adv}\in\Omega} L(\theta, x + r_{adv}, y)\right] \tag{4}$$
where $\mathcal{D}$ is the data distribution, $y$ is the label, and $L$ is the loss function; $r_{adv}$ is the perturbation that maximizes the internal risk, and $\Omega$ is the perturbation constraint space. Here we propose a new adversarial training method, DropAttack, which simultaneously attacks the input $x$ of the model and the weight parameters $\theta$ of other layers, and randomly masks some of the attacks. The overall procedure is shown in Algorithm 1. The Min-Max formula of DropAttack can be expressed as:
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{r_x\in\Omega} L(\theta, x + M_x \odot r_x, y) + \max_{r_\theta\in\Omega} L(\theta + M_\theta \odot r_\theta, x, y)\right] \tag{5}$$
where $r_x$ and $r_\theta$ are the perturbations of the input $x$ and the parameters $\theta$ that maximize the internal risk. We approximate these values by linearizing $L$ around $x$ and $\theta$, respectively. Using this linear approximation and the L2 norm constraint, the resulting adversarial perturbations are
$$r_x = \epsilon_x \frac{g_x}{\|g_x\|_2},\; g_x = \nabla_x L(\theta, x, y); \qquad r_\theta = \epsilon_\theta \frac{g_\theta}{\|g_\theta\|_2},\; g_\theta = \nabla_\theta L(\theta, x, y) \tag{6}$$
These perturbations can be easily calculated using backpropagation in a neural network.
$M_x$ and $M_\theta$ are the random attack masks of $r_x$ and $r_\theta$, respectively. Each attack mask is a matrix of independent Bernoulli random variables with the same dimensions as the corresponding perturbation, where each Bernoulli variable equals 1 with the attack probability $p$. Multiplying the perturbation matrix element-wise by the attack mask matrix randomly masks a portion of the element values in the perturbation matrix.
The attack on the input in a text model is essentially an attack on the weight parameters of the embedding layer. Figure 1 (a) shows a standard neural network with 12 weight parameters. Figure 1 (b) is the network produced by applying standard adversarial training, which attacks all inputs. Figure 1 (c) is the network after DropAttack is applied, where some of the weights have the perturbation computed from their corresponding gradients added to them. Compared with standard adversarial training, DropAttack maximizes the internal risk through a wider range of attacks (not limited to the input layer), and randomly masks the perturbation in some dimensions, which produces a more robust and diversified embedding space.
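To make the procedure concrete, the following is a simplified single-step sketch of a DropAttack update written against the description above; it is not the authors' released code. It assumes an NLP model exposing `model.embedding.weight` as the attacked "input" side and `model.lstm.weight_ih_l0` as one attacked hidden-layer weight, and it accumulates the adversarial gradients on top of the clean gradients before the optimizer step, which is one reasonable reading of Algorithm 1.

```python
import torch
import torch.nn.functional as F

def dropattack_step(model, optimizer, x, y, eps_x=5.0, eps_w=5.0, p=0.7):
    emb = model.embedding.weight      # attacked "input" side (embedding weights, assumption)
    w = model.lstm.weight_ih_l0       # attacked hidden-layer weight (assumption)

    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()            # clean forward/backward pass

    with torch.no_grad():
        # Eq. (6): r = eps * g / ||g||_2, then mask with Bernoulli(p) attack masks.
        r_emb = eps_x * emb.grad / (emb.grad.norm() + 1e-12)
        r_w = eps_w * w.grad / (w.grad.norm() + 1e-12)
        m_emb = torch.bernoulli(torch.full_like(r_emb, p))
        m_w = torch.bernoulli(torch.full_like(r_w, p))
        emb.add_(m_emb * r_emb)                        # apply masked perturbations
        w.add_(m_w * r_w)

    F.cross_entropy(model(x), y).backward()            # adversarial pass; gradients accumulate

    with torch.no_grad():                              # restore the original weights
        emb.sub_(m_emb * r_emb)
        w.sub_(m_w * r_w)

    optimizer.step()                                   # update with clean + adversarial gradients
```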
3.1 DropAttack with multiple internal ascent steps
Most of the latest adversarial training methods are PGD-based. PGD-based methods are a family of adversarial training algorithms (Kurakin et al., 2017) for solving the min-max problem on the cross-entropy loss, which is reliably approached by using multiple projected gradient ascent steps followed by an SGD (Stochastic Gradient Descent) step. Multiple PGD iterations can yield a more optimized perturbation, and DropAttack can likewise use multiple forward and backward propagations to update the perturbation. In practice, however, if the masks were re-sampled at every step, each ascent step would optimize the perturbation for a differently masked weight network; because successive perturbations are related iteratively, this would prevent us from obtaining the optimal perturbation. Therefore, we use the same mask matrices for every perturbation-update step. The overall procedure is shown in Algorithm 2. We first calculate the initial gradients of $x$ and $\theta$ and the initial perturbations $r_x^{(0)}$ and $r_\theta^{(0)}$. We then generate the random attack masks $M_x$ and $M_\theta$, which are reused in the forward and backward propagation of every subsequent perturbation update; that is, the mask matrices are fixed after the first update step. The perturbation values $r_x^{(t)}$ and $r_\theta^{(t)}$ are updated by gradient ascent in each iteration, following the normalized-gradient form of Equation (6). Finally, the model parameter $\theta$ is updated once with the gradients accumulated over the adversarial iterations; DropAttack-K with K iterations can be expressed as:
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\frac{1}{K}\sum_{t=0}^{K-1}\Big(L(\theta, x + M_x \odot r_x^{(t)}, y) + L(\theta + M_\theta \odot r_\theta^{(t)}, x, y)\Big)\right] \tag{7}$$
The training process is equivalent to replacing the original batch with a K-times larger virtual batch, consisting of samples whose embeddings are $x + M_x \odot r_x^{(0)}, \ldots, x + M_x \odot r_x^{(K-1)}$. Similarly, multiple virtual neural networks with different weight parameters, $\theta + M_\theta \odot r_\theta^{(0)}, \ldots, \theta + M_\theta \odot r_\theta^{(K-1)}$, are trained. It is worth noting that our perturbation constraint does not use an additional fixed bound; we only use the L2 norm to normalize the gradient, because we want diversity in the perturbation at each step rather than forcibly constraining it within a fixed spherical space. In fact, DropAttack-K also inherits the "free" ability, using the gradient averaged over each backpropagation for the external minimization.
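Under the same assumptions as the single-step sketch in Section 3, a condensed sketch of DropAttack-K is given below. It fixes the attack masks after they are first sampled, updates the perturbations with K normalized gradient-ascent steps, and averages the gradients accumulated over the K passes before the parameter update; it is an illustration of the procedure described above, not a verbatim reproduction of Algorithm 2.

```python
import torch
import torch.nn.functional as F

def dropattack_k_step(model, optimizer, x, y, eps=5.0, p=0.7, K=3):
    attacked = [model.embedding.weight, model.lstm.weight_ih_l0]   # attacked tensors (assumption)
    masks = None
    perts = [torch.zeros_like(w) for w in attacked]
    prev_grads = [torch.zeros_like(w) for w in attacked]

    optimizer.zero_grad()
    for _ in range(K):
        F.cross_entropy(model(x), y).backward()        # gradients accumulate over the K passes
        with torch.no_grad():
            if masks is None:                          # masks are sampled once, then kept fixed
                masks = [torch.bernoulli(torch.full_like(w, p)) for w in attacked]
            for i, (w, m, r) in enumerate(zip(attacked, masks, perts)):
                g = w.grad - prev_grads[i]             # gradient contributed by this pass only
                prev_grads[i] = w.grad.clone()
                w.sub_(m * r)                          # remove the previous perturbation
                r.add_(eps * g / (g.norm() + 1e-12))   # normalized gradient-ascent step on r
                w.add_(m * r)                          # re-apply the masked perturbation
    with torch.no_grad():
        for w, m, r in zip(attacked, masks, perts):    # restore the original weights
            w.sub_(m * r)
        for group in optimizer.param_groups:           # "free": average gradients over K passes
            for param in group["params"]:
                if param.grad is not None:
                    param.grad.div_(K)
    optimizer.step()
```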
Intuitively, compared with previous adversarial training methods, DropAttack can generate richer adversarial samples in the spherical space around the original sample, which prevents the model from overfitting to the adversarial samples to a certain extent. Empirically, there is a gap between the features of the training set and those of the test set in the high-dimensional feature space. Improving generalization essentially means narrowing the feature-distribution gap between the training set and the test set. However, this gap is uncertain, so more diverse adversarial samples are needed to fill it. In theory, DropAttack therefore yields a more significant improvement in model generalization.
4 Experiment
In this section, we test and analyze the effect of DropAttack on three NLP datasets and two CV datasets. In addition, we analyze the ability of DropAttack to prevent overfitting under different sizes of training data. Additional experimental details and results are provided in Appendix A.
4.1 Datasets
Five public datasets, IMDB (Maas et al., 2011), PHEME (Zubiaga et al., 2016), AGnews (Zhang et al., 2015), MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky, 2009), are used to evaluate our DropAttack algorithm. A brief description of the datasets is given in Table 1. IMDB (Maas et al., 2011) is a standard benchmark movie review dataset for sentiment analysis. The PHEME dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The AGnews topic classification dataset was constructed by Xiang Zhang from the original AG news sources. MNIST is a standard and commonly used toy dataset of handwritten digits. The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. We divide each dataset into a training set, a validation set and a test set.
Dataset | Task | Classes | Training | Validation | Test |
---|---|---|---|---|---|
IMDB | Sentiment analysis | 2 | 40000 | 5000 | 5000 |
PHEME | Rumor detection | 2 | 5145 | 643 | 637 |
AGnews | News classification | 4 | 110000 | 10000 | 7600 |
MNIST | Image classification | 10 | 50000 | 10000 | 10000 |
CIFAR-10 | Image classification | 10 | 40000 | 10000 | 10000 |
4.2 Experimental Setup
In the experiments, we chose RNN-based models for the NLP tasks and CNN-based models for the CV tasks. For the NLP tasks, IMDB uses an LSTM (Hochreiter & Schmidhuber, 1997) layer (300-300 dim) and a fully connected layer (300-2 dim); PHEME uses a BiGRU (Cho et al., 2014) layer (300-300 dim) and a fully connected layer (600-2 dim); AGnews uses two BiLSTM (Schuster & Paliwal, 1997) layers (300-300 dim) and a fully connected layer (600-4 dim). For the CV tasks, MNIST uses the LeNet-5 (LeCun et al., 1998) model, which contains two CNN layers (1-6-16 channels, kernel size = 5) and three fully connected layers (400-120-84-10 dim); CIFAR-10 uses the VGGNet-16 (Simonyan & Zisserman, 2014) model, which contains 13 CNN layers (3-64-64-128-128-256-256-256-512-512-512-512-512-512 channels, kernel size = 3) and three fully connected layers (512-4096-4096-10 dim). All models are implemented in PyTorch; the batch size is 128, the optimizer is Adam, and the learning rate is 0.001.
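For concreteness, a minimal sketch of the IMDB classifier configuration described above is given below (an embedding layer followed by a 300-300 LSTM and a 300-2 fully connected layer, trained with Adam at a learning rate of 0.001 and batch size 128). The vocabulary size and the use of the final hidden state are assumptions not specified in the text; the attribute names `embedding` and `lstm` match those assumed in the sketches of Section 3.

```python
import torch
import torch.nn as nn

class IMDBClassifier(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=300, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len) integer tensor
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)
        return self.fc(h_n[-1])                   # classify from the final hidden state

model = IMDBClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, lr = 0.001, batch size 128
```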
Methods used for comparison:
L1 (Tibshirani, 1996) and L2 (Tikhonov, 1943) regularization: regularization constraints are added to the original objective function; the L1 penalty corresponds to a Laplace prior and the L2 penalty to a Gaussian prior over the weights.
Dropout (Srivastava et al., 2014): a commonly used regularization method that randomly removes some neural units, together with all their input and output connections, during training. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.
FGSM (Goodfellow et al., 2015): the first adversarial training method; it uses the sign function to generate the perturbation in the direction of gradient ascent.
FGM (Miyato et al., 2017): compared with FGSM, FGM improves the calculation of the perturbation, which is obtained by dividing the gradient by its L2 norm.
PGD (Athalye et al., 2018): an adversarial training method that requires K forward-backward passes through the network to calculate the optimal perturbation.
FreeAT (Shafahi et al., 2019): a PGD-based adversarial training method that accelerates training by sharing the gradients of the internal maximization and the external minimization.
FreeLB (Zhu et al., 2020): a PGD-based adversarial training method that uses the gradient averaged over K steps to update the model parameters.
4.3 Experimental Results and Discussion
Methods | IMDB (NLP) | PHEME* (NLP) | AGnews (NLP) | MNIST (CV) | CIFAR-10 (CV)
---|---|---|---|---|---
Original model | 88.12 | 84.08/78.99 | 91.87 | 98.95 | 84.67 |
Original model + L1 | 88.02 | 85.34/79.55 | 92.29 | 99.07 | 84.74 |
Original model + L2 | 88.27 | 85.67/81.29 | 92.43 | 99.14 | 84.63 |
Original model + Dropout | 88.64 | 85.85/81.08 | 92.22 | 99.07 | 85.39 |
Original model + FGSM | 88.04 | 85.61/80.40 | 92.54 | 99.16 | 71.86 |
Original model + FGM | 89.26 | 84.97/78.52 | 92.53 | 99.15 | 85.64 |
Original model + PGD | 89.38 | 85.28/79.30 | 92.76 | 99.10 | 85.57 |
Original model + FreeAT | 89.17 | 85.29/79.32 | 92.45 | 99.09 | 85.45 |
Original model + FreeLB | 89.25 | 85.69/81.18 | 92.58 | 99.11 | 85.47 |
Original model + DropAttack-(I) | 89.76 | 85.75/81.33 | 93.35 | 99.16 | 86.05 |
Original model + DropAttack-(I&W) | 90.36 | 87.15/81.31 | 93.37 | 99.27 | 86.09 |
* Note that although PHEME is a two-class classification task, its labels are not balanced, so we use both accuracy and F1-score (accuracy/F1) as the evaluation criteria.
From Table 2 we can see that, compared with the original models without DropAttack, the models trained with DropAttack improve on all five datasets, and the improvements on the three NLP datasets are 2.24%, 3.07% and 1.50%, respectively. Compared with other regularization methods, adversarial training performs better overall. Among the adversarial methods, FGSM is relatively unstable: its performance on IMDB and CIFAR-10 is only 88.04% and 71.86%, respectively, because its naive perturbation values may destroy the distribution of the original data. Note that PGD, FreeAT and FreeLB are all PGD-based adversarial training methods, which require multiple forward and backward propagation iterations to calculate the optimal perturbation value; in the experiments, the number of forward-backward passes K is 3. Our method achieves state-of-the-art performance on all five datasets while calculating the perturbation value with only one backpropagation. (We study the influence of the number of forward-backward propagations on DropAttack later.) DropAttack-(I&W) outperforms DropAttack-(I) on all five datasets, which shows that the adversarial training on the weights is effective. In addition, based on the results in Table 2, our method yields larger improvements on the NLP datasets than on the CV datasets, which is consistent with the experimental results of other adversarial training studies (Cheng et al., 2019; Zhao et al., 2018; Zhu et al., 2020). We believe the reason is that, for images, the perturbation is added directly to the original pixel values, whereas for text the words themselves are not modified and the perturbation is added to the embedding vectors. The pixel values of an image are fixed, so the perturbation may change the distribution of the original sample; the word vectors of text, by contrast, are not unique and fixed, so it is more likely that better word vectors are learned after adding the perturbation.

To show more clearly the effect of DropAttack in preventing neural networks from overfitting, we ran classification experiments with several models of different architectures while keeping all hyperparameters, including $\epsilon$ and $p$, fixed. Figure 2 shows the training and validation accuracy obtained for these models (TextRNN, TextCNN and TextRCNN) as training progresses. The training accuracy under DropAttack training is basically the same as that under standard training, but the validation accuracy is higher, which indicates that DropAttack adversarial training alleviates model overfitting. Furthermore, DropAttack adversarial training may converge more slowly in the early stage, because the attacked objective is harder to optimize, but after enough weight updates the validation accuracy of the model is relatively more stable. The key point is that DropAttack gives an obvious improvement across all neural networks of different architectures, without hyperparameters tuned specifically for each architecture.
In addition, we study the effectiveness of DropAttack under training sets of different sizes. We divide the IMDB training set into different sizes and use an LSTM model with the same structure as above; the experimental results are shown in Table 3. Compared with standard training, the model trained with DropAttack improves by more than 2% on all eight training sets of different sizes. Furthermore, we find that DropAttack can match and even exceed the accuracy of standard training while using only half of the training data. For example, DropAttack reaches 83.88% with 2,500 training examples, whereas standard training with 5,000 examples reaches 82.62%.
Methods \ Size of the training set | 100 | 500 | 1000 | 2500 | 5000 | 10000 | 20000 | 40000
---|---|---|---|---|---|---|---|---
Standard Training | 63.26 | 74.26 | 78.30 | 81.14 | 82.62 | 84.92 | 85.42 | 88.12 |
DropAttack-3 Training | 65.46 | 76.34 | 80.70 | 83.88 | 85.22 | 87.02 | 88.86 | 90.42 |
Improvement | 2.20 | 2.08 | 2.40 | 2.66 | 2.60 | 2.10 | 3.42 | 2.30 |
Methods | IMDB | PHEME | AGnews | MNIST | CIFAR-10 |
---|---|---|---|---|---|
DropAttack-1 | 90.36 | 87.15/82.31 | 93.37 | 99.27 | 86.09 |
DropAttack-2 | 90.38 | 87.27/82.43 | 93.38 | 99.24 | 86.09 |
DropAttack-3 | 90.42 | 87.36/82.78 | 93.34 | 99.27 | 86.07 |
DropAttack-4 | 90.43 | 87.25/82.63 | 93.41 | 99.26 | 86.10 |
DropAttack-5 | 90.42 | 87.26/82.67 | 93.39 | 99.25 | 86.09 |
PGD-based DropAttack-K. We study the influence of the number of forward-backward propagations on DropAttack; the experimental results are shown in Table 4. Multiple iterative calculations can indeed further improve the generalization of the neural network, because multiple iterations are more likely to find the optimal perturbation value. However, more forward-backward propagations greatly increase the training time, so a reasonable number of iterations K should be selected based on the available time and computing resources.
5 Theoretical Analysis
We provide another theoretical perspective to explain why the adversarial training method DropAttack can act as a regularizer that improves the generalization of the model and prevents overfitting. According to Section 3, the task of DropAttack is to minimize the maximum internal adversarial risk, that is, to approximately optimize the following objective:
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{r_x\in\Omega} L(\theta, x + M_x \odot r_x, y) + \max_{r_\theta\in\Omega} L(\theta + M_\theta \odot r_\theta, x, y)\right] \tag{8}$$
For Equation (8), we Taylor-expand the functions $L(\theta, x + M_x \odot r_x, y)$ and $L(\theta + M_\theta \odot r_\theta, x, y)$ around the points $x$ and $\theta$, respectively:
$$\begin{aligned} L(\theta, x + M_x \odot r_x, y) &\approx L(\theta, x, y) + (M_x \odot r_x)^{\top}\nabla_x L(\theta, x, y) \\ L(\theta + M_\theta \odot r_\theta, x, y) &\approx L(\theta, x, y) + (M_\theta \odot r_\theta)^{\top}\nabla_\theta L(\theta, x, y) \end{aligned} \tag{9}$$
Then, substituting the perturbation values from Equation (6) that maximize the adversarial loss yields:
$$\max_{r_x} L(\theta, x + M_x \odot r_x, y) \approx L(\theta, x, y) + \epsilon_x \frac{\|M_x \odot g_x\|_2^2}{\|g_x\|_2}, \quad g_x = \nabla_x L(\theta, x, y) \tag{10}$$
$$\max_{r_\theta} L(\theta + M_\theta \odot r_\theta, x, y) \approx L(\theta, x, y) + \epsilon_\theta \frac{\|M_\theta \odot g_\theta\|_2^2}{\|g_\theta\|_2}, \quad g_\theta = \nabla_\theta L(\theta, x, y) \tag{11}$$
Summing the two branches, the objective of Equation (8) becomes
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[2L(\theta, x, y) + \epsilon_x \frac{\|M_x \odot g_x\|_2^2}{\|g_x\|_2} + \epsilon_\theta \frac{\|M_\theta \odot g_\theta\|_2^2}{\|g_\theta\|_2}\right] \tag{12}$$
which, after dividing by the constant factor of two, is equivalent to
$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[L(\theta, x, y) + \frac{\epsilon_x}{2}\,\frac{\|M_x \odot \nabla_x L(\theta,x,y)\|_2^2}{\|\nabla_x L(\theta,x,y)\|_2} + \frac{\epsilon_\theta}{2}\,\frac{\|M_\theta \odot \nabla_\theta L(\theta,x,y)\|_2^2}{\|\nabla_\theta L(\theta,x,y)\|_2}\right] \tag{13}$$
From the final optimization objective in Equation (13), we can see that every parameter update effectively adds to the loss an implicit gradient-penalty term on a randomly selected proportion of the input dimensions and of the weight parameters. This gradient penalty pushes the gradients of the attacked inputs and parameters toward zero, so that the model is more likely to be optimized toward a flatter minimum.
To further analyze the effectiveness of the proposed method visually, we plot the high-dimensional non-convex loss function using the visualization method proposed by Li et al. (2018). We visualize the loss landscapes around the minima of the empirical risk obtained by standard training and by DropAttack; the 2D visualizations are plotted in Figure 3 and the 3D visualizations in Figure 4. Additional loss visualizations are provided in the Appendix. We define two direction vectors, $\delta$ and $\eta$, with the same dimensions as $\theta$, drawn from a Gaussian distribution with zero mean and a scale of the same order of magnitude as the variance of the layer weights. We then choose a center point $\theta^*$ and add a linear combination of $\delta$ and $\eta$ to obtain a loss that is a function of the contributions of the two random directions, $f(\alpha, \beta) = L(\theta^* + \alpha\delta + \beta\eta)$. Finally, we define a grid of points on which to evaluate the loss, i.e., ranges of values for $\alpha$ and $\beta$ over which $f(\alpha, \beta)$ is evaluated and stored.
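The sketch below illustrates this visualization procedure (following Li et al., 2018): sample two random directions with per-layer rescaling, then evaluate the loss on a grid around the trained parameters $\theta^*$. It is a hedged sketch; `loss_on_test_set`, the grid span and the resolution are illustrative assumptions rather than details from the paper.

```python
import numpy as np
import torch

def loss_surface(model, loss_on_test_set, span=1.0, steps=21):
    center = [p.detach().clone() for p in model.parameters()]   # theta*
    directions = []
    for _ in range(2):                                          # two Gaussian directions delta, eta
        d = [torch.randn_like(p) for p in center]
        # Rescale each layer so the direction's norm matches the layer weight norm.
        directions.append([di * (ci.norm() / (di.norm() + 1e-12)) for di, ci in zip(d, center)])

    alphas = np.linspace(-span, span, steps)
    surface = np.zeros((steps, steps))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, c, d1, d2 in zip(model.parameters(), center, directions[0], directions[1]):
                    p.copy_(c + a * d1 + b * d2)        # theta* + a*delta + b*eta
                surface[i, j] = loss_on_test_set(model) # evaluate and store f(a, b)
        for p, c in zip(model.parameters(), center):    # restore the trained weights
            p.copy_(c)
    return alphas, surface
```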
The results show that the test loss becomes lower and flatter when training with DropAttack; DropAttack indeed selects flatter loss landscapes via masked adversarial perturbations. Many studies have shown that a flatter loss landscape usually implies better generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2019; Ishida et al., 2020).


6 Conclusion
In this work, we propose a masked weight adversarial training method, DropAttack, to improve the generalization ability of neural network models and prevent overfitting. The proposed algorithm uses gradient-based attacks on the input and weight parameters according to a certain probability, and enhances the generalization of the model by minimizing the resulting adversarial risk. Experimental results show that DropAttack effectively improves the generalization of models and prevents overfitting, especially in the field of NLP. In addition, we show theoretically that our algorithm regularizes the gradients of the model parameters. Therefore, DropAttack can improve both the robustness and the generalization of the model. Adversarial training still consumes more computing resources and time than standard stochastic gradient descent, so accelerating adversarial training while improving generalization is a valuable direction for future research.
References
- Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp. 274–283. PMLR, 2018.
- Cheng et al. (2019) Yong Cheng, Lu Jiang, and Wolfgang Macherey. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4324–4333, 2019.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Ishida et al. (2020) Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi Sugiyama. Do we need zero training loss after achieving zero training error? In International Conference on Machine Learning, pp. 4604–4614. PMLR, 2020.
- Keskar et al. (2019) Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, 2019.
- Krizhevsky (2009) A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
- Kurakin et al. (2017) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ICLR, 2017.
- LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6391–6401, 2018.
- Maas et al. (2011) Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150, 2011.
- Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. ICLR, 2017.
- Miyato et al. (2019) Takeru Miyato, Shin-Ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2019.
- Morgan & Bourlard (1989) Nelson Morgan and Hervé Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. Advances in neural information processing systems, 2:630–637, 1989.
- Schuster & Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
- Shafahi et al. (2019) Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3358–3369, 2019.
- Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- Tikhonov (1943) Andrey Nikolayevich Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pp. 195–198, 1943.
- Zhang et al. (2019) Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32:227–238, 2019.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28:649–657, 2015.
- Zhao et al. (2018) Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.
- Zhu et al. (2020) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. In ICLR, 2020.
- Zubiaga et al. (2016) Arkaitz Zubiaga, Maria Liakata, and Rob Procter. Learning reporting dynamics during breaking news for rumour detection in social media. arXiv preprint arXiv:1610.07363, 2016.
Appendix A Additional Experimental Details and Results
In order to prove the effectiveness of DropAttack, we conducted many experiments on five datasets: IMDB, PHEME, AGnews, MNIST, CIFAR-10. The detailed experimental settings and results are shown in Table 6, Table 7, Table 8, Table 9, Table 10. The square brackets after DropAttack indicate the object of perturbation. For example, DropAttack[Embedding, Lstm.ih.w] means adding perturbation to Embedding and Lstm.ih.w parameters.
Method | Accuracy (%) |
---|---|
LSTM | 88.12 |
LSTM + DropAttack[Embedding] ( e = 5, p = 0.5 ) | 89.76 |
LSTM + DropAttack[Embedding, Fc.w] ( e = 5, p = 0.5 ) | 88.60 |
LSTM + DropAttack[Lstm.hh.w] ( e = 5, p = 0.5 ) | 86.64 |
LSTM + DropAttack[Lstm.ih.w] ( e = 5, p = 0.5 ) | 89.80 |
LSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.5 ) | 87.90 |
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.5 ) | 90.21 |
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 3, p = 0.7 ) | 90.22 |
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.7 ) | 90.34 |
LSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 7, p = 0.7 ) | 90.36 |
LSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.5 ) | 89.56 |
LSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.6 ) | 89.57 |
Method | Accuracy/F1 score (%) |
---|---|
BiLSTM | 84.08/78.99 |
BiLSTM + DropAttack[Embedding] ( e = 5, p = 0.5 ) | 85.69/79.97 |
BiLSTM + DropAttack[Lstm.hh.w] ( e = 5, p = 0.5 ) | 86.00/80.03 |
BiLSTM + DropAttack[Lstm.ih.w] ( e = 5, p = 0.5 ) | 83.40/78.64 |
BiLSTM + DropAttack[Fc.w] ( e = 5, p = 0.5 ) | 84.60/79.32 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.5 ) | 87.14/81.02 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.6 ) | 87.14/81.04 |
BiLSTM + DropAttack[Embedding, Lstm.ih.w] ( e = 5, p = 0.5 ) | 86.74/80.13 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.7 ) | 87.15/81.31 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w] ( e = 5, p = 0.8 ) | 87.11/81.24 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.5 ) | 85.54/79.57 |
BiLSTM + DropAttack[Embedding, Lstm.hh.w, Lstm.ih.w] ( e = 5, p = 0.7 ) | 85.35/79.36 |
Method | Accuracy (%) |
---|---|
BiGRU | 91.87 |
BiGRU+ DropAttack[Embedding] ( e = 5, p = 0.5 ) | 93.35 |
BiGRU+ DropAttack[Embedding] ( e = 5, p = 0.7 ) | 93.34 |
BiGRU + DropAttack[Gru.hh.w] ( e = 5, p = 0.5 ) | 92.25 |
BiGRU + DropAttack[Gru.ih.w] ( e = 5, p = 0.5 ) | 92.46 |
BiGRU + DropAttack[Gru.hh.w, Gru.ih.w] ( e = 5, p = 0.5 ) | 92.70 |
BiGRU + DropAttack[Fc.w] ( e = 5, p = 0.5 ) | 92.24 |
BiGRU + DropAttack[Embedding, Gru.ih.w] ( e = 5, p = 0.5 ) | 93.12 |
BiGRU + DropAttack[Embedding, Gru.ih.w] ( e = 5, p = 0.7 ) | 93.37 |
BiGRU + DropAttack[Embedding, Gru.hh.w] ( e = 5, p = 0.5 ) | 92.70 |
BiGRU + DropAttack[Embedding, Gru.hh.w, Gru.ih.w] ( e = 5, p = 0.5 ) | 93.12 |
BiGRU + DropAttack[Embedding, Gru.ih.w, Gru.ih.w.reverse] ( e = 5, p = 0.5 ) | 92.88 |
Method | Accuracy (%) |
---|---|
LeNet-5 | 98.95 |
LeNet-5 + DropAttack[Input] ( e = 5, p = 0.5 ) | 99.16 |
LeNet-5 + DropAttack[Conv.1.w] ( e = 5, p = 0.5 ) | 99.08 |
LeNet-5 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.5 ) | 99.27 |
LeNet-5 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.7 ) | 99.25 |
LeNet-5 + DropAttack[Input, Conv.2.w] ( e = 5, p = 0.5 ) | 99.12 |
LeNet-5 + DropAttack[Input, Conv.1.b] ( e = 5, p = 0.5 ) | 99.10 |
LeNet-5 + DropAttack[Conv.2.w] ( e = 5, p = 0.5 ) | 99.11 |
LeNet-5 + DropAttack[Conv.1.b, Conv.2.w] ( e = 5, p = 0.5 ) | 99.09 |
LeNet-5 + DropAttack[Conv.1.w, Conv.2.w] ( e = 5, p = 0.5 ) | 98.78 |
LeNet-5 + DropAttack[Conv.1.w, Fc.1] ( e = 5, p = 0.5 ) | 98.93 |
LeNet-5 + DropAttack[Conv.2.w, Fc.1] ( e = 5, p = 0.5 ) | 99.10 |
LeNet-5 + DropAttack[Conv.2.w, Fc.2] ( e = 5, p = 0.5 ) | 99.05 |
LeNet-5 + DropAttack[Conv1.w, Conv2.w, fc1.w2] ( e = 5, p = 0.5 ) | 98.48 |
Method | Accuracy (%) |
---|---|
VGGNet-16 | 84.67 |
VGGNet-16 + DropAttack[Input] ( e = 5, p = 0.5 ) | 86.02 |
VGGNet-16 + DropAttack[Conv.1.w] ( e = 5, p = 0.5 ) | 83.61 |
VGGNet-16 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.5 ) | 86.09 |
VGGNet-16 + DropAttack[Input, Conv.1.w] ( e = 5, p = 0.7 ) | 86.02 |
VGGNet-16 + DropAttack[Input, Conv.3.w] ( e = 5, p = 0.7 ) | 85.13 |
VGGNet-16 + DropAttack[Input, Conv.5.w] ( e = 5, p = 0.7 ) | 85.13 |
VGGNet-16 + DropAttack[BatchNorm.1.w] ( e = 5, p = 0.5 ) | 85.51 |
VGGNet-16 + DropAttack[Conv.2.w] ( e = 5, p = 0.5 ) | 83.16 |
VGGNet-16 + DropAttack[Conv.6.w] ( e = 5, p = 0.5 ) | 85.27 |
VGGNet-16 + DropAttack[BatchNorm.8.w] ( e = 5, p = 0.5 ) | 85.32 |
VGGNet-16 + DropAttack[Conv.1.w, BatchNorm.1.w] ( e = 5, p = 0.5 ) | 84.41 |
VGGNet-16 + DropAttack[Input, Conv.1.w, BatchNorm.1.w] ( e = 5, p = 0.5 ) | 85.01 |
An important discussion and research question is: in addition to perturbing the input layer, which layers' weight parameters are the most beneficial to perturb? According to our experimental results and experience, for perturbations of hidden-layer parameters, perturbing layers close to the input works better than perturbing layers close to the output. Essentially, perturbing a hidden layer perturbs a higher-dimensional embedding of the input. Due to the high degree of linearity in neural networks, small changes in the input vector may cause changes in the outputs of many layers. Therefore, the deeper weight parameters need to be robust enough to resist overfitting and to avoid fitting overly sensitive input features.
Appendix B Hyperparameter sensitivity analysis
Attack probability* \ Perturbation coefficient* | ε = 0.01 | ε = 0.1 | ε = 1 | ε = 3 | ε = 5 | ε = 7 | ε = 9
---|---|---|---|---|---|---|---
P = 0 | 88.12 | 88.12 | 88.12 | 88.12 | 88.12 | 88.12 | 88.12
P = 0.1 | 89.60 | 89.78 | 89.40 | 88.74 | 89.30 | 88.88 | 89.26
P = 0.3 | 89.98 | 90.12 | 90.02 | 90.10 | 90.04 | 90.28 | 89.74
P = 0.5 | 90.13 | 90.16 | 90.01 | 90.25 | 90.21 | 90.30 | 89.94
P = 0.7 | 90.17 | 90.32 | 90.18 | 90.20 | 90.18 | 90.36 | 90.14
P = 0.9 | 89.76 | 90.22 | 90.16 | 89.22 | 90.12 | 90.04 | 90.18
P = 1 | 89.86 | 89.54 | 89.74 | 89.40 | 90.02 | 89.90 | 90.10
* Note that $\epsilon_x$ and $\epsilon_\theta$ are uniformly denoted by $\epsilon$, and $p_x$ and $p_\theta$ are uniformly denoted by $p$.
DropAttack has three tunable hyperparameters: the perturbation coefficient $\epsilon$, the attack probability $p$ (the probability of attacking a weight parameter in the network), and the number of forward-backward propagations $K$. We explore the effect of varying these hyperparameters. First, we fixed the value of $K$ to 1 and let $\epsilon$ take the values [0.01, 0.1, 1, 3, 5, 7, 9] and $p$ the values [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1] in turn, giving a total of 7 x 7 = 49 hyperparameter combinations; the experimental results are shown in Table 10. The best performance is 90.36%, obtained when $\epsilon$ = 7 and $p$ = 0.7. When $p$ = 0, the model is trained in the standard way without any attack, and its accuracy is the lowest. We also find that the effect is significantly worse when $p$ is less than 0.3 or greater than 0.9. When $p$ = 1, random masking is not used; the performance still improves, but only slightly, because the attack combinations lack diversity. Based on these results, we suggest choosing $p$ between 0.5 and 0.7.
As shown in Figure 5, we study the impact of different attack probabilities on model performance under different perturbation coefficients. As the attack probability increases from 0 to 0.7, the performance of the model increases, because the attack on the model becomes stronger. However, as the attack probability increases from 0.7 to 1, performance decreases instead, because an excessively high attack probability reduces the diversity of attack combinations; when $p$ = 1, the method degenerates approximately into a standard adversarial attack.

Appendix C Additional Loss Visualization
We visualize the test loss function landscapes of the standard training and DropAttack adversarial training models separately. The 2D and 3D visualization results are shown in Figure 6 and Figure 7, respectively. The structure and parameters of the models are derived from Section 4.2.

