GNP Attack: Transferable Adversarial Examples via Gradient Norm Penalty
Abstract
Adversarial examples (AE) with good transferability enable practical black-box attacks on diverse target models, where insider knowledge about the target models is not required. Previous methods often generate AE with no or very limited transferability; that is, they easily overfit to the particular architecture and feature representation of the source, white-box model and the generated AE barely work for target, black-box models. In this paper, we propose a novel approach to enhance AE transferability using Gradient Norm Penalty (GNP). It drives the loss function optimization procedure to converge to a flat region of local optima in the loss landscape. By attacking 11 state-of-the-art (SOTA) deep learning models and 6 advanced defense methods, we empirically show that GNP is very effective in generating AE with high transferability. We also demonstrate that it is very flexible in that it can be easily integrated with other gradient based methods for stronger transfer-based attacks.
Index Terms— Adversarial machine learning, Transferability, Deep neural networks, Input gradient regularization
1 Introduction
Deep Neural Networks (DNNs) are the workhorse of a broad variety of computer vision tasks but are vulnerable to adversarial examples (AE), which are data samples (typically images) perturbed by human-imperceptible noise that nevertheless cause misclassifications. This lack of adversarial robustness curtails, and often prevents, the deployment of deep learning models in security- or safety-critical domains such as healthcare, neuroscience, finance, and self-driving cars.
Adversarial examples are commonly studied under two settings: white-box and black-box attacks. In the white-box setting, adversaries have full knowledge of victim models, including model structures, parameters and weights, and the loss functions used to train them. Therefore, they can directly obtain the gradients of the victim models and craft adversarial examples that drive the loss function toward incorrect predictions. White-box attacks are important for evaluating and developing robust models and serve as the backend of many black-box attacks, but they are limited in use because they require knowing the internal details of target models. In the black-box setting, adversaries need no specific knowledge about victim models other than their external properties (the types of input and output). Two types of approaches, query-based and transfer-based, are commonly studied for black-box attacks. The query-based approach estimates the gradients of a victim model by querying it with a large number of input samples and inspecting the outputs; because of the large number of queries, it can be easily detected and defended against. The transfer-based approach uses surrogate models to generate transferable AE that can attack a range of models rather than a single victim model, and is hence a more attractive approach to black-box attacks.
This paper takes the second approach and focuses on designing a new and effective method to improve the transferability of AE. Several directions for boosting adversarial transferability have been explored. Dong et al. [1] proposed momentum-based methods. The attention-guided transfer attack (ATA) [2] uses attention maps to identify common features for attacking. The Diverse Input Method (DIM) [3] averages gradients over augmented images. Liu et al. [4] generate transferable AE using an ensemble of multiple models.
Despite these efforts, a large gap in attack success rate remains between the transfer-based setting and the ideal white-box setting. In this paper, we propose a novel method to boost adversarial transferability from an optimization perspective. Inspired by the concept of “flat minima” in optimization theory [5], which improves the generalization of DNNs, we seek to generate AE that lie in flat regions where the input gradient norm is small, so that they “generalize” to victim models other than the one they were generated on. In a nutshell, this work makes the following contributions:
- We propose a transfer-based black-box attack from a new perspective that seeks AE in a flat region of the loss landscape by penalizing the input gradient norm.
- We show that our method, the input gradient norm penalty (GNP), can significantly boost adversarial transferability for a wide range of deep networks.
- We demonstrate that GNP can be easily integrated with existing transfer-based attacks to produce even better performance, indicating a highly desirable flexibility.
[Fig. 1: Illustration of an AE at a sharp local maximum vs. one in a flat maximum region of the loss landscape; an AE in a flat region transfers better across models.]
2 Method
Given a classification model $f$ that outputs a label $f(x)$ as the prediction for an input $x$, we aim to craft an adversarial example $x^{adv}$ that is visually indistinguishable from $x$ but is misclassified by the classifier, i.e., $f(x^{adv}) \neq y$, where $y$ is the true label. The generation of AE can be formulated as the following optimization problem:
$\max_{x^{adv}} J(x^{adv}, y), \quad \text{s.t. } \left\| x^{adv} - x \right\|_p \leq \epsilon \qquad (1)$
where the loss function $J$ is often the cross-entropy loss, and the $\ell_p$-norm measures the discrepancy between $x^{adv}$ and $x$. In this work we use $p = \infty$, which is commonly adopted in the literature. Optimizing Eq. (1) requires the gradient of the loss function of the victim model, which is not available in the black-box setting. Therefore, we aim to create transferable AE on a source model that can also attack many other target models.
We develop a new method to boost adversarial transferability from a perspective inspired by “flat optima” in optimization theory (see Fig. 1). If an AE is located at a sharp local maximum, it will be sensitive to differences in decision boundaries between the source model and target models. In contrast, if it is located in a flat maximum region, it is much more likely to yield a similarly high loss on other models, which is the desired behavior.
Thus, we seek to generate AE in flat regions. To this end, we introduce a gradient norm penalty (GNP) term into the loss function, which penalizes the norm of the gradient of the loss with respect to the input. Flat regions are characterized by small gradient norms, so penalizing the gradient norm encourages the optimizer to find an AE that lies in a flat region. This enhances adversarial transferability because a minor shift of the decision boundary will not significantly change the loss value (prior work has shown that different networks often share similar decision boundaries).
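To make this intuition concrete, the following first-order bound (a standard argument we add for clarity, not an equation from the original text) shows why a small input gradient norm keeps the attained loss stable within a neighborhood of the AE:

```latex
% For a small displacement \delta around x^{adv} (e.g., the effective change
% induced by a slightly different decision boundary of another target model):
J(x^{adv} + \delta,\, y) - J(x^{adv},\, y)
  \;\approx\; \nabla_x J(x^{adv},\, y)^{\top} \delta
  \;\leq\; \left\| \nabla_x J(x^{adv},\, y) \right\|_2 \, \| \delta \|_2
% Hence, penalizing the input gradient norm keeps the loss high under such shifts.
```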
Algorithm 1: GNP attack (integrated with I-FGSM)
Input: A clean sample $x$ with ground-truth label $y$; source model $f$ with loss function $J$;
Input: Perturbation size $\epsilon$; maximum iterations $T$; step length $r$; regularization coefficient $\beta$
Output: A transferable AE $x^{adv}$
2.1 Baseline Attacks
GNP is a very flexible method in that it can be easily incorporated into any existing gradient-based method to boost its strength. We consider the following gradient-based attacks to demonstrate the effect of GNP. In Section 3, we will also show that GNP works effectively with state-of-the-art transfer-based attacks.
Fast Gradient Sign Method (FGSM). FGSM [6] is the first gradient-based attack which crafts an AE by attempting to maximize the loss function with a one-step update:
$x^{adv} = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x J(x, y) \right) \qquad (2)$
where $\nabla_x J(x, y)$ is the gradient of the loss function with respect to $x$, and $\mathrm{sign}(\cdot)$ denotes the sign function.
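As a concrete illustration, below is a minimal PyTorch sketch of the FGSM update in Eq. (2). The classifier interface (a `model` returning logits) and the use of cross-entropy as $J$ are our assumptions, not code from the paper:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM (Eq. 2): x_adv = x + eps * sign(grad_x J(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(x, y) on the source model
    grad = torch.autograd.grad(loss, x)[0]   # gradient w.r.t. the input
    x_adv = x + eps * grad.sign()            # single sign-gradient step
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in [0, 1]
```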
Iterative Fast Gradient Sign Method (I-FGSM). I-FGSM extends FGSM to an iterative version:
$x_0^{adv} = x, \qquad x_{t+1}^{adv} = \mathrm{Clip}_x^{\epsilon}\!\left\{ x_t^{adv} + \alpha \cdot \mathrm{sign}\left( \nabla_x J(x_t^{adv}, y) \right) \right\} \qquad (3)$
where $\alpha$ is a small step size, $T$ is the number of iterations, and $\mathrm{Clip}_x^{\epsilon}\{\cdot\}$ projects its argument back into the $\epsilon$-ball around $x$.
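Under the same assumptions (and imports) as the FGSM sketch above, the iterative version of Eq. (3) repeats small sign-gradient steps and clips the result back into the $\epsilon$-ball around the clean input:

```python
def i_fgsm(model, x, y, eps, alpha, T):
    """I-FGSM (Eq. 3): iterative sign-gradient steps projected into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # Clip_x^eps
        x_adv = x_adv.clamp(0.0, 1.0)                          # stay in valid pixel range
    return x_adv.detach()
```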
Table 1: Attack success rates of I-FGSM and MI-FGSM, with and without GNP, against 11 models under three perturbation budgets ε (*: white-box source model).

| Method | ε | ResNet50* | VGG19 | ResNet152 | Inc v3 | DenseNet | MobileNet v2 | SENet | ResNeXt | WRN | PNASNet | MNASNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I-FGSM | 16/255 | 100.00% | 61.50% | 52.82% | 30.86% | 57.36% | 58.92% | 38.12% | 48.88% | 48.92% | 28.92% | 57.20% | 48.35% |
| | 8/255 | 100.00% | 38.90% | 29.36% | 15.36% | 34.86% | 37.66% | 17.76% | 26.30% | 26.26% | 13.04% | 35.08% | 27.46% |
| | 4/255 | 100.00% | 18.86% | 11.28% | 6.66% | 15.44% | 18.36% | 5.72% | 9.58% | 9.98% | 4.14% | 17.02% | 11.70% |
| I-FGSM+GNP | 16/255 | 100.00% | 75.96% | 68.89% | 48.23% | 73.68% | 74.05% | 55.46% | 62.36% | 70.60% | 45.06% | 76.98% | 67.12% |
| | 8/255 | 99.96% | 68.56% | 60.65% | 38.58% | 62.05% | 63.23% | 43.69% | 50.36% | 59.32% | 33.62% | 60.28% | 53.97% |
| | 4/255 | 99.98% | 25.96% | 22.35% | 15.86% | 26.89% | 28.66% | 15.62% | 21.93% | 23.06% | 13.69% | 30.21% | 22.38% |
| MI-FGSM | 16/255 | 100.00% | 73.01% | 67.62% | 47.51% | 73.16% | 72.42% | 54.53% | 61.78% | 60.96% | 44.10% | 71.46% | 62.75% |
| | 8/255 | 100.00% | 52.50% | 41.52% | 25.56% | 47.25% | 48.96% | 28.06% | 35.81% | 37.56% | 20.41% | 47.62% | 38.53% |
| | 4/255 | 99.94% | 25.74% | 16.68% | 9.95% | 22.54% | 24.89% | 9.56% | 14.20% | 15.38% | 7.23% | 23.27% | 16.94% |
| MI-FGSM+GNP | 16/255 | 100.00% | 89.65% | 83.69% | 65.86% | 87.96% | 90.06% | 69.74% | 79.12% | 77.36% | 58.60% | 88.25% | 79.04% |
| | 8/255 | 99.91% | 65.28% | 55.63% | 39.69% | 61.42% | 63.26% | 42.03% | 48.65% | 51.07% | 35.03% | 58.93% | 52.20% |
| | 4/255 | 100.00% | 39.62% | 33.25% | 15.62% | 37.96% | 40.04% | 20.35% | 30.27% | 30.05% | 15.23% | 37.92% | 30.03% |
Momentum Iterative Fast Gradient Sign Method (MI-FGSM). MI-FGSM [1] integrates a momentum term into I-FGSM and improves transferability by a large margin:
$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J(x_t^{adv}, y)}{\left\| \nabla_x J(x_t^{adv}, y) \right\|_1}, \qquad x_{t+1}^{adv} = \mathrm{Clip}_x^{\epsilon}\!\left\{ x_t^{adv} + \alpha \cdot \mathrm{sign}(g_{t+1}) \right\} \qquad (4)$
where $g_0 = 0$ and $\mu$ is a decay factor.
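Again under the same assumptions as the sketches above, the momentum variant of Eq. (4) only changes how the update direction is accumulated (image tensors are assumed to be in NCHW layout):

```python
def mi_fgsm(model, x, y, eps, alpha, T, mu=1.0):
    """MI-FGSM (Eq. 4): accumulate L1-normalized gradients with momentum mu."""
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)                                      # g_0 = 0
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        l1 = grad.abs().flatten(1).sum(dim=1).view(-1, 1, 1, 1)  # per-sample L1 norm
        g = mu * g + grad / l1                                   # momentum accumulation
        x_adv = x_adv.detach() + alpha * g.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv.detach()
```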
2.2 GNP Attack
As explained above, we aim to guide the optimization of the loss function toward a flat region of local optima. To this end, we introduce GNP to penalize large gradient norms:
$J_{reg}(x, y) = J(x, y) - \beta \left\| \nabla_x J(x, y) \right\|_2 \qquad (5)$
where $J(x, y)$ is the original loss function of the source model, and the second term is our GNP regularizer, which encourages a small gradient norm when searching for local maxima.
For gradient-based attacks (e.g., FGSM, I-FGSM, MI-FGSM), we need to calculate the gradient of the new loss (5). To simplify notation, we omit $y$ in the loss function since we are calculating the gradient with respect to $x$. Using the chain rule, we have
$\nabla_x J_{reg}(x) = \nabla_x J(x) - \beta \, \nabla_x \left\| \nabla_x J(x) \right\|_2 = \nabla_x J(x) - \beta \, \frac{\nabla_x^2 J(x)\, \nabla_x J(x)}{\left\| \nabla_x J(x) \right\|_2} \qquad (6)$
This equation involves the Hessian matrix $\nabla_x^2 J(x)$, which is often infeasible to compute because of the curse of dimensionality (such a Hessian in DNNs is too large due to the high input dimension). Therefore, we take a first-order Taylor expansion together with the finite difference method (FDM) to approximate the following gradient:
$\nabla_x J(x + r\, g) \approx \nabla_x J(x) + r\, \nabla_x^2 J(x)\, g \qquad (7)$
where $g = \nabla_x J(x) / \left\| \nabla_x J(x) \right\|_2$, and $r$ is the step length that controls the neighborhood size. Thus we obtain the regularization term of (6) as:
$\frac{\nabla_x^2 J(x)\, \nabla_x J(x)}{\left\| \nabla_x J(x) \right\|_2} = \nabla_x^2 J(x)\, g \approx \frac{\nabla_x J(x + r\, g) - \nabla_x J(x)}{r} \qquad (8)$
Inserting (8) back into (6), we obtain the gradient of the regularized loss function as:
$\nabla_x J_{reg}(x) \approx \left( 1 + \frac{\beta}{r} \right) \nabla_x J(x) - \frac{\beta}{r}\, \nabla_x J(x + r\, g) \qquad (9)$
where $\beta$ is the regularization coefficient from (5). We summarize how GNP is integrated into I-FGSM in Algorithm 1; I-FGSM can be replaced by any gradient-based attack.
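For concreteness, the sketch below implements the gradient approximation of Eq. (9) and plugs it into an I-FGSM loop, i.e., a rough rendering of Algorithm 1 under the same assumptions as the earlier snippets; the helper names are ours, not from the paper:

```python
def gnp_gradient(model, x_adv, y, r, beta):
    """Gradient of the GNP-regularized loss, approximated as in Eq. (9):
    (1 + beta/r) * grad J(x) - (beta/r) * grad J(x + r * g),
    where g is the unit input gradient (finite-difference Hessian-vector product)."""
    x1 = x_adv.clone().detach().requires_grad_(True)
    g1 = torch.autograd.grad(F.cross_entropy(model(x1), y), x1)[0]        # grad J(x)
    g_unit = g1 / (g1.flatten(1).norm(p=2, dim=1).view(-1, 1, 1, 1) + 1e-12)
    x2 = (x_adv + r * g_unit).detach().requires_grad_(True)               # neighbor x + r*g
    g2 = torch.autograd.grad(F.cross_entropy(model(x2), y), x2)[0]        # grad J(x + r*g)
    return (1.0 + beta / r) * g1 - (beta / r) * g2

def i_fgsm_gnp(model, x, y, eps, alpha, T, r, beta):
    """I-FGSM driven by the GNP-regularized gradient (sketch of Algorithm 1)."""
    x_adv = x.clone().detach()
    for _ in range(T):
        grad = gnp_gradient(model, x_adv, y, r, beta)                     # Eq. (9)
        x_adv = x_adv + alpha * grad.sign()                               # sign-gradient step
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv
```

Swapping the update rule of `i_fgsm_gnp` for the MI-FGSM or DIM/TIM update yields the GNP-augmented variants evaluated in Section 3.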
Table 2: Attack success rates of DIM and TIM, with and without GNP, against 11 models under three perturbation budgets ε (*: white-box source model).

| Method | ε | ResNet50* | VGG19 | ResNet152 | Inc v3 | DenseNet | MobileNet v2 | SENet | ResNeXt | WRN | PNASNet | MNASNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DIM | 16/255 | 100.00% | 93.70% | 93.62% | 72.96% | 94.32% | 91.68% | 79.41% | 91.65% | 91.17% | 76.34% | 89.07% | 87.47% |
| | 8/255 | 100.00% | 74.01% | 71.32% | 40.58% | 74.65% | 71.63% | 44.32% | 63.38% | 64.32% | 40.29% | 67.27% | 61.28% |
| | 4/255 | 100.00% | 39.21% | 31.65% | 15.93% | 38.35% | 36.74% | 15.42% | 25.53% | 28.68% | 12.40% | 33.56% | 27.76% |
| DIM+GNP | 16/255 | 100.00% | 96.49% | 97.38% | 76.89% | 97.86% | 95.73% | 84.56% | 95.38% | 96.04% | 81.69% | 93.51% | 91.95% |
| | 8/255 | 100.00% | 85.63% | 84.21% | 49.65% | 85.32% | 80.59% | 56.24% | 72.39% | 75.52% | 51.68% | 78.16% | 72.24% |
| | 4/255 | 100.00% | 51.36% | 45.69% | 27.96% | 51.39% | 49.29% | 28.13% | 40.08% | 39.64% | 25.97% | 45.23% | 40.96% |
| TIM | 16/255 | 100.00% | 79.90% | 76.28% | 54.41% | 85.42% | 77.68% | 55.02% | 74.15% | 73.86% | 62.07% | 74.38% | 73.34% |
| | 8/255 | 100.00% | 54.91% | 44.76% | 28.29% | 58.17% | 51.02% | 24.16% | 41.70% | 46.08% | 29.05% | 48.92% | 41.71% |
| | 4/255 | 99.92% | 24.31% | 17.23% | 12.67% | 28.42% | 23.24% | 6.56% | 15.03% | 18.25% | 9.94% | 22.76% | 18.95% |
| TIM+GNP | 16/255 | 100.00% | 93.61% | 90.39% | 68.43% | 96.89% | 91.23% | 69.01% | 87.32% | 84.69% | 76.25% | 85.39% | 84.30% |
| | 8/255 | 100.00% | 70.03% | 61.29% | 45.12% | 71.35% | 66.23% | 41.03% | 55.46% | 60.12% | 46.20% | 62.97% | 57.99% |
| | 4/255 | 100.00% | 35.96% | 35.03% | 25.16% | 43.17% | 36.95% | 20.36% | 30.31% | 32.01% | 23.68% | 39.05% | 32.27% |
3 Experiments
3.1 Experiment Setup
Dataset and models. We randomly sample 5,000 images from the ImageNet [7] validation set that are correctly classified by all the models considered. We consider 11 SOTA DNN-based image classifiers: ResNet50 [8], VGG19 [9], ResNet152 [8], Inc v3 [10], DenseNet [11], MobileNet v2 [12], SENet [13], ResNeXt [14], WRN [15], PNASNet [16], and MNASNet [17]. Following [18], we choose ResNet50 as the source model and the remaining 10 models as target models.
Implementation Details. In all experiments, the pixel values of images are scaled to [0, 1]. The adversarial perturbation is bounded under the $\ell_\infty$ norm with three budgets, $\epsilon \in \{4/255, 8/255, 16/255\}$. The step length $r$ and the regularization coefficient $\beta$ are kept fixed across attacks (their effects are examined in the ablation study in Section 3.3). We run 100 iterations for all attacks and report the misclassification rate of the target model as the attack success rate.
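For clarity, the evaluation metric can be computed as in the short sketch below (our formulation of this standard metric; the paper provides no code): the attack success rate is the fraction of AE, crafted on the source model, that a given target model misclassifies.

```python
import torch

@torch.no_grad()
def attack_success_rate(target_model, x_adv, y):
    """ASR: fraction of adversarial examples misclassified by a target model."""
    preds = target_model(x_adv).argmax(dim=1)   # predicted labels on the AE
    return (preds != y).float().mean().item()   # higher means a stronger attack
```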
Table 3: Attack success rates against three ensemble adversarially trained models and three advanced defenses (AE crafted on ResNet50 with DIM+TIM as the backbone).

| Source model | Attack | Inc-v3ens3 | Inc-v3ens4 | IncRes-v2ens1 | JPEG | R&P | NRP |
|---|---|---|---|---|---|---|---|
| ResNet50 | DIM+TIM | 52.13% | 48.79% | 38.96% | 54.85% | 49.75% | 39.44% |
| | DIM+TIM+GNP | 65.69% | 63.16% | 52.89% | 66.31% | 62.04% | 52.81% |
3.2 Experimental Results
3.2.1 Integration with baseline attacks
We first evaluate the performance of GNP by integrating it with the baseline attacks I-FGSM and MI-FGSM. The results are shown in Table 1. We use a pre-trained ResNet50 as the source model and evaluate the attack success rate (ASR) of the generated AE on a variety of target models under different perturbation budgets $\epsilon$. GNP achieves significant and consistent improvement in all cases. For instance, taking the average ASR over the 10 target models under $\epsilon = 8/255$, GNP outperforms I-FGSM and MI-FGSM by 26.51% and 13.67%, respectively. Moreover, the improvement on a single target model can be as large as 33.06% (WRN, $\epsilon = 8/255$).
3.2.2 Integration with existing transfer-based attacks
We also evaluate the effectiveness of GNP when incorporated into existing transfer-based attacks, namely DIM [3] and TIM [19]. The results in Table 2 show that DIM+GNP and TIM+GNP are clear winners over DIM and TIM, respectively. Specifically, DIM+GNP achieves an average success rate of 91.95% over the 10 target models under $\epsilon = 16/255$, and TIM+GNP outperforms TIM by a large margin of 16.28% under $\epsilon = 8/255$. We only present the integration of GNP with these two typical methods here, but our method also applies to other, more powerful gradient-based attacks.
3.2.3 Attacking “secured” models
For a more thorough evaluation, we also investigate how GNP performs when attacking defended models, which are much harder to attack. We choose three advanced defense methods, namely JPEG [20], R&P [21] and NRP [22]. In addition, we attack three ensemble adversarially trained (AT) models, which are even harder to attack than regular AT models: Inc-v3ens3, Inc-v3ens4 and IncRes-v2ens1 [23]. We craft AE on the ResNet50 surrogate model and use DIM+TIM as the “backbone” to which GNP is applied. The results are presented in Table 3: GNP again boosts ASR significantly against the six “secured” models, achieving consistent improvements of 11.46–14.37%.
[Fig. 2: Ablation results: average ASR of I-FGSM + GNP over the 10 target models for different step lengths r and regularization coefficients β.]
3.3 Ablation Study
We conduct an ablation study on the two hyper-parameters of the proposed GNP attack, i.e., the step length $r$ and the regularization coefficient $\beta$. Since $r$ represents the radius of the neighborhood around the current AE that should be flat, a larger $r$ is preferred; on the other hand, setting it too large increases the approximation error of the Taylor expansion and thus misleads the AE update direction. The coefficient $\beta$ balances the goal of fooling the surrogate model against that of finding flat optima. Fig. 2 reports the results of our ablation study, where ASR is averaged over the 10 target models (excluding the source ResNet50) attacked by I-FGSM + GNP. We observe that adding the GNP regularization term clearly improves performance (compared to $\beta = 0$), and the performance gain is fairly consistent for $\beta$ in the wide range of 0.6–1.6. The step length $r$ does not affect the performance gain much either. Thus, the ablation study reveals that GNP is not sensitive to its hyper-parameters and works well under a variety of conditions.
4 Conclusion
In this paper, we have proposed a new method for improving the transferability of AE from an optimization perspective, by seeking AE located at flat optima. We achieve this by introducing an input gradient norm penalty (GNP), which guides the AE search toward flat regions of the loss landscape. GNP is very flexible in that it can be used with any gradient-based AE generation method. Our comprehensive experimental study demonstrates that it boosts the transferability of AE significantly.
This paper focuses on untargeted attacks, but GNP can be easily applied to targeted attacks as well by making a small change to the loss function; we leave a thorough investigation to future work.
References
- [1] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9185–9193.
- [2] Weibin Wu, Yuxin Su, Xixian Chen, Shenglin Zhao, Irwin King, Michael R Lyu, and Yu-Wing Tai, “Boosting the transferability of adversarial samples via attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1161–1170.
- [3] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille, “Improving transferability of adversarial examples with input diversity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2730–2739.
- [4] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song, “Delving into transferable adversarial examples and black-box attacks,” International Conference on Learning Representations, 2017.
- [5] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in International Conference on Learning Representations, 2021.
- [6] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” International Conference on Learning Representations, 2015.
- [7] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [9] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- [10] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
- [11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [12] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018.
- [13] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
- [14] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017.
- [15] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” in BMVC, 2016.
- [16] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy, “Progressive neural architecture search,” in ECCV, 2018.
- [17] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in CVPR, 2019.
- [18] Qizhang Li, Yiwen Guo, and Hao Chen, “Yet another intermediate-level attack,” in ECCV, 2020.
- [19] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu, “Evading defenses to transferable adversarial examples by translation-invariant attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4312–4321.
- [20] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens Van Der Maaten, “Countering adversarial images using input transformations,” International Conference on Learning Representations, 2018.
- [21] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille, “Mitigating adversarial effects through randomization,” International Conference on Learning Representations, 2018.
- [22] Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli, “A self-supervised approach for adversarial robustness,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 262–271.
- [23] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel, “Ensemble adversarial training: Attacks and defenses,” International Conference on Learning Representations, 2018.