Vanilla Feature Distillation for Improving the Accuracy-Robustness Trade-Off in
Adversarial Training
Abstract
Adversarial training has been widely explored for mitigating attacks against deep models. However, most existing works are still trapped in the dilemma between higher accuracy and stronger robustness, since they tend to fit a model towards robust features (not easily tampered with by adversaries) while ignoring those non-robust but highly predictive features. To achieve a better robustness-accuracy trade-off, we propose Vanilla Feature Distillation Adversarial Training (VFD-Adv), which conducts knowledge distillation from a pre-trained model (optimized towards high accuracy) to guide adversarial training towards higher accuracy, i.e., preserving those non-robust but predictive features. More specifically, both adversarial examples and their clean counterparts are forced to be aligned in the feature space by distilling predictive representations from the pre-trained/clean model, whereas previous works barely utilize predictive features from clean models. Therefore, the adversarial training model is updated towards maximally preserving accuracy while gaining robustness. A key advantage of our method is that it can be universally adapted to, and boost, existing works. Exhaustive experiments on various datasets, classification models, and adversarial training algorithms demonstrate the effectiveness of our proposed method.
1 Introduction
Deep neural networks (DNNs) have widely dominated various tasks, e.g., image classification[11, 16], object detection[22, 26], autonomous driving[18], etc. However, DNNs are known to be vulnerable to adversarial examples generated by overlaying carefully designed perturbations onto original/clean examples[10, 17, 20]. A. Ilyas et al.[13] gave an insight into adversarial examples: rather than being "bugs", one explanation is that adversarial examples arise from non-robust features tampered with by small perturbations so that they indicate a wrong class.
Against such adversarial attacks, adversarial training has been explored to improve the robustness of DNNs, typically by feeding adversarial examples to the model during the training stage[1, 4, 9]. Generally speaking, adversarial training is formulated as a min-max optimization problem, where perturbations are generated to maximize the original loss, and then the model is optimized against the perturbations/attacks by minimizing the loss. A model becomes immune to adversarial attacks by being forced to focus on robust features, which robustly correlate adversarial examples with their true labels. Although these methods have successfully improved the robustness of deep models, they are mostly trapped in the dilemma between gaining robustness and preserving high accuracy. Recent studies[5, 7, 25] found that knowledge distillation can further improve adversarial training. By resorting to third-party models, these methods learn a more reasonable decision boundary, achieving higher accuracy. However, by greedily fitting a model towards adversarial examples, they ignore predictive features from clean examples, which limits model accuracy. Therefore, it is imperative to exploit predictive features to mitigate accuracy degradation in adversarial training.
In this paper, we propose Vanilla Feature Distillation Adversarial Training (VFD-Adv) to achieve a better trade-off between accuracy and robustness by distilling predictive representations from a highly accurate vanilla model. The basic idea is that preserving those non-robust but predictive features in adversarial training mitigates accuracy degradation. To this end, we constrain the latent representations of the adversarial training model to be close to those of the vanilla model. Please note that the vanilla model is always fed clean examples, while the adversarial training model accepts both clean and adversarial examples. In this way, those non-robust but predictive features are preserved, achieving high accuracy and robustness at the same time in adversarial training. Exhaustive experiments on various datasets, classification models, and adversarial training algorithms demonstrate the effectiveness of our proposed method. In summary, our main contributions are three-fold:
- To the best of our knowledge, we give the first attempt to introduce vanilla feature distillation into adversarial training, which achieves a better trade-off between accuracy and robustness.
- The proposed vanilla feature distillation adversarial training could be a universally adaptable plug-in for existing related methods to boost accuracy and robustness.
- Extensive experiments on diverse classification models, datasets, and adversarial training methods demonstrate the superior performance of our method in terms of accuracy and robustness.
2 Related Works
2.1 Adversarial Attack
Since adversarial attacks were first introduced in [10], many attack algorithms have been proposed to analyze the vulnerability of deep neural networks. This line of research can be roughly divided into two categories according to the knowledge of the adversary, i.e., white-box attacks and black-box attacks. Some studies focus on white-box attacks, where the adversary has full access to the parameters of the target model. With full knowledge of the target model, attacks can be conducted with the Fast Gradient Sign Method (FGSM)[10], the Momentum Iterative Method (MIM)[8], Projected Gradient Descent (PGD)[20], and many other methods[28]. Other works focus on black-box attacks, in which the adversary cannot access the target model[19]. In this case, the adversary exploits the transferability of adversarial examples, i.e., adversarial examples crafted on one model can effectively attack another model. Without knowing the parameters of the target model, attacks such as the Carlini-Wagner attack (CW)[3], GAP[24], and the Feature Importance-aware Attack (FIA)[31] can thus be mounted via surrogate models[2, 23]. In this paper, we aim to help the target model achieve a better trade-off between accuracy and robustness under both white-box and black-box attacks.
2.2 Adversarial Training
In order to defend against adversarial attacks, adversarial training[20, 30, 34] was proposed to improve the robustness of a model by augmenting the training dataset with adversarial examples during training. Zhang et al.[34] theoretically proved a trade-off between robustness and accuracy in adversarial training and proposed TRADES to trade adversarial robustness off against accuracy. Based on TRADES, Wang et al.[30] investigated the distinct influence of misclassified and correctly classified examples and proposed Misclassification Aware adveRsarial Training (MART). These methods have successfully enhanced the robustness of deep models, but neither of them provides a model with both high robustness and high accuracy. Recently, some researchers found that adversarial training can be improved with knowledge distillation[5, 7, 25]. Helper-based Adversarial Training (HAT)[25] and Learnable Boundary Guided Adversarial Training (LBGAT)[7] achieved a better trade-off with knowledge transferred from third-party models. [5] used smoothed labels from Knowledge Distillation (KD) to calibrate the notorious overconfidence of logits generated by pre-trained models. However, these methods only distill from the logits of the teacher model, which makes it difficult for the training model to learn highly predictive features from the teacher. By contrast, our method inherits the high-accuracy knowledge of the vanilla model by aligning the extracted features with the vanilla features, achieving a better trade-off between accuracy and robustness.
3 Preliminary
To better understand our proposed method, we briefly introduce deep neural networks, adversarial attacks, adversarial training, and knowledge distillation in this section.
Deep Neural Networks. In this paper, we focus on deep neural network based image classification models. A deep neural network with parameters $\theta$ can be denoted as $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. For layer $l \in \{1, \dots, L\}$, where $L$ is the number of layers, we denote the output at layer $l$ as $f_\theta^{l}(x)$. Usually, the training process of a neural network is to minimize a loss function $\mathcal{L}$ on training data $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of training instances and $y_i$ is the ground truth of $x_i$; $\mathcal{L}$ is usually the cross-entropy loss for a classification model.
Adversarial Attack. Recent studies show that deep learning models are vulnerable to adversarial examples [32]. In this paper, we focus on untargeted adversarial attacks. Given a classification model $f_\theta$, the goal of an untargeted adversarial attack is to find a small perturbation $\delta$ to generate an adversarial example $x^{adv} = x + \delta$ that misleads the classifier, i.e., $f_\theta(x^{adv}) \neq y$. Typically, the $\ell_p$-norm of the perturbation should be less than a budget $\epsilon$, i.e., $\|\delta\|_p \leq \epsilon$.
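To make this concrete, below is a minimal PyTorch sketch of an untargeted $\ell_\infty$ PGD attack (FGSM corresponds to a single step with the step size set to $\epsilon$ and no random start); the function name and the step size `alpha` are illustrative choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=20):
    """Untargeted l_inf PGD: ascend the loss step by step, projecting back into the eps-ball."""
    x_adv = (x.detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Take a signed gradient ascent step, then project onto the l_inf ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```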
Adversarial Training. Adversarial training is a method for defending against adversarial attacks by augmenting the training dataset with adversarial examples. It can be formulated as a min-max optimization problem as follows:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\|_p \leq \epsilon} \mathcal{L}\big(f_\theta(x + \delta), y\big) \Big], \qquad (1)$$

where $x + \delta$ is the adversarial example that maximizes the loss $\mathcal{L}$ within the $\ell_p$-norm distance $\epsilon$, and $\theta$ denotes the parameters of the model $f_\theta$ that need to be updated to minimize the loss.
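As a sketch of how Eq. (1) is approximately optimized in practice, the inner maximization can be carried out with the `pgd_attack` routine sketched above and the outer minimization with stochastic gradient descent; the function name and optimizer choice are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of standard adversarial training (approximate min-max of Eq. (1))."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)          # inner maximization: worst-case examples
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # outer minimization on adversarial examples
        loss.backward()
        optimizer.step()
```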
Knowledge Distillation. Knowledge distillation is a popular and successful model compression technique that can transfer knowledge from a large pre-trained teacher model to a smaller student model. Given a teacher model $T$ with parameters $\theta_T$ and a student model $S$ with parameters $\theta_S$, knowledge distillation can be formulated as minimizing a combined loss of soft and hard labels by updating $\theta_S$:

$$\mathcal{L}_{KD} = \alpha \cdot \mathcal{D}_{KL}\big(z_S, z_T\big) + (1 - \alpha) \cdot \mathcal{L}_{CE}\big(z_S, y\big), \qquad (2)$$

where $\mathcal{D}_{KL}$ and $\mathcal{L}_{CE}$ are the metric criteria: $\mathcal{D}_{KL}$ is the Kullback-Leibler divergence [12] and $\mathcal{L}_{CE}$ is the cross-entropy; $z_S$ and $z_T$ are the logits of $S$ and $T$ respectively, $y$ is the ground truth label, and $\alpha$ is a hyper-parameter that controls the ratio between the two terms. After distillation, the student network can achieve better performance than training only with the dataset.
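For reference, a minimal PyTorch sketch of an Eq. (2)-style distillation objective; the temperature `tau` is a common ingredient of KD that the text above does not mention, so it is an assumption here.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, y, alpha=0.9, tau=4.0):
    """Combined soft (KL to the teacher) and hard (CE to the labels) loss, Eq. (2)-style."""
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau * tau)                        # standard temperature scaling of the soft term
    hard = F.cross_entropy(student_logits, y)
    return alpha * soft + (1 - alpha) * hard
```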
4 Vanilla Feature Distillation Adversarial Training
In this paper, we propose Vanilla Feature Distillation Adversarial Training (VFD-Adv) to realize a better trade-off between robustness and accuracy. This section first gives an overview of the proposed VFD-Adv and then describes the detailed design of the network and loss functions. Finally, we discuss the training strategy of VFD-Adv.
4.1 Overview of VFD-Adv
Since adversarial vulnerability is caused by non-robust and sensitive features (non-semantic and imperceptible features, easily tampered with by the adversary), adversarial training serves as a typical method to improve robustness against adversarial attacks by incorporating adversarial examples into the process of model training, forcing the model to focus more on robust features and thus become immune to tampered adversarial examples. However, those non-robust features are highly predictive and can be exploited by machines, and discarding them leads to large accuracy degradation on clean examples.
To solve this problem, we propose Vanilla Feature Distillation Adversarial Training (VFD-Adv) to realize a better trade-off between accuracy and robustness by distilling predictive representations from a highly accurate vanilla model. The basic idea is that preserving those non-robust but predictive features in adversarial training can mitigate accuracy degradation. To achieve this goal, we constrain the features extracted by the adversarial training model, which takes both adversarial examples and clean examples as input, to be similar to those extracted by the vanilla model from the same clean examples. Therefore, those predictive features are preserved, achieving high accuracy and robustness at the same time.
The pipeline of VFD-Adv is overviewed in Fig. 1, where there are two components: 1) the high-accuracy vanilla model trained with only clean examples, and 2) the adversarial training model that is updated in VFD-Adv. The vanilla model can be an off-the-shelf model or pre-trained by ourselves, and the adversarial training model is the model we obtain after VFD-Adv training. Please note that the vanilla model and the adversarial training model have the same architecture, and the parameters of the vanilla model are frozen once pre-trained. Sharing the spirit of general knowledge distillation and adversarial training, during the training process the vanilla model provides vanilla features for clean examples, while the adversarial training model extracts features for both clean examples and adversarial examples. By matching its features to the vanilla features, the adversarial training model learns to preserve those predictive features in adversarial training, thus achieving high accuracy and robustness at the same time.
[Figure 1: Overview of the VFD-Adv pipeline.]
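As a minimal sketch of this two-branch setup (assuming torchvision's ResNet18 stands in for the CIFAR variant used in the experiments, and a hypothetical checkpoint path for the clean pre-trained weights):

```python
import copy
import torch
from torchvision.models import resnet18

# Adversarial training model (to be updated) and vanilla model (frozen teacher), same architecture.
model = resnet18(num_classes=10)
vanilla = copy.deepcopy(model)
vanilla.load_state_dict(torch.load("vanilla_resnet18_clean.pth"))  # hypothetical clean checkpoint
vanilla.eval()
for p in vanilla.parameters():
    p.requires_grad_(False)  # freeze the vanilla model once pre-trained
```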
4.2 Loss Functions
In this part, we detail the loss functions of the above-mentioned VFD-Adv. As illustrated in the overview, we design three loss functions: 1) the clean loss $\mathcal{L}_{clean}$ for the supervision of clean examples, 2) the robustness loss $\mathcal{L}_{rob}$ for adversarial training, and 3) the self-distillation loss $\mathcal{L}_{vfd}$ for guiding adversarial training towards higher accuracy.
The clean loss $\mathcal{L}_{clean}$: In the training of VFD-Adv, since our aim is to obtain a model with both high accuracy and high robustness, we take clean examples and their labels as one source of supervision, as in general adversarial training. This loss can be expressed as:

$$\mathcal{L}_{clean} = \mathcal{L}_{CE}\big(f_\theta(x), y\big), \qquad (3)$$

where $x$ denotes a clean example, $y$ is the ground truth of $x$, and $\mathcal{L}_{CE}$ is the cross-entropy loss for multi-class classification.
The robustness loss $\mathcal{L}_{rob}$: Meanwhile, we need to improve the robustness of the model $f_\theta$. A typical way to improve robustness is adversarial training, i.e., incorporating adversarial examples into the process of model training. In this way, the adversarial training model learns to correctly classify the adversarial examples. The loss for adversarial training can be expressed as:

$$\mathcal{L}_{rob} = \mathcal{D}\big(f_\theta(x^{adv}), \hat{y}\big), \qquad (4)$$

where $x^{adv}$ is the adversarial example generated from $x$, $\hat{y}$ can be either soft labels (e.g., logits in TRADES[34]) or hard labels (the ground truth in ALP[14]), and $\mathcal{D}$ is a distance metric (e.g., the Kullback-Leibler divergence in TRADES[34], the cross-entropy in ALP[14]). By itself, $\mathcal{L}_{rob}$ tends to force the adversarial model to omit non-robust but predictive features of the data, leading to the dilemma between gaining more robustness and preserving higher accuracy.
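Continuing the sketches above, Eq. (4) could be instantiated in either of the two ways mentioned in the text (soft labels as in TRADES, hard labels as in ALP); the function below is an illustrative sketch, not the paper's exact formulation.

```python
import torch.nn.functional as F

def robustness_loss(model, x, x_adv, y, variant="soft"):
    """Eq. (4): distance between the prediction on x_adv and a soft or hard target."""
    if variant == "soft":
        # Soft target: KL divergence between adversarial and clean predictions (TRADES-style).
        return F.kl_div(
            F.log_softmax(model(x_adv), dim=1),
            F.softmax(model(x).detach(), dim=1),
            reduction="batchmean",
        )
    # Hard target: cross-entropy against the ground-truth label (ALP-style).
    return F.cross_entropy(model(x_adv), y)
```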
The self-distillation loss $\mathcal{L}_{vfd}$: The design of $\mathcal{L}_{vfd}$ begins with knowledge distillation, one of the most popular techniques for transferring knowledge from one model to another. In particular, when the two architectures are identical, this is called self-distillation, and the self-distilled model achieves higher accuracy on held-out data [21, 35]. The basic idea is that if we constrain the features of the adversarial training model, which takes adversarial examples and clean examples as input, to be similar to those of the vanilla model fed with the same clean examples, non-robust but predictive features will be preserved during adversarial training, thus mitigating accuracy degradation. Therefore, to obtain those predictive features, we pre-train a vanilla model $f_{\theta_v}$ that is optimized towards high accuracy with only clean examples, and distill knowledge from its intermediate layer. More specifically, we call this knowledge the vanilla features (i.e., the output of the vanilla model at layer $l$ for clean examples in our setting). The loss $\mathcal{L}_{vfd}$ is defined as:

$$\mathcal{L}_{vfd} = d\big(f_\theta^{l}(x^{adv}), f_{\theta_v}^{l}(x)\big) + d\big(f_\theta^{l}(x), f_{\theta_v}^{l}(x)\big), \qquad (5)$$

where $d$ is a distance metric (the Euclidean distance in this paper), $l$ denotes the layer index, $f_\theta^{l}(x^{adv})$ and $f_\theta^{l}(x)$ are the outputs of the adversarial training model at layer $l$ for adversarial and clean examples respectively, and $f_{\theta_v}^{l}(x)$ is the output of the vanilla model at layer $l$ for clean examples (i.e., the vanilla features). The first term matches the features of adversarial examples extracted by the adversarial model to their clean counterparts in feature space, while the second term forces the features of clean examples extracted by the adversarial model to approximate the vanilla features. Different from traditional self-distillation, $\mathcal{L}_{vfd}$ forces both the features of clean examples and the features of adversarial examples to match the vanilla features, so the adversarial model does not omit non-robust features, which mitigates accuracy degradation.
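A minimal sketch of Eq. (5), assuming the `model`/`vanilla` pair set up earlier and ResNet18's `layer4` as the vanilla feature layer $l$ (the layer choice follows Sec. 5.4; the hook helper and names are illustrative):

```python
import torch

class FeatureTap:
    """Registers a forward hook that stores the output of a chosen layer."""
    def __init__(self, module):
        self.feat = None
        module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        self.feat = output

adv_tap = FeatureTap(model.layer4)    # layer l of the adversarial training model
van_tap = FeatureTap(vanilla.layer4)  # layer l of the frozen vanilla model

def vfd_loss(x, x_adv):
    """Eq. (5): match features of both x_adv and x to the vanilla features of x (Euclidean distance)."""
    with torch.no_grad():
        vanilla(x)                                   # populates van_tap.feat with vanilla features
    vanilla_feat = van_tap.feat.flatten(1)
    model(x_adv)
    adv_feat = adv_tap.feat.flatten(1)               # features of adversarial examples
    model(x)
    clean_feat = adv_tap.feat.flatten(1)             # features of clean examples
    return ((adv_feat - vanilla_feat).pow(2).sum(1).sqrt().mean()
            + (clean_feat - vanilla_feat).pow(2).sum(1).sqrt().mean())
```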
4.3 Training Strategy
In this part, we describe the training strategy of VFD-Adv. Instead of proposing a new adversarial training paradigm, our method is a plug-in for existing adversarial training methods. Based on Eq. 3, Eq. 4, and Eq. 5, we first pre-train a high-accuracy vanilla model $f_{\theta_v}$ with only clean examples, whose architecture is the same as that of the adversarial training model $f_\theta$. Then, we take the output of the vanilla model at layer $l$ for clean examples as the vanilla features $f_{\theta_v}^{l}(x)$. At the same time, the corresponding outputs of the adversarial training model for adversarial examples and clean examples are extracted. We then update the model under the supervision of $\mathcal{L}_{clean}$ to achieve better accuracy, the adversarial training loss $\mathcal{L}_{rob}$ to improve robustness, and the self-distillation loss $\mathcal{L}_{vfd}$ to mitigate accuracy degradation. Overall, the objective of VFD-Adv can be formulated as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{clean} + \lambda_1 \mathcal{L}_{rob} + \lambda_2 \mathcal{L}_{vfd}, \qquad (6)$$

where the hyperparameters $\lambda_1$ and $\lambda_2$ adjust the contribution of each term in the total loss. A large value of $\lambda_1$ forces the model to update towards minimizing the loss on adversarial examples, so the robustness of the model increases while the accuracy on clean examples decreases. $\lambda_2$ controls the feature difference between the extracted features and the vanilla features. The detailed training algorithm of VFD-Adv can be found in Algorithm 1.
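Putting Eq. (6) together, a single VFD-Adv training step might look like the sketch below, reusing `pgd_attack` and the feature taps from the earlier sketches; the weights $\lambda_1$, $\lambda_2$ and the hard-label form of $\mathcal{L}_{rob}$ are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

lambda1, lambda2 = 1.0, 0.02   # illustrative trade-off weights for Eq. (6)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

def vfd_adv_step(x, y):
    model.train()
    x_adv = pgd_attack(model, x, y)                       # inner maximization
    optimizer.zero_grad()
    logits_clean = model(x)
    clean_feat = adv_tap.feat.flatten(1)                  # layer-l features of clean examples
    logits_adv = model(x_adv)
    adv_feat = adv_tap.feat.flatten(1)                    # layer-l features of adversarial examples
    with torch.no_grad():
        vanilla(x)
    vanilla_feat = van_tap.feat.flatten(1)                # frozen vanilla features
    loss_clean = F.cross_entropy(logits_clean, y)                          # Eq. (3)
    loss_rob = F.cross_entropy(logits_adv, y)                              # Eq. (4), hard-label form
    loss_vfd = ((adv_feat - vanilla_feat).pow(2).sum(1).sqrt().mean()
                + (clean_feat - vanilla_feat).pow(2).sum(1).sqrt().mean()) # Eq. (5)
    loss = loss_clean + lambda1 * loss_rob + lambda2 * loss_vfd            # Eq. (6)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a plug-in use of VFD-Adv, the $\mathcal{L}_{rob}$ term above would simply be replaced by the loss of the underlying adversarial training method (e.g., TRADES or MART).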
5 Experiments
5.1 Experimental Settings
Datasets and Classifiers. For a fair comparison, we follow previous works[30, 7, 5] and conduct experiments on CIFAR10 and CIFAR100[15]. The CIFAR10 dataset has 10 classes with 6,000 images per class, while the CIFAR100 dataset has 100 classes with 600 images per class. Each dataset is split into a training set and a test set at a ratio of 5:1. In addition, we evaluate our method on ResNet18[11] and VGG16[27], which are widely used in adversarial robustness benchmarks.
Attack Methods. We consider three typical attack methods, i.e., PGD (steps=20, $\epsilon$=8/255), FGSM ($\epsilon$=8/255), and CW ($\kappa$=0, steps=1000, lr=0.01). Moreover, we evaluate the robustness of the model in two attack settings (i.e., the white-box setting and the black-box setting), and we use WideResNet[33] as the surrogate model to generate adversarial examples in the black-box setting.
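For reproducibility, these evaluation settings could be expressed, for example, with the torchattacks library as sketched below; the PGD step size `alpha` is an assumption since the text does not state it, and for the black-box setting the attacks would be built on the WideResNet surrogate rather than the defended model.

```python
import torch
import torchattacks

# Evaluation attacks (white-box when built on the defended model itself).
attacks = {
    "PGD":  torchattacks.PGD(model, eps=8/255, alpha=2/255, steps=20),  # alpha is assumed
    "FGSM": torchattacks.FGSM(model, eps=8/255),
    "CW":   torchattacks.CW(model, kappa=0, steps=1000, lr=0.01),
}

def robust_accuracy(model, attack, loader, device="cuda"):
    """Fraction of test examples still classified correctly under the given attack."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(x, y)                      # craft adversarial examples
        with torch.no_grad():
            correct += (model(x_adv).argmax(1) == y).sum().item()
        total += y.size(0)
    return correct / total
```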
Baselines and Parameter Settings. To verify the effectiveness of the proposed method, we compare it with four related state-of-the-art adversarial training methods (i.e., ALP[14], TRADES[34], MART[30], HAT[25]) and three enhanced methods (i.e., KD[5], BGAT, and LBGAT[7]) which can further improve their performance. For these methods, we follow their default parameter settings and use the SGD optimizer[6] with a learning rate of 0.1 and a batch size of 128 to train the corresponding robust models. We conduct all experiments on a server with one Nvidia Tesla V100 16GB GPU card.
Method | black-box PGD | black-box FGSM | black-box CW | white-box PGD | white-box FGSM | white-box CW | Clean ACC |
Vanilla | 8.2% | 41.1% | 90.5% | 0.0% | 22.0% | 0.0% | 94.4% |
ALP | 83.1% | 82.5% | 81.4% | 51.8% | 55.9% | 29.7% | 83.7% |
ALP+KD | 82.9% | 83.0% | 84.8% | 51.9% | 58.4% | 33.1% | 84.0% |
ALP+BGAT | 83.5% | 82.5% | 84.3% | 52.2% | 57.3% | 30.8% | 84.1% |
ALP+LBGAT | 83.3% | 82.7% | 83.4% | 52.1% | 59.6% | 31.4% | 84.5% |
ALP+VFD-Adv | 86.1% | 86.0% | 87.8% | 52.6% | 61.3% | 33.1% | 88.3% |
TRADES | 79.9% | 79.5% | 80.1% | 53.5% | 56.7% | 53.2% | 81.1% |
TRADES+KD | 81.1% | 79.6% | 80.3% | 52.8% | 56.8% | 50.1% | 82.1% |
TRADES+BGAT | 80.4% | 81.6% | 81.3% | 53.7% | 57.2% | 51.3% | 82.9% |
TRADES+LBGAT | 81.2% | 81.7% | 82.1% | 54.1% | 57.6% | 52.7% | 83.1% |
TRADES+VFD-Adv | 86.3% | 85.4% | 86.8% | 54.7% | 59.8% | 55.5% | 87.5% |
MART | 81.0% | 80.8% | 82.3% | 55.4% | 59.5% | 46.3% | 82.3% |
MART+KD | 79.9% | 79.3% | 81.0% | 54.9% | 58.4% | 45.7% | 81.1% |
MART+BGAT | 81.0% | 80.5% | 82.1% | 54.4% | 58.7% | 43.4% | 82.1% |
MART+LBGAT | 80.9% | 80.4% | 82.0% | 55.2% | 59.2% | 45.7% | 82.0% |
MART+VFD-Adv | 88.4% | 86.8% | 87.9% | 54.6% | 61.6% | 47.1% | 88.7% |
HAT | 82.5% | 81.9% | 83.8% | 49.8% | 55.4% | 44.9% | 83.9% |
HAT+KD | 84.5% | 84.1% | 85.8% | 50.1% | 56.8% | 43.1% | 84.9% |
HAT+BGAT | 83.2% | 82.7% | 84.2% | 50.4% | 56.9% | 43.8% | 84.2% |
HAT+LBGAT | 84.0% | 83.2% | 84.8% | 50.0% | 56.1% | 45.2% | 84.9% |
HAT+VFD-Adv | 86.5% | 85.8% | 87.6% | 52.2% | 59.5% | 46.5% | 88.3% |
5.2 Evaluation on Robustness and Accuracy
Fig. 2 compares the accuracy vs. robustness trade-off of the original TRADES and its corresponding enhanced versions. We vary the robustness parameter in TRADES to obtain models with different levels of robustness. We observe that models trained by different methods exhibit similar robustness under the same parameter value, but our method (i.e., TRADES+VFD-Adv) significantly outperforms the others in clean accuracy, achieving a better trade-off between accuracy and robustness.
To quantitatively compare the proposed VFD-Adv with the baselines, we present the robust and clean accuracy of the correspondingly trained robust ResNet18 models under several typical attack methods. Table 1 shows the results on CIFAR10 and Table 2 shows the results on CIFAR100.
As shown in Table 1, our method significantly improves the robust and clean accuracy of the original adversarial training methods in most cases (white-box and black-box attacks). For example, compared with the original TRADES, TRADES+VFD-Adv (i.e., TRADES equipped with our proposed method) further improves the robust accuracy by 4.3% and the clean accuracy by 6.4% on average. Meanwhile, TRADES+VFD-Adv also outperforms the state-of-the-art enhanced methods (e.g., TRADES equipped with LBGAT) by 3.1% in robust accuracy and 4.4% in clean accuracy on average, achieving the best performance. Moreover, we can draw the same conclusions when combining our method with other adversarial training methods (i.e., ALP, MART, HAT), which indicates the scalability and effectiveness of our method.
Method | black-box PGD | black-box FGSM | black-box CW | white-box PGD | white-box FGSM | white-box CW | Clean ACC |
Vanilla | 17.4% | 18.9% | 69.9% | 0.0% | 7.3% | 0.0% | 75.2% |
ALP | 50.5% | 51.6% | 51.2% | 30.0% | 31.7% | 23.2% | 51.4% |
ALP+KD | 55.0% | 54.9% | 54.5% | 30.4% | 29.8% | 20.1% | 56.0% |
ALP+BGAT | 52.1% | 49.5% | 50.2% | 30.4% | 33.9% | 21.1% | 56.3% |
ALP+LBGAT | 59.3% | 58.5% | 58.1% | 30.5% | 31.1% | 22.0% | 59.4% |
ALP+VFD-Adv | 62.3% | 61.6% | 62.9% | 31.8% | 33.0% | 23.2% | 63.2% |
TRADES | 55.0% | 54.4% | 55.9% | 25.4% | 28.0% | 24.5% | 56.0% |
TRADES+KD | 56.2% | 55.5% | 57.4% | 29.3% | 31.7% | 27.3% | 57.4% |
TRADES+BGAT | 59.6% | 59.0% | 61.1% | 29.6% | 32.8% | 27.5% | 61.2% |
TRADES+LBGAT | 54.7% | 55.9% | 57.6% | 28.9% | 32.5% | 26.3% | 57.6% |
TRADES+VFD-Adv | 64.8% | 63.1% | 67.2% | 30.1% | 34.1% | 27.7% | 67.3% |
MART | 53.9% | 53.1% | 54.8% | 32.2% | 34.0% | 28.2% | 54.9% |
MART+KD | 46.7% | 46.4% | 47.6% | 30.3% | 31.3% | 30.4% | 56.6% |
MART+BGAT | 57.3% | 56.3% | 58.5% | 29.8% | 32.7% | 29.9% | 58.6% |
MART+LBGAT | 56.1% | 55.5% | 57.2% | 32.4% | 33.5% | 30.3% | 57.2% |
MART+VFD-Adv | 62.1% | 62.9% | 62.4% | 32.5% | 35.4% | 30.4% | 63.8% |
HAT | 57.5% | 56.8% | 59.0% | 26.5% | 29.5% | 20.2% | 59.0% |
HAT+KD | 58.2% | 57.6% | 59.2% | 25.3% | 29.0% | 19.0% | 59.2% |
HAT+BGAT | 60.4% | 59.5% | 61.8% | 27.2% | 30.8% | 19.1% | 61.8% |
HAT+LBGAT | 60.1% | 59.4% | 60.1% | 27.7% | 30.0% | 18.7% | 61.2% |
HAT+VFD-Adv | 62.7% | 62.4% | 63.1% | 28.0% | 32.8% | 20.1% | 64.5% |
In addition, we present comparison results on a more complex classification task (i.e., CIFAR100) in Table 2. The results show that our method can still improve the robust accuracy by 2.3% and the clean accuracy by 4.5% on average compared with the other state-of-the-art methods, demonstrating the superior performance and generality of our proposed method across datasets. In conclusion, the proposed method significantly mitigates accuracy degradation while also improving robustness. Meanwhile, the proposed method can be flexibly adapted to existing related works as an extra, effective regularization term. We report the experimental results of VGG16 in the supplementary file, which show even larger improvements than those on ResNet18.
5.3 Feature Visualization Analysis
Furthermore, to qualitatively analyze the effectiveness of our proposed VFD-Adv, we randomly select 5 classes in CIFAR10 and utilize t-SNE[29] to visualize the features of 500 clean examples per class and the corresponding adversarial examples generated by PGD. As shown in Fig. 3, the first row presents the t-SNE results of clean examples and the second row presents the results of the corresponding adversarial examples. For the vanilla model, we observe 5 clearly separable clusters in the case of clean examples but 10 clusters in the case of adversarial examples. This phenomenon indicates that the vanilla model cannot extract the right discriminative features from adversarial examples and hence classifies them into wrong classes. For the existing defense methods (e.g., TRADES), the features of adversarial examples are separated into 5 clusters to some degree, but unexpected blurred boundaries appear for clean examples, leading to degradation in clean accuracy. In contrast, as shown in the rightmost column, our method forces the features of both adversarial and clean examples extracted by the adversarial model to be close to the vanilla features, thus realizing clear boundaries between clusters for both clean and adversarial examples. These results are consistent with the rationale of the proposed method.
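A minimal sketch of this visualization procedure, assuming `features` is an (N, D) array of layer-$l$ outputs collected for 5 classes × 500 examples (clean or PGD-perturbed) and `labels` their class ids; the scikit-learn t-SNE parameters and plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(features, labels, title):
    """Project layer-l features to 2-D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(4, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.axis("off")
    plt.show()
```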
5.4 Effect of Parameters in VFD-Adv
There are two parameters, i.e., the choice of the vanilla feature layer $l$ and the distillation parameter $\lambda_2$, which affect the performance of the proposed VFD-Adv. Here, we adopt ResNet18 trained on CIFAR10 as the vanilla model to explore the effects of the two parameters. Since ResNet18 has four residual blocks, we choose the output of the last Conv layer of each block (e.g., Layer_1, etc.) and the logit layer respectively, and vary $\lambda_2$ from 0 to 0.04 with a step of 0.005. As shown in Fig. 4, the proposed method achieves the best performance in both robust and clean accuracy when using the last residual block (i.e., Layer_4). Compared to earlier layers, Layer_4 contains more high-level and task-specific information while keeping more spatial information than the logit features, hence Layer_4 is the better choice. In addition, the distillation parameter $\lambda_2$ also significantly influences the robust and clean accuracy. The distillation term is expected to guide the robust model to learn more predictive features like the vanilla model, but it also makes the robust model less robust due to its reliance on such non-robust features. Therefore, as shown in Fig. 4, a larger $\lambda_2$ tends to yield higher clean accuracy but diminishes robustness, while a smaller $\lambda_2$ achieves better robustness at the cost of clean accuracy.
[Figure 4: Effect of the vanilla feature layer and the distillation parameter on robust and clean accuracy.]
6 Conclusion
In this work, we proposed Vanilla Feature Distillation Adversarial Training (VFD-Adv) to achieve a better trade-off between accuracy and robustness. VFD-Adv conducts knowledge distillation from a high-accuracy pre-trained model to guide the adversarial training model to learn more predictive features, which allows the model to maximally preserve accuracy while gaining robustness. Moreover, the proposed method can be flexibly adapted to and boost existing adversarial training methods. Extensive experiments conducted across different datasets, network architectures, and adversarial training algorithms demonstrate the state-of-the-art performance of our method. We hope that our method can serve as a benchmark for adversarial robustness and inspire the community to further ameliorate the accuracy-robustness trade-off.
Broader Impact and Limitations: It is crucial to train deep learning models that are robust against adversarial attacks. While adversarial training produces a robust model, it also decreases the model's accuracy. This paper proposes Vanilla Feature Distillation Adversarial Training (VFD-Adv) to improve adversarial training and effectively increase the accuracy of robust models, though it introduces slightly more training time than typical adversarial training. Our work is a worthwhile step towards further adversarial robustness research and has no known socially detrimental effects.
References
- [1] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. Recent advances in adversarial training for adversarial robustness. arXiv preprint arXiv:2102.01356, 2021.
- [2] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Practical black-box attacks on deep neural networks using efficient query mechanisms. In Proceedings of the European Conference on Computer Vision (ECCV), pages 154–169, 2018.
- [3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- [4] Dian Chen, Hongxin Hu, Qian Wang, Li Yinli, Cong Wang, Chao Shen, and Qi Li. Cartl: Cooperative adversarially-robust transfer learning. In International Conference on Machine Learning (ICML), pages 1640–1650, 2021.
- [5] Tianlong Chen, Zhenyu Zhang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Robust overfitting may be mitigated by properly learned smoothening. In International Conference on Learning Representations (ICLR), 2020.
- [6] J Michael Cherry, Caroline Adler, Catherine Ball, Stephen A Chervitz, Selina S Dwight, Erich T Hester, Yankai Jia, Gail Juvik, TaiYun Roe, Mark Schroeder, et al. Sgd: Saccharomyces genome database. Nucleic Acids Research, 26(1):73–79, 1998.
- [7] Jiequan Cui, Shu Liu, Liwei Wang, and Jiaya Jia. Learnable boundary guided adversarial training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15721–15730, 2021.
- [8] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9185–9193, 2018.
- [9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR), 17(1):2096–2030, 2016.
- [10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 630–645, 2016.
- [12] John R Hershey and Peder A Olsen. Approximating the kullback leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages IV–317, 2007.
- [13] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems (NIPS), 32, 2019.
- [14] Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
- [15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), 25, 2012.
- [17] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. Adversarial attacks and defences competition. In The NIPS’17 Competition: Building Intelligent Systems, pages 195–231. 2018.
- [18] Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 163–168. IEEE, 2011.
- [19] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- [20] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [21] Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems (NIPS), 33, 2020.
- [22] Thomas B Moeslund and Erik Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding (CVIU), 81(3):231–268, 2001.
- [23] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM Computer and Communications Security (CCS), pages 506–519, 2017.
- [24] Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4422–4431, 2018.
- [25] Rahul Rade and Seyed-Mohsen Moosavi-Dezfooli. Helper-based adversarial training: Reducing excessive margin to achieve a better accuracy vs. robustness trade-off. In ICML 2021 Workshop on Adversarial Machine Learning, 2021.
- [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (NIPS), 28, 2015.
- [27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [28] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
- [29] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research (JMLR), 9(11), 2008.
- [30] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations (ICLR), 2019.
- [31] Zhibo Wang, Hengchang Guo, Zhifei Zhang, Wenxin Liu, Zhan Qin, and Kui Ren. Feature importance-aware transferable adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7639–7648, 2021.
- [32] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 30(9):2805–2824, 2019.
- [33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- [34] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML), pages 7472–7482. PMLR, 2019.
- [35] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3713–3722, 2019.