
Exploring Robust Features for Improving Adversarial Robustness

Hong Wang, Yuefan Deng, Shinjae Yoo, Yuewei Lin. H. Wang, S. Yoo, and Y. Lin are with the Computational Science Initiative, Brookhaven National Laboratory, Upton, NY, USA. Y. Deng is with the Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA. *Y. Lin is the corresponding author.
Abstract

While deep neural networks (DNNs) have revolutionized many fields, their fragility to carefully designed adversarial attacks impedes the usage of DNNs in safety-critical applications. In this paper, we strive to explore robust features that are not affected by adversarial perturbations, i.e., that are invariant between a clean image and its adversarial examples, to improve the model’s adversarial robustness. Specifically, we propose a feature disentanglement model to segregate the robust features from non-robust features and domain-specific features. Extensive experiments on four widely used datasets with different attacks demonstrate that the robust features obtained from our model improve the model’s adversarial robustness compared to state-of-the-art approaches. Moreover, the trained domain discriminator identifies the domain-specific features of clean images and adversarial examples almost perfectly. This enables adversarial example detection without incurring additional computational costs. With that, we can also specify different classifiers for clean images and adversarial examples, thereby avoiding any drop in clean-image accuracy.

Index Terms:
Adversarial attacks, Robustness, Feature disentanglement.

I Introduction

The rapid development of deep neural networks (DNNs) has achieved great success in a variety of domains. However, the pervasive brittleness of DNNs, i.e., their vulnerability to adversarial examples (AEs), which are inputs with imperceptible perturbations added, has drawn significant attention [1] because it raises doubts about deploying DNNs in safety-critical, real-world applications.

While tremendous effort has been made to explore the mechanism of adversarial examples, it is still an open problem. [1] suggests that adversarial examples are low-probability but dense statistical variations of the dataset. [2] argues that the linearity of deep models reduces their ability to resist adversarial examples. There are also works that try to explain the effect of adversarial examples by properties of the data itself rather than the training scheme. [3] shows that robustness requires a larger sample complexity, that is, a larger training set. More recently, Ilyas et al. [4] explain the phenomenon from the viewpoint of an inherent property of the data: the fragility against adversarial examples comes from the non-robust (NR) features of the data. Because the model is not restricted to using human-meaningful features, it may be prone to relying on human-incomprehensible features during training. For natural images, these NR features are highly predictive, while for adversarial examples they can be easily perturbed and lead to incorrect outputs. Correspondingly, the authors also demonstrate the existence of robust features that preserve robustness.

Motivated by [4], we strive to explore robust features to improve the model’s robustness. As the feature representation of an image is usually an entanglement of different types of features, we propose a feature disentanglement model to separate the robust features from the other types. We argue that there are three types of features in the raw features. The first is the robust (R) features, the intrinsic representations of a specific class, such as shapes and colors, which should be invariant across domains, i.e., between natural (clean) images and adversarial examples if we treat them as different domains. As shown in the second column of Figure 1, the robust features remain unchanged between the clean image and its adversarial example. The second is the non-robust (NR) features, human-incomprehensible representations related to certain details. NR features from natural images and adversarial examples usually yield different classification predictions because they can be easily contaminated by small perturbations. As shown in the third column of Figure 1, the non-robust features of the clean image, although different from the robust features, focus on the regions around the object and may attend to information complementary to the robust features. In contrast, the non-robust features of the adversarial example focus on regions far from the object and thus lead to an incorrect prediction. Both R and NR features are class-relevant but domain-irrelevant. In addition to R/NR features, we argue that, given the domain shift between clean images and adversarial examples (e.g., the misalignment of statistical distributions and classification performance), there are also features relevant to the domains, which we refer to as domain-specific (DS) features. A discriminator is jointly trained in our framework to classify DS features into different domains.

Figure 1: Grad-CAM [5] visualization of robust (R), non-robust (NR), and domain specific (DS) features of a natural image and its adversarial example.

We visualize the DS features in the last column of Figure 1. It shows the focus regions of the domain discriminator for a clean image and its adversarial example, which should have no association with the class label, indicating a lack of semantic meaning. Once well trained, the disentangled robust features from our framework preserve a higher accuracy against adversarial attacks. Moreover, adversarially trained models usually suffer from a performance drop on natural images. For example, a naturally trained model can achieve approximately 96% testing accuracy on clean images, while this number drops to around 86% for a standard adversarially trained model [6]. As a side product, the discriminator obtained from our framework performs outstandingly at telling whether a given image’s domain-specific features belong to a clean image or an adversarial example, and thus it can be applied to adversarial example detection. An image classified as natural by our discriminator is handled by the naturally trained model; thus, we can raise the testing accuracy on natural images to almost the same level as the naturally trained model.

There are recent studies trying to improve adversarial robustness through feature disentanglement or distillation. [7] splits features into class-specific and class-irrelevant ones; [8] utilizes an adversarial training strategy to extract domain-invariant features; [9] decomposes an image into class-essential and class-redundant parts via a variational auto-encoder; [10] also tries to find attack-invariant features but relies on a pre-processing framework. None of them explicitly disentangles robust and non-robust features. [11] extracts robust features via a feature channel selection strategy, so the extracted robust features are not fully purified. Our framework explicitly disentangles the R/NR and DS features and improves adversarial robustness.

From the domain adaptation/generalization point of view, natural images and adversarial examples form two different domains because of the misalignment of their statistical distributions and model performance [8]. A natural question is: while we can generate adversarial examples with one or more known attacks as source domains during training, is our model general enough to defend against other attacks unseen in training (the target domain)? This is a typical domain generalization problem [12]. In the field of domain generalization, representation learning is a category of methods that learn domain-agnostic features which generalize from source domains to unseen domains by either decomposition [13] or generative modeling [14], among which feature disentanglement provides an effective way to disentangle a representation into domain-specific and domain-invariant features [12, 15]. Variants of feature-disentanglement frameworks tackle the problem differently. [14] disentangles the representation into domain-invariant, domain-specific and class-irrelevant features. [16] learns three latent spaces for domain, class and other variations independently. In this paper, we explicitly disentangle the domain-agnostic features (both R and NR), and further segregate the robust features that keep the prediction correct under different attacks.

Extensive experiments on four widely used datasets, i.e., CIFAR-10, SVHN, CIFAR-100 and Tiny-ImageNet, with different attacks demonstrate that the robust features obtained from our model improve the model’s adversarial robustness compared to state-of-the-art approaches.

In summary, the contributions of this paper are as follows:

  • We propose a feature disentanglement based framework that explicitly segregates robust features, which are invariant to attacks, from non-robust and domain-specific features.

  • The robust features show higher adversarial robustness than the state-of-the-art adversarial defense models.

  • As side products, the discriminator and domain-specific features can successfully distinguish between natural images and adversarial examples. With that, we can specify different classifiers for clean images and adversarial examples, and thus our model sacrifices almost no accuracy on clean images.

II Related Works

II-A Adversarial Attacks.

Following the discovery of deep neural networks’ vulnerability to adversarial examples, numerous research efforts have been dedicated to producing stronger adversarial attacks. [1] formulates the generation of adversarial examples as a box-constrained optimization problem. [2] proposes a one-step attack named the fast gradient sign method (FGSM), which spawned many variants, including the basic iterative method (BIM) [17], an iterative version of FGSM, and MI-FGSM [18], which integrates momentum. In addition, the projected gradient descent (PGD) attack proposed by [6] and the CW attack proposed by [19] are widely used as criteria to measure the robustness of deep models. [20] attempts to quantify adversarial robustness by computing the minimal adversarial perturbation. In [21], a per-sample attack named AutoAttack is introduced. AutoAttack is an ensemble of four attacks, APGD$_{CE}$, APGD$^{T}_{DLR}$, FAB [22] and Square Attack [23], and it attacks by computing the worst case for each image. In addition to additive attacks, there are also geometric attacks [24, 25, 26], where geometric transformations are applied to the images to fool the classifier. Moreover, [27, 28, 17] use patches in the physical world to confuse the model. Beyond image classification, adversarial examples have been crafted for other tasks, such as object detection [29], deep retrieval models [30], medical deep learning systems [31] and power systems [32], to investigate the robustness of those models. In addition, adversarial attacks have also been deployed to achieve further performance improvements [33, 34].

II-B Adversarial Defense.

Following the discovery of adversarial examples, defenses against such attacks have emerged in response. Previous works show that adversarial-training-based defense methods are among the most effective and popular ones. Adversarial training utilizes adversarial examples to train the model: the algorithm first maximizes the loss function to generate adversarial examples, and then minimizes the loss function to update the parameters. The PGD attack is used in [6], which can be considered the standard adversarial training method. Based on this, [35] adds constraints to pull the logits of natural images and their adversarial examples closer. [36] and [37] constrain the distance between feature vectors of natural images and their adversarial examples with metric learning. [38] reduces the computational complexity of adversarial training by blending the generation of adversarial examples and the optimization of the model parameters. [39, 40] utilize knowledge distillation to transfer adversarial robustness to student networks, and [41] applies knowledge distillation and bidirectional metric learning to correct the perturbations from adversarial examples for both feature maps and latent vectors. [42] uses KL-divergence as a regularization both to generate the adversarial examples and to train the model, while [43] uses feature scattering in the latent space to generate the attacks. [9, 44] use information bottlenecks to reduce redundant information, and [8, 10] try to find invariant features for better robustness. [45] designs a bilateral adversarial training strategy. [46] denoises features to defend against adversarial attacks. In [47], the authors take misclassified examples into consideration and further improve robustness. A recent study [48] protects particular classes from adversarial attacks with cost-sensitive classification and consequently achieves better average accuracy over all classes. Correspondingly, defenses against adversarial attacks on object detection models have also been proposed [49].

II-C Feature disentanglement.

Feature disentanglement is a popular branch in the field of domain generalization. [14] designed a domain-agnostic learning (DAL) framework to extract domain-invariant representations with a deep adversarial disentangled autoencoder (DADA), using domain-specific and class-irrelevant features as auxiliaries. [16, 50] use Variational Auto-Encoders and Wasserstein Auto-Encoders [51] for disentanglement and generalization. [13] devises domain-specific masks to tackle domain generalization in a manner similar to disentanglement. In addition, similar architectures are used in domain adaptation [52, 53], face presentation attack detection [54] and image-to-image translation [55, 56].

III Proposed Method

In this section, we introduce the process of accomplishing the disentanglement of robust, non-robust, and domain-specific features.

Figure 2: (a) Illustration of our proposed framework, which segregates the robust features from non-robust features and domain-specific features using a feature disentanglement model. (b) Reconstruction decoder that constrains the disentanglement to avoid information loss. (c) A further constraint between each pair of the three disentangled features that keeps their mutual dependence small.

III-A Preliminaries

We first briefly introduce standard adversarial training, including adversarial example generation and the loss function. Let $F$ be a deep model parameterized by $\Theta$ and $\mathcal{D}$ be a dataset containing $N$ samples, $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$, where $y_{i}$ is the class label. We use $(x,y)$ to refer to an image-label pair for simplicity. A standard way to generate adversarial examples is to maximize the classification loss as in Eq. 1.

$x_{adv}=x+\arg\max_{\delta\in\Delta}\mathcal{L}(F(x+\delta),y)$   (1)

$\delta$ refers to the perturbation added to the image $x$, and $\Delta$ defines the constraint of an $\ell_{p}$-norm bound on the total perturbation, such as the $\ell_{2}$ and $\ell_{\infty}$ norms. $\mathcal{L}$ is the loss function. Correspondingly, the adversarial training method optimizes the parameters of the model on those adversarial examples and thus can be written in a min-max manner as shown in Eq. 2:

$\min_{\Theta}\mathbb{E}_{x\in\mathcal{D}}\left[\max_{\delta\in\Delta}\mathcal{L}(F(x+\delta),y)\right]$   (2)

where $\mathbb{E}[\cdot]$ denotes the expectation over the whole dataset. The outer minimization in Eq. 2 trains the target model on the adversarial examples, while the inner maximization generates strong adversarial examples. Currently, the most popular attack is Projected Gradient Descent (PGD) [6], which obtains adversarial examples iteratively:

$x_{adv}^{t+1}=x_{adv}^{t}+\alpha\cdot\operatorname{sign}\left(\nabla_{x}\mathcal{L}(F(x_{adv}^{t}),y)\right),\quad \text{s.t.}\ \|x_{adv}-x\|_{\infty}\leq\epsilon$   (3)

where $t\in\{1,2,\dots,T\}$ indexes the attack steps and $\alpha$ is the step size. The constraint keeps the perturbation within an $\epsilon$-ball (also called the attack budget) around $x$ to ensure perceptual similarity.
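For concreteness, the following is a minimal PyTorch sketch of this $\ell_\infty$ PGD generation (Eq. 3); the function name, random start, and default hyperparameters are our own assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=20):
    """L-infinity PGD: repeatedly ascend the classification loss and project back into the eps-ball."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)  # random start (common practice)
    x_adv = torch.clamp(x_adv, 0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # loss-ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball around x
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                   # keep pixels in the valid range
    return x_adv.detach()
```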

III-B Feature Disentanglement

There is a distribution shift between the natural (original) images and adversarial examples because of the added adversarial perturbations. Therefore, we consider the natural images and adversarial examples as two domains. We use $x_{adv}\in\mathcal{D}'$ to denote images from the adversarial domain and $x\in\mathcal{D}$ for images from the natural domain.

As shown in Figure 2, an input image is first fed into the feature extractor $G$, parameterized by $\theta$, to obtain the intermediate features, referred to as $f$ and $f'$ for $x$ and $x_{adv}$, respectively.

$f=G_{\theta}(x);\quad f'=G_{\theta}(x_{adv})$   (4)

The intermediate features, both $f$ and $f'$, are entanglements of robust, non-robust and domain-specific information. We then use three encoders, i.e., the robust encoder $E_{\omega_{r}}$, the non-robust encoder $E_{\omega_{nr}}$ and the domain-specific encoder $E_{\omega_{ds}}$, to disentangle them explicitly. $f$ (or $f'$) is fed into these three encoders, which are expected to output the disentangled robust features $z_{r}$ ($z'_{r}$), non-robust features $z_{nr}$ ($z'_{nr}$) and domain-specific features $z_{ds}$ ($z'_{ds}$), respectively, denoted as:

$\widetilde{z}_{i}=E_{\omega_{i}}(\widetilde{f})$   (5)

where $i\in\{r,nr,ds\}$, $\widetilde{z}_{i}\in\{z_{i},z'_{i}\}$ and $\widetilde{f}\in\{f,f'\}$.
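To make the structure concrete, below is a minimal sketch of a shared extractor and the three encoders; the module shapes, dimensions, and names are placeholders chosen for illustration, not the actual architecture described in Section IV-A.

```python
import torch.nn as nn

class Disentangler(nn.Module):
    """Shared feature extractor G followed by robust / non-robust / domain-specific encoders."""
    def __init__(self, feat_dim=640, z_dim=640):
        super().__init__()
        # stand-in for the ResNet backbone G_theta
        self.G = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
        make_encoder = lambda: nn.Sequential(nn.Linear(feat_dim, z_dim), nn.ReLU(),
                                             nn.Linear(z_dim, z_dim))
        self.E_r, self.E_nr, self.E_ds = make_encoder(), make_encoder(), make_encoder()

    def forward(self, x):
        f = self.G(x)                                   # intermediate features (Eq. 4)
        return self.E_r(f), self.E_nr(f), self.E_ds(f)  # z_r, z_nr, z_ds (Eq. 5)
```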

A natural constraint for complete disentanglement is that the disentangled features $z_{r},z_{nr},z_{ds}$ ($z'_{r},z'_{nr},z'_{ds}$) should have as little dependence on each other as possible. We therefore apply a Kullback-Leibler (KL) divergence loss $\mathcal{L}_{kl}$ between each pair of the three disentangled features, for natural images and adversarial examples respectively, as shown in Figure 2 (c).
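The exact form of $\mathcal{L}_{kl}$ is not spelled out above, so the sketch below shows one plausible reading: a symmetric KL divergence between softmax-normalized feature vectors, summed over the three pairs, which training can then push up (e.g., by minimizing its negative) so that the three codes carry different information.

```python
import torch
import torch.nn.functional as F

def pairwise_kl(z_r, z_nr, z_ds, eps=1e-8):
    """Sum of symmetric KL divergences between softmax-normalized feature vectors (one possible L_kl)."""
    def sym_kl(a, b):
        p, q = F.softmax(a, dim=1), F.softmax(b, dim=1)
        kl_pq = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1)
        kl_qp = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1)
        return (kl_pq + kl_qp).mean()
    return sym_kl(z_r, z_nr) + sym_kl(z_r, z_ds) + sym_kl(z_nr, z_ds)
```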

Robust (R) features. R features are the intrinsic representations of a specific class, and they should be robust to the perturbations, i.e., invariant between natural images and adversarial examples. Therefore, the robust features of a natural image and its adversarial example should be as similar as possible. We minimize the angular distance between these two features as follows:

$\mathcal{L}_{dist}(\theta,\omega_{r})=1-\frac{\left|z_{r}\cdot z'_{r}\right|}{\|z_{r}\|_{2}\,\|z'_{r}\|_{2}}$   (6)

Since R features are what we rely on for our task, they should have good classification ability. We impose the classification loss (specifically, the cross-entropy) on them, as shown in Eq. 7.
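A direct implementation of Eq. 6 is straightforward; the sketch below assumes the robust features are flattened per-sample vectors.

```python
import torch.nn.functional as F

def angular_dist_loss(z_r, z_r_adv):
    """Eq. 6: 1 - |cos(z_r, z'_r)|, pulling the robust features of a clean image
    and its adversarial example onto the same direction."""
    cos = F.cosine_similarity(z_r, z_r_adv, dim=1)
    return (1.0 - cos.abs()).mean()
```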

Non-robust (NR) features. NR features are human-incomprehensible representations related to certain details of an object; they can be easily contaminated by small perturbations and thus make natural images and adversarial examples yield different classification predictions. NR features of a natural image, $z_{nr}$, should still lead to a correct prediction, while those of an adversarial example, $z'_{nr}$, should lead to an incorrect prediction. We also impose the cross-entropy loss on them. The following equation covers both R and NR features:

$\mathcal{L}_{ce}(\theta,\omega_{\{r,nr\}},\phi)=-\mathbb{E}_{z\in\{z_{r},z_{nr},z'_{r},z'_{nr}\}}\sum_{k=1}^{C}\mathds{1}[k=y_{t}]\log(C_{\phi}(z)_{k})$   (7)

where $C_{\phi}(z)$ is our target classifier. As shown in Figure 2 (a), all R and NR feature branches use gradient descent (GD) to update the model’s weights except the $z'_{nr}$ branch, as it should make the model output an incorrect prediction. Therefore, we update the weights of the non-robust encoder $\omega_{nr}$ in the direction of increasing the cross-entropy loss, similar to the gradient reversal layer (GRL) [57]. Specifically, the gradients of $\omega_{nr}$ are multiplied by $-\lambda$ during backpropagation, i.e., using $-\lambda\frac{\partial\mathcal{L}_{ce}}{\partial\omega_{nr}}$ instead of $\frac{\partial\mathcal{L}_{ce}}{\partial\omega_{nr}}$.
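The gradient reversal trick can be implemented as a custom autograd function; the sketch below follows the standard GRL formulation of [57], with variable names of our own choosing.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass,
    so the module before it is updated to *increase* the downstream loss."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# hypothetical usage on the adversarial non-robust branch:
# logits_nr_adv = classifier(grad_reverse(z_nr_adv, lam))
```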

Domain-specific (DS) features. R/NR features are all class-relevant features. Due to the misalignment of statistical distributions and classification performance between clean images and adversarial examples, there must exist some domain-specific features that represent such domain shift. The domain-specific features should capture the characteristics of natural images and adversarial examples, and thus the domain-specific features from different domains can be identified by a jointly trained discriminator $D_{\psi}$. The domain-specific encoder $E_{\omega_{ds}}$ and $D_{\psi}$ are trained adversarially [7]. The discriminator is first trained to distinguish samples from different domains:

$\mathcal{L}_{bce}(\psi)=-\mathbb{E}_{z\in\{z_{r},z_{nr},z_{ds}\}}\log(D_{\psi}(z))-\mathbb{E}_{z\in\{z'_{r},z'_{nr},z'_{ds}\}}\log(1-D_{\psi}(z))$   (8)

We then fix the discriminator and update the three encoders by optimizing the adversarial loss in Eq. 9. Specifically, the R and NR encoders are trained to fool the discriminator by minimizing the adversarial loss, as they produce domain-invariant features, while the DS encoder is trained so that its features remain classifiable by the discriminator.

$\mathcal{L}_{adv}(\omega_{\{r,nr\}})=\mathbb{E}_{z\in\{z_{r},z_{nr}\}}[\log(D_{\psi}(z))]+\mathbb{E}_{z\in\{z'_{r},z'_{nr}\}}[\log(1-D_{\psi}(z))]$
$\mathcal{L}_{adv}(\omega_{ds})=-\mathbb{E}[\log(D_{\psi}(z_{ds}))]-\mathbb{E}[\log(1-D_{\psi}(z'_{ds}))]$   (9)
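The two objectives can be written directly from Eq. 8 and Eq. 9; the sketch below assumes the discriminator $D$ ends with a sigmoid and outputs the probability that a feature comes from the natural domain.

```python
import torch

def discriminator_loss(D, feats_clean, feats_adv, eps=1e-8):
    """Eq. 8: binary cross-entropy over all disentangled features; clean -> 1, adversarial -> 0."""
    loss = 0.0
    for z in feats_clean:                     # [z_r, z_nr, z_ds]
        loss = loss - torch.log(D(z) + eps).mean()
    for z in feats_adv:                       # [z'_r, z'_nr, z'_ds]
        loss = loss - torch.log(1.0 - D(z) + eps).mean()
    return loss

def encoder_adv_loss(D, z_r, z_nr, z_r_adv, z_nr_adv, z_ds, z_ds_adv, eps=1e-8):
    """Eq. 9: the R/NR encoders try to fool the frozen discriminator, while the DS
    encoder keeps its features separable by domain."""
    fool = (torch.log(D(z_r) + eps).mean() + torch.log(D(z_nr) + eps).mean()
            + torch.log(1.0 - D(z_r_adv) + eps).mean()
            + torch.log(1.0 - D(z_nr_adv) + eps).mean())
    keep = -torch.log(D(z_ds) + eps).mean() - torch.log(1.0 - D(z_ds_adv) + eps).mean()
    return fool + keep
```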

Reconstruction. While we disentangle the features, we do not want any information loss. Therefore, we constrain the disentangled features $\widetilde{z}_{r},\widetilde{z}_{nr},\widetilde{z}_{ds}$ to be able to recover the original features $\widetilde{f}$, as shown in Figure 2 (b).

$\mathcal{L}_{res}(\theta_{rec},\omega_{\{r,nr,ds\}})=\mathbb{E}_{\widetilde{z}\in\{z,z'\}}\|R_{\theta_{rec}}([\widetilde{z}_{r},\widetilde{z}_{nr},\widetilde{z}_{ds}])-\widetilde{f}\|_{1}$   (10)

where $\theta_{rec}$ denotes the parameters of the reconstruction module $R$.
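Eq. 10 amounts to an $\ell_1$ penalty on reconstructing the intermediate features from the concatenated codes; a minimal sketch, assuming the reconstruction module R takes the concatenated vector as input:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(R, z_r, z_nr, z_ds, f):
    """Eq. 10: L1 distance between the features reconstructed from [z_r, z_nr, z_ds]
    and the original intermediate features f."""
    f_hat = R(torch.cat([z_r, z_nr, z_ds], dim=1))
    return F.l1_loss(f_hat, f)
```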

Domain diversification. The diversity of training data is of great importance to a model’s generalization ability [12]. We aim to obtain a defense model trained on PGD attacks that is general enough to defend against other attacks unseen in training. Therefore, we diversify the training domain by randomly selecting the number of steps as well as the step size of the PGD attack. With that, we expect our model to have stronger generalizability; a sketch of this per-batch randomization is given below.
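The ranges below follow Section IV-A, while the exact sampling distributions are our assumption.

```python
import random

def sample_pgd_config():
    """Randomize the PGD attack configuration for each training batch."""
    eps = random.uniform(8, 12) / 255.0     # attack budget, 8-12 (in units of 1/255)
    steps = random.choice([8, 16, 24, 32])  # number of attack steps
    alpha = random.uniform(2, 4) / 255.0    # step size, 2-4
    return eps, alpha, steps
```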

The overall procedure of our proposed model is shown in Algorithm 1.

Input: Clean image set $\mathcal{D}$, epoch number $N$, batch size $b$, learning rate $\gamma$
Output: Network parameters $\theta$, $\omega_{\{r,nr,ds\}}$, $\phi$, $\psi$, $\theta_{rec}$
for epoch $=1,\dots,N$ do
  for minibatch $\{x^{i},y^{i}\}_{i=1}^{b}$ do
    a. for each image $x^{i}$, generate its adversarial example $x^{i}_{adv}$ using Eq. 3;
    b. feed $x^{i}$ and $x^{i}_{adv}$ into the feature extractor and the three encoders to get the disentangled features using Eq. 4 and 5;
    c. optimize $\omega_{\{r,nr,ds\}}$ with the pair-wise $\mathcal{L}_{kl}$ losses of the disentangled features, as shown in Fig. 2 (c);
    d. optimize $\omega_{\{r,nr\}}$ with Eq. 7 for classification ability;
    e. optimize $\theta$, $\omega_{r}$ with Eq. 6 to align the robust features from $x^{i}$ and $x^{i}_{adv}$;
    f. optimize $\theta$, $\omega_{\{r,nr,ds\}}$, $\phi$, $\psi$ with Eq. 8 and Eq. 9 to train the discriminator and accomplish domain disentanglement;
    g. optimize $\omega_{\{r,nr,ds\}}$, $\theta_{rec}$ for reconstruction.
return $\theta$, $\omega_{\{r,nr,ds\}}$, $\phi$, $\psi$, $\theta_{rec}$
Algorithm 1 Robust Feature Disentanglement.

IV Experimental Results

IV-A Experimental settings

Dataset The performance of our model is evaluated on four popular datasets: CIFAR-10, SVHN, CIFAR-100 and Tiny-ImageNet. CIFAR-10 contains 60k images of size $32\times 32$ grouped into 10 classes, with 50k images for training and 10k for testing. The SVHN (Street View House Numbers) dataset contains 73,257 training images and 26,032 testing images belonging to 10 classes. CIFAR-100 is similar to CIFAR-10 except that it consists of 100 classes with 600 images per class. Tiny-ImageNet is a subset of ImageNet containing 200 classes; each class has 500 training images and 50 validation images, and the image resolution is $64\times 64$.

Comparison methods We compare the performance of our model with the following methods: (1) the undefended model (UM), which is trained on the clean image set only; (2) adversarial training (AT) [6], where the model is trained on adversarial examples generated by the PGD attack; (3) TRADES [42] and (4) MART [47], two defense methods with different regularization terms; (5) LBGAT [58], which improves robustness through boundary guidance from an undefended model; (6) DRRDN [7] and (7) AFD-WGAN [8], which use feature disentanglement and desensitization to defend against adversarial attacks; and (8) IAD [40], which utilizes knowledge distillation to transfer robustness. For IAD, we adopt the results with AT as the teacher model. We use the FGSM [2], PGD [6] and CW [19] attacks to evaluate the models. We also test the models against black-box attacks and AutoAttack (AA) [21], a parameter-free attack that ensembles four diverse attacks.

Implementation details We use a residual network as the backbone of our model. The feature extractor $G_{\theta}$ consists of two residual blocks, each containing five basic residual blocks with two $3\times 3$ convolutional layers. Each encoder is a CNN with one residual block, while the reconstruction module is composed of transposed convolutional layers. A fully connected layer is used as the classifier, and a two-layer fully connected network as the discriminator. For all datasets, the initial learning rate is 0.1 for the adversarial training of $G_{\theta}$, $E_{\omega_{r}}$ and $C_{\phi}$, and the other components have a smaller learning rate. We use the SGD optimizer and schedule the learning rate to decay at epochs 100, 105 and 110. The batch size is 128 and we use early stopping as suggested by [59]. To increase the domain diversity, the attack budget $\epsilon$ is randomly selected between 8 and 12 for each batch, the number of steps is randomly sampled from $\{8,16,24,32\}$, and the step size ranges from 2 to 4.
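A minimal sketch of the optimizer and learning-rate schedule described above; the momentum and weight-decay values are common defaults we assume, not reported settings.

```python
import torch

def build_optimizer(params, lr=0.1):
    """SGD with learning-rate decay at epochs 100, 105 and 110, as in the implementation details."""
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)  # momentum/decay assumed
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 105, 110], gamma=0.1)
    return opt, sched
```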

IV-B Adversarial robustness against different attacks

We evaluate the accuracy of our model against various attacks. We conducted five runs of our model across all settings and report the mean and standard deviation over these runs. In Table I, we show the results on CIFAR-10. The clean-image accuracy of our model in the table comes directly from the robust features through the classifier. With the help of the well-trained discriminator, this accuracy can be improved to almost the same level as the undefended model, i.e., around 96%; we describe this in Section IV-E. The rest of Table I reports accuracy against both white-box and black-box attacks. In the white-box setting, the adversary has full access to the target model, including the architecture and parameters. Our model outperforms the comparison methods under most white-box attacks. In particular, our model shows the best performance against AA, which is known as the most challenging attack to defend. In the black-box setting, no information about the target model is known. We trained a surrogate model with only clean images and generated adversarial examples from this model to attack our model. As shown in Table I, our model still keeps a high accuracy against black-box attacks. To gain a more comprehensive understanding of the model’s performance under different adversarial scenarios during training, we expanded our evaluation by training the model with attacks other than PGD. We specifically consider the CW attack and an adaptive version of the PGD attack, i.e., the difference of logits ratio (DLR) [21]. The results show that our proposed model performs quite well when trained under different attacks.

TABLE I: Evaluation results on CIFAR-10 under different attacks. “BB” denotes the black-box attack. The best accuracy is highlighted. We also present the results of our model when training with different attacks, namely CW and the difference of logits ratio (DLR).
Methods clean FGSM PGD CW AA PGD(BB)
UM 96.20 31.39 0 0 0 -
AT [6] 86.19 62.42 45.57 46.26 44.04 79.55
TRADES [42] 83.50 63.68 52.80 50.90 52.70 81.60
MART [47] 82.16 63.91 52.67 49.44 51.20 82.60
LBGAT [58] 88.22 - 54.66 54.29 52.23 -
DRRDN [7] 85.76 62.81 52.32 51.88 - 82.65
AFD-WGAN [8] 85.95 - 59.38 - 37.33 84.74
IAD [40] 85.09 66.54 55.45 54.63 52.29 -
Ours 83.52±0.12 66.81±0.79 57.97±0.26 55.39±0.25 53.64±0.22 83.33±0.21
Ours-CW 80.17±0.11 65.51±0.35 57.79±0.30 56.02±0.18 51.60±0.11 80.03±0.15
Ours-DLR [21] 87.20±0.12 69.61±0.27 63.28±0.27 50.98±0.29 48.30±0.36 87.03±0.08

The results on the SVHN dataset are shown in Table II. For SVHN, the UM model has an accuracy lower than 1% against the PGD and CW attacks. The ML method is the best among the comparison models, achieving over 51% for both attacks. Our model outperforms the comparison methods on all attacks by a large margin, reaching 55.20% and 54.09% for the PGD and CW attacks, respectively.

TABLE II: Evaluation results on SVHN, under different widely used attacks.

Methods clean FGSM PGD CW
UM 96.36 46.33 0.33 0.37
AT 91.55 67.13 45.64 47.14
ML 83.95 70.28 51.91 51.25
TRADES 91.16 69.85 50.90 50.85
MART 91.16 67.31 48.72 50.52
Ours 91.01±0.73 72.41±1.10 55.20±0.52 54.09±0.66

The results on CIFAR-100 and Tiny-ImageNet are shown in Table III. For CIFAR-100, the undefended model (UM) misclassifies almost all the adversarial examples generated by strong attacks such as PGD and AA, with an accuracy of only 0.01% against PGD and 0.02% against AA. AT increases the accuracy against the PGD attack to 18.54%, while methods such as TRADES and AFD-WGAN further improve this number. Our model outperforms the other methods against the PGD attack with an accuracy of 31.78%. For Tiny-ImageNet, it is hard for the UM model to make correct classifications under adversarial attacks; the best comparison methods improve the accuracy against the PGD and AA attacks to just over 10%, whereas our method raises these numbers to 24.78% and 20.64%, respectively. Moreover, although our model is trained on $\ell_{\infty}$-norm attacks, it surpasses the other comparison methods on $\ell_{2}$-norm attacks such as CW$_{2}$ by a large margin, showing the superior generalization ability of our model.

TABLE III: Evaluation results on CIFAR-100 and Tiny-ImageNet datasets under different attacks. The best accuracy in each dataset is highlighted.

Dataset Methods clean PGD CW2 AA
CIFAR-100 UM 76.76 0.01 0.52 0.02
CIFAR-100 AT 56.49 18.54 17.71 18.30
CIFAR-100 TRADES 60.32 25.11 20.52 21.10
CIFAR-100 LBGAT 60.64 30.56 - 29.33
CIFAR-100 AFD-WGAN 58.87 22.35 25.33 18.00
CIFAR-100 Ours 56.64±0.57 31.78±0.46 24.31±0.10 27.93±0.21
TI UM 64.62 0.07 0 0
TI AT 43.80 12.62 4.90 9.48
TI TRADES 37.70 13.26 4.11 12.57
TI AFD-WGAN 47.70 11.49 5.90 9.45
TI Ours 46.54±0.49 24.78±0.18 28.29±0.32 20.64±0.16

IV-C Evaluation of additional network architectures on Imagenette.

In this section, we aim to evaluate the scalability of our method to larger images and its ability to generalize across different neural network architectures. To this end, we perform experiments on the Imagenette dataset, which contains larger images, using three neural network architectures: Inception v4 (Inc-v4) [60], Inception ResNet v2 (IncRes-v2) [60], and ResNet v2-152 (Res-v2-152) [61]. Imagenette is a subset of ImageNet encompassing ten classes: tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, and parachute. The images of each class in Imagenette are identical to those in the original ImageNet dataset, with an average image size of $469\times 387\times 3$ pixels.

For the training on the Imagenette dataset, we used a learning rate of 0.01 following [18]. The parameter numbers for Inc-v4, IncRes-v2, and Res-v2-152 are approximately 43M, 56M, and 60M, respectively. Therefore, for Inc-v4, the batch size remained at 128, whereas for IncRes-v2 and Res-v2-152, the batch sizes were reduced to 64 and 40, respectively, due to GPU limitations. All the comparison methods for each architecture use the same batch size for a fair comparison.

The results for the Imagenette dataset are presented in Table IV. Our model outperforms the other methods under all architectures. For Inc-v4 and IncRes-v2, our model surpasses the other methods by a large margin: for Inc-v4 our method improves the accuracy by 7.33% (from 30.68% to 38.01%), and for IncRes-v2 the accuracy increases by 5.57% (from 36.92% to 42.49%).

TABLE IV: Evaluation and comparison of other neural network architectures on Imagenette dataset (without pretraining).
Models Inc-v4 IncRes-v2 (bs=64) Res-v2-152 (bs=40)
clean PGD clean PGD clean PGD
AT 80.71 21.76 87.69 24.38 87.95 27.34
MART 28.31 21.94 52.64 35.44 64.05 42.75
TRADES 68.82 30.68 74.98 36.92 70.19 32.48
Ours 66.93±0.95 38.01±0.48 76.71±0.24 42.49±0.27 73.88±0.21 43.36±0.19

IV-D Evaluation of k-Nearest Neighbor (k-NN) classifier

TABLE V: Evaluation and comparison between k-NN and softmax classifiers (k-NN/softmax).
Models CIFAR-10 SVHN Tiny ImageNet
clean PGD clean PGD clean PGD
AT 87.1 / 86.2 47.5 / 45.6 91.5 / 91.6 45.8 / 45.6 36.6 / 42.3 20.2 / 19.6
ALP 89.6 / 89.8 48.9 / 48.5 91.4 / 91.3 52.0 / 52.2 35.2 / 41.5 20.3 / 20.0
ML 86.3 / 86.2 51.7 / 51.6 84.3 / 84.0 52.0 / 51.9 34.0 / 40.6 20.7 / 20.7
Ours 83.0 / 83.3 58.1 / 58.0 92.6 / 90.4 55.1 / 55.0 42.0 / 46.2 22.6 / 24.6

We use the k-Nearest Neighbor (k-NN) method as a substitute for the original (softmax) classifier in the model, following [36]. In our experiments, we use the same settings as [36], where $k$ is set to 50 and the features from the penultimate layer are used to build the k-NN classifier.

The results for both the k-NN classifier and the original classifier on three datasets, CIFAR-10, SVHN and Tiny-ImageNet, are shown in Table V. The comparison methods include AT [6], ALP [62] and ML [36]. Among the four approaches, our method obtains consistently higher accuracy against adversarial attacks on all datasets. Besides, although k-NN is one of the simplest classifiers, it achieves results similar to the original classifier. These quantitative results, coupled with the visualizations in Section IV-I, demonstrate that the features obtained by our model have a better distribution in the latent space. That is, the adversarial robustness of our model comes from the latent representation, not the classifier itself.
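A sketch of this k-NN evaluation on penultimate-layer features (feature extraction itself is omitted; the helper name is ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=50):
    """Replace the softmax head with a k-NN classifier (k=50) built on penultimate-layer features."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_feats, train_labels)
    return (knn.predict(test_feats) == np.asarray(test_labels)).mean()
```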

IV-E Adversarial example detection and a simple strategy to improve accuracy on clean image

As a side product, the discriminator and domain-specific features have a strong capability to distinguish clean images from adversarial examples, which can be used for adversarial example detection. Built upon that, we propose to use the naturally trained model for clean images and our model for adversarial examples, to improve the accuracy on clean images. Our model achieves a 99.86% true positive rate and a 100% true negative rate in adversarial example detection, which yields a significant improvement on clean images (from 83.52% to 96.20%) with a negligible drop on adversarial examples (from 57.97% to 57.88%). A sketch of this routing strategy is given below.
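For simplicity the sketch evaluates both classifiers on the whole batch, and it assumes the discriminator outputs the probability that an input is clean (threshold and names are ours).

```python
import torch

@torch.no_grad()
def route_and_classify(x, extractor, E_ds, D, natural_model, robust_model, thresh=0.5):
    """Score domain-specific features with the discriminator, then send images judged clean
    to the naturally trained classifier and the rest to the robust classifier."""
    z_ds = E_ds(extractor(x))
    is_clean = D(z_ds).squeeze(-1) > thresh            # assumed: D(z) = P(clean)
    logits = torch.where(is_clean.unsqueeze(1), natural_model(x), robust_model(x))
    return logits.argmax(dim=1)
```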

IV-F Visualization of Domain Specific Features

We visualize the distribution of the domain-specific (DS) features of 500 randomly selected samples and their adversarial examples using t-SNE in Figure 3. The DS features of natural images and adversarial examples are quite discriminable even in this 2-dimensional space; while a few samples are mis-clustered, we believe they would be more easily classified in the original high-dimensional space.

Figure 3: Domain specific features distributions for natural images (nat) and their adversarial examples (adv).

We visualize the domain-specific (DS) features, a 640-dimensional vector obtained by average pooling the 640-channel feature maps output by the domain-specific encoder, for natural images and their adversarial examples. Figure 4 illustrates the DS feature histograms, which show the intensity distributions of the features, for some sampled natural images and their adversarial examples. From the figure, we can see that the intensity distributions for natural images and adversarial examples are quite different, which means the DS encoder is able to discriminate the features of natural images from those of adversarial examples.

Figure 4: Illustrations of Domain specific features distribution for natural images and their adversarial examples (AE).

IV-G Visualization of robust and non-robust Features

In Figure 5, we visualize the distributions of the robust and non-robust features of natural images and their corresponding adversarial examples. The natural images are from the same class (frog) and the AEs are generated with the PGD attack. As shown in Figure 5, our method successfully disentangles the robust and non-robust features. The robust features extracted from both natural images and adversarial examples are positioned within the same region; this alignment is due to the explicit pulling together achieved by the angular distance loss in Eq. 6.

On the other hand, the non-robust features extracted from natural images and adversarial examples are visibly separated. These features are the components modified by the adversarial perturbations to decrease (natural images) and increase (adversarial examples) the classification cross-entropy loss in Eq. 7, and they are effectively disentangled by our proposed method. The visualizations in Figure 5 provide valuable insights into the disentanglement process, demonstrating the efficacy of our approach in capturing both robust and non-robust features for adversarial examples.

Figure 5: Robust and non-robust feature distributions for natural images (nat) and adversarial examples (adv).

IV-H Different attack iterations

Figure 6: The accuracy under different attack iterations

We evaluate the model’s robustness under different numbers of attack iterations, from 10 to 100. Figure 6 shows that our model consistently outperforms standard AT and is not sensitive to the number of attack iterations, although the accuracy declines monotonically as the iteration count increases; for example, our model achieves 58.38%, 57.97%, and 57.82% under 10, 20, and 100 iterations of the PGD attack, respectively.

IV-I Visualization in feature space

We visualize the representation distribution of samples in feature space using t-SNE. Specifically, we show the representation distribution for all the classes and for the clean images and adversarial examples in a specific class [36], in Figure 7 and Figure 8, respectively.

In Figure 7, the triangle points with different colors represent the clean images of different classes, while the red round points are the AEs from a specific class (horse and truck) under the PGD attack. “UM” denotes the undefended model and shows how adversarial attacks behave on a model naturally trained on clean images. The PGD attack is able to drop its accuracy to 0, which is also visualized in the first column of Figure 7, where all the AEs lie far from their original class and fall into the distributions of other classes. “AT” shows how adversarial attacks behave on a standard adversarial training model: many of the adversarial examples are pulled back to their original class, but a lot of them are still mixed with other classes. Our model pulls most of the adversarial examples back to their original class. We observe that our model narrows the gaps between classes for clean images, which may be the reason its classification performance on clean images drops; however, based on the results in Table I, this barely affects the performance on adversarial examples. For clean images, as discussed previously, our model can keep the same performance as the naturally trained model by using a simple adversarial example detection strategy, thanks to the extraordinary performance of the discriminator and the domain-specific features obtained from our model.

Figure 7: t-SNE plots for illustrating the representation distributions for all the classes in feature space.

In Figure 8, the flesh-colored triangles are clean images of a specific class, and the round points are adversarial examples crafted from the clean images of the same class. The colors represent the different predictions for those adversarial examples. “UM” has no ability to defend against the attacks, and all adversarial examples are falsely predicted; their features lie away from the distribution of natural images and cluster by prediction. For “AT”, some of the adversarial examples can be correctly classified; although the attacks are pulled closer to the location of the natural images, the successful attacks still lie away from them and aggregate by class. For our model, the natural images and adversarial examples share almost the same distribution and there is no distinct grouping by category, indicating that the robust encoder of our model does extract the features invariant between natural images and adversarial examples, that is, the robust features.

Figure 8: t-SNE plots for illustrating the representation distributions of the clean images and adversarial examples for a specific class in feature space.

IV-J Discussion

In [63], the authors provide constructive insights about gradient obfuscation in adversarial robustness. Gradient obfuscation is a form of gradient masking that leads to a false sense of security in defenses against adversarial examples. Based on the analysis in that paper, we conclude that the robustness of our model does not come from gradient obfuscation, for the following three reasons: (1) We report the performance of the model under attacks with different numbers of iterations in Figure 6, which shows monotonically declining accuracy as the attack iterations increase. (2) In Table I, we show that black-box attacks (83.04%) have a lower success rate (higher accuracy) than white-box attacks (57.97%). (3) We evaluate our model against the gradient-free attack SPSA [64] and obtain an accuracy of 80.21%, which is higher than under gradient-based attacks (57.97% for PGD).

V Conclusion

In this paper, we aimed to explore robust features, which are not affected by adversarial perturbations, to improve the model’s adversarial robustness as well as its generalizability against other unseen attacks. To this end, we proposed a novel three-branch feature disentanglement model to segregate the robust features from non-robust features and domain-specific features. Extensive experiments on four widely used datasets, i.e., CIFAR-10, SVHN, CIFAR-100 and Tiny-ImageNet, under diverse attacks show that the robust features obtained from our model improve the model’s adversarial robustness compared to state-of-the-art approaches. Moreover, the domain discriminator and domain-specific features from our model, as side products, are able to identify clean images and adversarial examples almost perfectly without additional computational cost. With that, a simple strategy can be applied to specify different classifiers for clean images and adversarial examples, such that our model sacrifices almost no accuracy on clean images.

References

  • [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in ICLR, 2014.
  • [2] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
  • [3] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, “Adversarially robust generalization requires more data,” in NeurIPS, vol. 31, 2018.
  • [4] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in NeurIPS, vol. 32, 2019.
  • [5] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017.
  • [6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR, 2018.
  • [7] S. Yang, T. Guo, Y. Wang, and C. Xu, “Adversarial robustness through disentangled representations,” in AAAI, vol. 35, 2021.
  • [8] P. Bashivan, R. Bayat, A. Ibrahim, K. Ahuja, M. Faramarzi, T. Laleh, B. Richards, and I. Rish, “Adversarial feature desensitization,” in NeurIPS, vol. 34, 2021.
  • [9] K. Yang, T. Zhou, Y. Zhang, X. Tian, and D. Tao, “Class-disentanglement and applications in adversarial detection and defense,” in NeurIPS, vol. 34, 2021.
  • [10] D. Zhou, T. Liu, B. Han, N. Wang, C. Peng, and X. Gao, “Towards defending against adversarial examples via attack-invariant features,” in ICML, 2021.
  • [11] J. Kim, B.-K. Lee, and Y. M. Ro, “Distilling robust and non-robust features in adversarial examples by information bottleneck,” in NeurIPS, vol. 34, 2021.
  • [12] J. Wang, C. Lan, C. Liu, Y. Ouyang, W. Zeng, and T. Qin, “Generalizing to unseen domains: A survey on domain generalization,” arXiv preprint arXiv:2103.03097, 2021.
  • [13] P. Chattopadhyay, Y. Balaji, and J. Hoffman, “Learning to balance specificity and invariance for in and out of domain generalization,” in ECCV, 2020.
  • [14] X. Peng, Z. Huang, X. Sun, and K. Saenko, “Domain agnostic learning with disentangled representations,” in ICML, 2019.
  • [15] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. Change Loy, “Domain generalization: A survey,” arXiv e-prints, pp. arXiv–2103, 2021.
  • [16] M. Ilse, J. M. Tomczak, C. Louizos, and M. Welling, “Diva: Domain invariant variational autoencoders,” in Medical Imaging with Deep Learning, 2020.
  • [17] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” ICLR Workshop, 2017.
  • [18] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in CVPR, 2018.
  • [19] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE symposium on security and privacy, 2017.
  • [20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in CVPR, 2016.
  • [21] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in ICML, 2020.
  • [22] ——, “Minimally distorted adversarial examples with a fast adaptive boundary attack,” in ICML, 2020.
  • [23] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” in ECCV, 2020.
  • [24] A. Fawzi and P. Frossard, “Manitest: Are classifiers really invariant?” in BMVC, 2015.
  • [25] I. Goodfellow, H. Lee, Q. Le, A. Saxe, and A. Ng, “Measuring invariances in deep networks,” in NeurIPS, 2009.
  • [26] C. Kanbak, S.-M. Moosavi-Dezfooli, and P. Frossard, “Geometric robustness of deep networks: analysis and improvement,” in CVPR, 2018.
  • [27] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in CVPR, 2018.
  • [28] L. Huang, C. Gao, Y. Zhou, C. Xie, A. L. Yuille, C. Zou, and N. Liu, “Universal physical camouflage attacks on object detectors,” in CVPR, 2020.
  • [29] D. Wang, C. Li, S. Wen, Q.-L. Han, S. Nepal, X. Zhang, and Y. Xiang, “Daedalus: Breaking nonmaximum suppression in object detection via adversarial examples,” IEEE Transactions on Cybernetics, vol. 52, no. 8, pp. 7427–7440, 2022.
  • [30] E. Yang, T. Liu, C. Deng, and D. Tao, “Adversarial examples for hamming space search,” IEEE Transactions on Cybernetics, vol. 50, no. 4, pp. 1473–1484, 2020.
  • [31] Z. Wang, X. Shu, Y. Wang, Y. Feng, L. Zhang, and Z. Yi, “A feature space-restricted attention attack on medical deep learning systems,” IEEE Transactions on Cybernetics, pp. 1–13, 2022.
  • [32] J. Tian, B. Wang, Z. Wang, K. Cao, J. Li, and M. Ozay, “Joint adversarial example and false data injection attacks for state estimation in power systems,” IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 13 699–13 713, 2022.
  • [33] P. Liu, Y. Lin, Z. Meng, L. Lu, W. Deng, J. T. Zhou, and Y. Yang, “Point adversarial self-mining: A simple method for facial expression recognition,” IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 12 649–12 660, 2021.
  • [34] C. Shi, X. Xu, S. Ji, K. Bu, J. Chen, R. Beyah, and T. Wang, “Adversarial captchas,” IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 6095–6108, 2022.
  • [35] H. Kannan, A. Kurakin, and I. J. Goodfellow, “Adversarial logit pairing,” arXiv:1803.06373, 2018.
  • [36] C. Mao, Z. Zhong, J. Yang, C. Vondrick, and B. Ray, “Metric learning for adversarial robustness,” in NeurIPS, 2019.
  • [37] Y. Zhong and W. Deng, “Adversarial learning with margin-based triplet embedding regularization,” in ICCV, 2019.
  • [38] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, “Adversarial training for free!” in NeurIPS, 2019.
  • [39] M. Goldblum, L. Fowl, S. Feizi, and T. Goldstein, “Adversarially robust distillation,” in AAAI, vol. 34, 2020.
  • [40] J. Zhu, J. Yao, B. Han, J. Zhang, T. Liu, G. Niu, J. Zhou, J. Xu, and H. Yang, “Reliable adversarial distillation with unreliable teachers,” in ICLR, 2021.
  • [41] H. Wang, Y. Deng, S. Yoo, H. Ling, and Y. Lin, “Agkd-bml: Defense against adversarial attack by attention guided knowledge distillation and bi-directional metric learning,” in ICCV, 2021.
  • [42] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in ICML, 2019.
  • [43] H. Zhang and J. Wang, “Defense against adversarial attacks using feature scattering-based adversarial training,” in NeurIPS, 2019.
  • [44] Z. Wang, T. Jian, A. Masoomi, S. Ioannidis, and J. Dy, “Revisiting hilbert-schmidt information bottleneck for adversarial robustness,” in NeurIPS, vol. 34, 2021.
  • [45] J. Wang and H. Zhang, “Bilateral adversarial training: Towards fast training of more robust models against adversarial attacks,” in ICCV, 2019.
  • [46] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He, “Feature denoising for improving adversarial robustness,” in CVPR, 2019.
  • [47] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu, “Improving adversarial robustness requires revisiting misclassified examples,” in ICLR, 2019.
  • [48] H. Shen, S. Chen, R. Wang, and X. Wang, “Adversarial learning with cost-sensitive classes,” IEEE Transactions on Cybernetics, pp. 1–12, 2022.
  • [49] H. Li, G. Li, and Y. Yu, “Rosa: Robust salient object detection against adversarial attacks,” IEEE Transactions on Cybernetics, vol. 50, no. 11, pp. 4835–4847, 2020.
  • [50] F. Qiao, L. Zhao, and X. Peng, “Learning to learn single domain generalization,” in CVPR, 2020.
  • [51] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein auto-encoders,” arXiv preprint arXiv:1711.01558, 2017.
  • [52] S. Lee, S. Cho, and S. Im, “Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation,” in CVPR, 2021.
  • [53] Y. Chen, H. Zhang, Y. Wang, W. Peng, W. Zhang, Q. M. J. Wu, and Y. Yang, “D-bin: A generalized disentangling batch instance normalization for domain adaptation,” IEEE Transactions on Cybernetics, pp. 1–13, 2021.
  • [54] G. Wang, H. Han, S. Shan, and X. Chen, “Cross-domain face presentation attack detection via multi-domain disentangled representation learning,” in CVPR, 2020.
  • [55] H.-Y. Lee, H.-Y. Tseng, Q. Mao, J.-B. Huang, Y.-D. Lu, M. Singh, and M.-H. Yang, “Drit++: Diverse image-to-image translation via disentangled representations,” International Journal of Computer Vision, vol. 128, no. 10, pp. 2402–2417, 2020.
  • [56] Z. Song, O. Koyejo, and J. Zhang, “Toward a controllable disentanglement network,” IEEE Transactions on Cybernetics, vol. 52, no. 4, pp. 2491–2504, 2022.
  • [57] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [58] J. Cui, S. Liu, L. Wang, and J. Jia, “Learnable boundary guided adversarial training,” in CVPR, 2021.
  • [59] L. Rice, E. Wong, and Z. Kolter, “Overfitting in adversarially robust deep learning,” in ICML, 2020.
  • [60] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, vol. 31, no. 1, 2017.
  • [61] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016.
  • [62] H. Kannan, A. Kurakin, and I. Goodfellow, “Adversarial logit pairing,” arXiv:1803.06373, 2018.
  • [63] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in ICML, 2018.
  • [64] J. Uesato, B. O’donoghue, P. Kohli, and A. Oord, “Adversarial risk and the dangers of evaluating against weak attacks,” in ICML, 2018.