
CAP-GAN: Towards Adversarial Robustness with Cycle-consistent Attentional Purification

Mingu Kang School of Computing
KAIST
Daejeon, South Korea
[email protected]
   Trung Quang Tran School of Computing
KAIST
Daejeon, South Korea
[email protected]
   Seungju Cho School of Computing
KAIST
Daejeon, South Korea
[email protected]
   Daeyoung Kim School of Computing
KAIST
Daejeon, South Korea
[email protected]
Abstract

Adversarial attacks aim to fool a target classifier with imperceptible perturbations. Adversarial examples, which are carefully crafted with a malicious purpose, can lead to erroneous predictions, resulting in catastrophic accidents. To mitigate the effect of adversarial attacks, we propose a novel purification model called CAP-GAN. CAP-GAN considers the idea of pixel-level and feature-level consistency to achieve reasonable purification under cycle-consistent learning. Specifically, we utilize a guided attention module and knowledge distillation to convey meaningful information to the purification model. Once the model is fully trained, inputs are projected into the purification model and transformed into clean-like images. We vary the capacity of the adversary to demonstrate robustness against various types of attack strategies. On the CIFAR-10 dataset, CAP-GAN outperforms other pre-processing based defenses under both black-box and white-box settings.

I Introduction

With the advancement of Deep Neural Networks (DNNs), their use in various applications is in full swing. Among them, the image classification task aims to train DNNs to classify given images correctly. DNNs have shown outstanding success in image classification and are being studied with a focus on improving accuracy [1, 3]. However, DNNs have been proven vulnerable to imperceptible noise carefully crafted by an adversary, known as adversarial attacks [4]. Such a weakness raises security concerns, since a machine with this flaw cannot entirely substitute for a human. One of the main concerns is that adversarial examples force DNNs to behave unexpectedly, even though adversarial examples look almost identical to the original ones.

To combat adversarial attacks, two kinds of defense mechanisms have mainly been proposed: adversarial training [4, 5, 6], where a target model is directly re-trained with adversarial examples to obtain adversarial robustness, and pre-processing, where another model is designed to transform inputs from malicious to clean-like inputs. Although adversarial training shows outstanding robustness across various types of attacks, it generally requires extensive training resources and significantly deteriorates generalization ability. Unlike adversarial training, pre-processing comes at a relatively low cost and shows better generalization. Thus, pre-processing is worth exploring further, as it is more applicable than adversarial training. Various pre-processing defenses have been proposed to mitigate the effect of adversarial attacks [7, 8, 9, 10, 11, 12]; they devise a denoising function to remove the adversarial perturbation from inputs.

Figure 1: The first row shows t-SNE visualizations of CIFAR-10 data $x^{i}\in\mathbb{R}^{H\times W\times C}$ and the corresponding FGSM [4] adversarial data with a large perturbation magnitude $\epsilon=16$, $x^{i}_{adv}\in\mathbb{R}^{H\times W\times C}$ (left and right on the first row, respectively), which can be viewed as differences in the pixel-level distribution. The second row shows differences in the feature-level distribution extracted by $f$: $f(x^{i})$ and $f(x^{i}_{adv})$.

Intuitively, a way to mitigate the effect of adversarial attacks is to get rid of the adversarial perturbation in a pixel-level approach. Finding a perturbation for each sample and disentangling it is plausible at first glance. Unfortunately, it is impossible to find the optimal distribution of the perturbation, which can cover all possible attacks, due to the high complexity of DNNs. Considering one of the main spirits of adversarial examples, $x_{adv}=x+\delta$, where the perturbation $\delta$ tends to be imperceptible so as to satisfy the alignment of human perception within a small perturbation budget, pixel-level purification might not be suitable to mitigate the effect of adversarial attacks. Even if we can achieve proper pixel-level alignment, it would not bring feature-level alignment [8, 26].

Specifically, given original data $x$ and corresponding adversarial data $x_{adv}$, their logits can be presented as $f(x)$ and $f(x_{adv})$ respectively, where $f$ is a pre-trained network. We call the logit a feature-level vector of the input extracted from $f$. We visualize the pixel-level and feature-level distributions to compare how an adversarial attack affects both spaces. In Figure 1, when it comes to the visual domain shift $x,x_{adv}\in\mathbb{R}^{H\times W\times C}$, it is hard to observe discernible changes that stand out, whereas the feature-level space $f(x),f(x_{adv})\in\mathbb{R}^{K}$ has been significantly hurt by the adversarial perturbation, where $K$ is the number of classes. This result indicates that adversarial examples mainly shift inputs toward their counterparts in the high-level representation space. The more sophisticated the attack, the stronger this tendency becomes.
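
This observation can be reproduced with a short measurement, sketched below in PyTorch under the assumption of a generic pre-trained CIFAR-10 classifier `f` and a batch `(x, y)` scaled to [0, 1]; the helper names are ours, not the paper's.

```python
# Minimal sketch (not the authors' code): measure how a single FGSM perturbation
# moves an image in pixel space versus in the logit (feature-level) space of a
# pre-trained classifier f. The classifier and data batch are placeholders.
import torch
import torch.nn.functional as F

def fgsm(f, x, y, eps):
    """One-step FGSM in the l_inf ball of radius eps (inputs assumed in [0, 1])."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(f(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

@torch.no_grad()
def compare_shifts(f, x, x_adv):
    # Pixel-level shift: average L2 distance per image (small for a small eps).
    pixel_shift = (x_adv - x).flatten(1).norm(dim=1).mean()
    # Feature-level shift: KL divergence between class distributions f(x) and f(x_adv).
    p_clean = F.softmax(f(x), dim=1)
    logp_adv = F.log_softmax(f(x_adv), dim=1)
    feat_shift = F.kl_div(logp_adv, p_clean, reduction="batchmean")
    return pixel_shift.item(), feat_shift.item()

# Usage (assuming f is a trained CIFAR-10 classifier and (x, y) a batch in [0, 1]):
# x_adv = fgsm(f, x, y, eps=16 / 255)
# print(compare_shifts(f, x, x_adv))
```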

Despite this characteristic of adversarial attacks, previous defenses only considered the pixel-level distribution [7, 13, 11], with the implicit hope that aligning the pixel-level distribution brings feature-level alignment; that is, given a defense mechanism $\mathcal{T}$, the assumption is that $f(\mathcal{T}(x_{adv}))\approx f(x)$ if $\mathcal{T}(x_{adv})\approx x$. Rather than relying on this implicit hope, recent defenses have started adopting feature-level objective functions to ensure that generated images are tailored for the end task, e.g., classification [12, 9, 8]. Equipped with these perspectives, we envision a purification model $\mathcal{T}$ that eliminates adversarial perturbations by leveraging both pixel-level and feature-level approaches.

We propose a novel way to mitigate the effect of the adversarial perturbation via input transformation. Given the pre-trained model $f$, we train a purification model to improve the robustness of $f$ for image classification. We utilize the GAN [15] training procedure to purify adversarial inputs. Concretely, CAP-GAN adopts the CycleGAN [16] training scheme, where a forward pass of the model has cycle-consistency to satisfy pixel-level consistency. For feature-level consistency, we introduce explicit terms into $\mathcal{T}$ by distilling the knowledge of the pre-trained model $f$. To guide the model to focus on meaningful regions, we use an attention module inspired by previous research [17, 18]. We found that introducing feature-level terms considerably enhances robustness compared to a model that only considers pixel-level alignment. Moreover, the model can be further improved under cycle-consistent learning.

Instead of a simple dataset like MNIST, we evaluate CAP-GAN on CIFAR-10 across various types of adversarial attacks. In addition, we carry out an adaptive attack in which the adversary exploits an internal weakness of the proposed method. According to the experimental results, our model outperforms other pre-processing based defenses under both black-box and white-box settings. To summarize, our main contributions are as follows:

  • To resist adversarial attacks, we propose a new pre-processing method, called CAP-GAN, considering both pixel-level and feature-level alignment under cycle-consistent learning.

  • In particular, we apply a guided attention module and knowledge distillation to achieve feature-level consistency.

  • Our model outperforms other pre-processing methods and achieves promising robustness against various attack strategies under both black-box and white-box settings.

II Related Work

II-A Adversarial Attacks

Adversarial examples can be crafted by adjusting the loss values of given images. After adversarial examples were discovered by [5], many researchers have developed various algorithms for crafting them [4, 6, 19, 20]. The Fast Gradient Sign Method (FGSM) [4] and the Projected Gradient Descent (PGD) algorithm [6] are typically used for evaluating the robustness of a model. Both algorithms use the gradient of the loss function with respect to the input images, following the formula below:

$X_{0}=X,\quad X_{i+1}=\Pi\big(X_{i}+\alpha\cdot\mathrm{sign}(\nabla_{X}\,l(X_{i},y_{true}))\big)$

Here $\alpha$ is a step-size hyper-parameter, and $\Pi$ is a projection function for adjusting the perturbation size. This formula reduces to the FGSM algorithm when a single iteration is used, while PGD uses multiple iterations. The Carlini and Wagner (CW) attack approaches this problem as a constrained optimization problem and finds adversarial examples with the Lagrangian method, creating more precise and powerful adversarial examples [19]. Basically, these algorithms require the gradient of an objective function with respect to the inputs. Besides gradient-based attacks, gradient-free attacks such as SPSA [20] and Square [21], and gradient-approximation attacks such as Backward Pass Differentiable Approximation (BPDA) [22], have been proposed. Although these methods do not rely on the local gradient, they have circumvented many defenses. These methods can be more potent than gradient-based attacks in terms of applicability.
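
As a concrete illustration of the update rule above, the following is a minimal PGD sketch in PyTorch; the signature, the random start, and the $\epsilon$/step-size values are assumptions, and with a single step and $\alpha=\epsilon$ it reduces to FGSM.

```python
# Minimal PGD sketch following the update rule above (l_inf ball, inputs in [0, 1]).
# With num_steps=1 and alpha=eps it reduces to FGSM. Model/loader names are placeholders.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, num_steps=7, random_start=True):
    x_adv = x.clone().detach()
    if random_start:
        # Start from a random point inside the eps-ball.
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Projection Pi: clip back into the eps-ball around x, then into [0, 1].
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```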

II-B Defense Mechanisms

Adversarial training Adversarial training [4] trains a model directly on adversarial examples. The model assumes a potential adversary and minimizes its loss with respect to adversarial examples. Among such adversaries, the PGD adversary has been widely chosen, and the PGD-trained model advocated by [6] shows outstanding robustness against various attack strategies. Inspired by the success of the PGD-trained model, TRADES [23] builds a more robust model by modifying the underlying loss function. Even though adversarial training achieves decent robustness against various attack strategies, it sometimes requires tremendous training time to converge and may severely hurt generalization ability, which hinders its applicability in real-world applications.

Pre-processing Pre-processing, or input transformation, transforms input images for denoising purposes. Some papers [8, 9] focus on the high-level representation to minimize the distance between adversarial and clean examples. [7] borrows the idea of a statistical generative model and reconstructs images into clean-like images using a PixelCNN architecture. [10] applies basic image transformations to remove the underlying adversarial perturbations. [13] and [11] use a GAN architecture to purify the adversarial perturbations by projecting inputs into the legitimate distribution. All of these works claim robustness against various adversarial examples. However, those pre-processing methods are still vulnerable when the adversary can inspect the defended model, i.e., the white-box scenario. Even if the adversary does not know the defense method, [22] shows that such defenses remain vulnerable to gradient-approximation attacks. After the attacks introduced by [22] and [20], pre-processing has been known to be more vulnerable to adversarial examples than adversarial training. Nevertheless, it is robust against various transfer-based attacks and achieves decent generalization ability.

III Method

We describe the methodology in detail. This section is organized as follows. We first present notations used in our work and describe the architecture of the purification model. We then explain each loss function and how it contributes to the training.

III-A Notation

Since our training method stems from CycleGAN [16], we have to construct two different domains: a clean data domain and a corresponding adversarial data domain. To be specific, we denote the original training samples as clean since they are not affected by any attack. Let $\mathcal{C}$ and $\mathcal{A}$ denote the domains with clean and adversarial data, respectively. A target classifier $f$, which is trained with clean data $x$ from $\mathcal{C}$, outputs a classification score over $K$ classes, where $K$ is the number of classes in $\mathcal{C}$. The adversarial domain $\mathcal{A}$ consists of adversarial data $x_{adv}$ with perturbations that are small yet effective enough to deceive the target classifier $f$. An adversary generates an adversarial perturbation $\delta$ for an input $x\in[0,1]^{H\times W\times C}$ and combines the perturbation with the input to construct the adversarial input $x_{adv}=x+\delta$ s.t. $f(x)\neq f(x_{adv})$. An auxiliary classifier $\eta$ computes the probability that decides which domain the given inputs come from. For purification, a generator $\mathcal{T}$ and a discriminator $\mathcal{D}$ learn the image-to-image translation between $\mathcal{C}$ and $\mathcal{A}$ so as to achieve $x^{\prime}\approx x$ as well as $f(x^{\prime})\approx f(x)$, where $x^{\prime}=\mathcal{T}(x_{adv})$. Note that we have two kinds of generator $\mathcal{T}$: $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ and $\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}$. $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ learns to map inputs from $\mathcal{A}$ to $\mathcal{C}$, and $\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}$ does the reverse.

Figure 2: Overview of the architecture of the generator $\mathcal{T}$ and the discriminator $\mathcal{D}$.

III-B Generative Attentional Purification with Cycle-consistent Learning

The main idea of our approach originates from the image-to-image translation task [16, 24], where the goal is to train a model that correctly maps samples across two different domains. We apply this concept to mitigate the risk of adversarial attacks. As we aim to train $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$, we describe the architecture from the perspective of $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$. $\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}$ works in the same manner and is used in cycle-consistent learning to enhance $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$. The architecture is presented in Figure 2.

Generator We use the architecture introduced in [17, 16] and modify it for our task. The generator $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ consists of three main parts: an encoder $\mathrm{E}_{\mathcal{T}_{\mathcal{A}}}$, an auxiliary classifier $\eta_{\mathcal{T}_{\mathcal{A}}}$, and a decoder $\mathcal{G}_{\mathcal{C}}$. The encoder $\mathrm{E}_{\mathcal{T}_{\mathcal{A}}}$ generates encoded feature maps from a batch of adversarial images $x^{B}_{adv}$, where $B$ is the batch size. A global average pooling layer $\psi$ is then applied to generate spatial averages of the feature maps from $\mathrm{E}_{\mathcal{T}_{\mathcal{A}}}$, which are used as the inputs of a fully connected layer. Through the auxiliary classifier $\eta_{\mathcal{T}_{\mathcal{A}}}$, we can compute the probability that given images (e.g., $x_{adv}$) belong to a specific domain (e.g., $\mathcal{A}$). Subsequently, we leverage the weights $w_{\mathcal{T}_{\mathcal{A}}}$, which represent the importance with respect to $x^{B}_{adv}$ computed by $\eta_{\mathcal{T}_{\mathcal{A}}}$, to build the attention map $M^{B}_{\mathcal{T}_{\mathcal{A}}}$ for the decoder $\mathcal{G}_{\mathcal{C}}$, i.e., $M^{B}_{\mathcal{T}_{\mathcal{A}}}=\sum^{n}_{k=1}w^{k}_{\mathcal{T}_{\mathcal{A}}}\cdot\psi(E^{k}_{\mathcal{T}_{\mathcal{A}}}(x^{B}_{adv}))$, where $n$ is the number of encoded feature maps. Since $\eta_{\mathcal{T}_{\mathcal{A}}}$ learns domain-specific features, the activated map $M^{B}_{\mathcal{T}_{\mathcal{A}}}$ can guide the decoder $\mathcal{G}_{\mathcal{C}}$ to focus on more meaningful regions, encouraging a valid source-to-target mapping. We describe how $\eta$ works internally in Section III-C. The decoder $\mathcal{G}_{\mathcal{C}}$ takes $M^{B}_{\mathcal{T}_{\mathcal{A}}}$ as input and reconstructs the purified inputs $x^{B}_{\mathcal{A}2\mathcal{C}}$ for the target domain $\mathcal{C}$. Overall, the generator $\mathcal{T}$ takes $x^{B}_{adv}$ and $x^{B}$ to purify or perturb given images in the following manner:

\begin{split}
&\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x^{B}_{adv})=\mathcal{G}_{\mathcal{C}}\Big(\sum_{k=1}^{n}w_{\mathcal{T}_{\mathcal{A}}}^{k}\cdot\psi(E^{k}_{\mathcal{T}_{\mathcal{A}}}(x^{B}_{adv}))\Big)=\mathcal{G}_{\mathcal{C}}(M^{B}_{\mathcal{T}_{\mathcal{A}}})=x^{B}_{\mathcal{A}2\mathcal{C}}\\
&\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x^{B})=\mathcal{G}_{\mathcal{A}}\Big(\sum_{k=1}^{n}w_{\mathcal{T}_{\mathcal{C}}}^{k}\cdot\psi(E^{k}_{\mathcal{T}_{\mathcal{C}}}(x^{B}))\Big)=\mathcal{G}_{\mathcal{A}}(M^{B}_{\mathcal{T}_{\mathcal{C}}})=x^{B}_{\mathcal{C}2\mathcal{A}}\\
&\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x^{B}_{\mathcal{C}2\mathcal{A}})=\mathcal{G}_{\mathcal{C}}\Big(\sum^{n}_{k=1}w_{\mathcal{T}_{\mathcal{A}}}^{k}\cdot\psi(E^{k}_{\mathcal{T}_{\mathcal{A}}}(x^{B}_{\mathcal{C}2\mathcal{A}}))\Big)=x^{B}_{\mathcal{C}2\mathcal{A}2\mathcal{C}}\\
&\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x^{B}_{\mathcal{A}2\mathcal{C}})=\mathcal{G}_{\mathcal{A}}\Big(\sum^{n}_{k=1}w_{\mathcal{T}_{\mathcal{C}}}^{k}\cdot\psi(E^{k}_{\mathcal{T}_{\mathcal{C}}}(x^{B}_{\mathcal{A}2\mathcal{C}}))\Big)=x^{B}_{\mathcal{A}2\mathcal{C}2\mathcal{A}}
\end{split}

Apart from $x^{B}_{adv}$ and $x^{B}$, the generator $\mathcal{T}$ also takes translated images as inputs, since we train the model under cycle-consistent learning.
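
The sketch below illustrates how such a CAM-style attention map can be formed from the encoder features and the auxiliary classifier's weights, in the spirit of [17, 18]; the layer sizes, the sigmoid re-weighting, and the way the map is handed to the decoder are our assumptions rather than the exact CAP-GAN architecture.

```python
# Sketch of CAM-style guided attention between encoder and decoder, in the
# spirit of [17, 18]. Shapes and the exact composition are assumptions, not
# the authors' code.
import torch
import torch.nn as nn

class CAMAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # psi: global average pooling
        self.domain_logit = nn.Linear(channels, 1)  # auxiliary domain classifier eta

    def forward(self, feat):                        # feat: encoded maps E(x), (B, C, H, W)
        pooled = self.gap(feat).flatten(1)          # psi(E(x)): (B, C)
        logit = self.domain_logit(pooled)           # domain score used for L_cam
        w = self.domain_logit.weight.view(1, -1, 1, 1)   # per-channel importance w_k
        attn = (w * feat).sum(dim=1, keepdim=True)  # M = sum_k w_k * E_k(x)
        guided = feat * torch.sigmoid(attn)         # re-weight encoder features (assumed)
        return guided, logit, attn

# Usage: the guided features go to the decoder G_C, the logit feeds L_cam,
# and attn can be visualized as in Figure 3 (shapes assumed).
```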

Figure 3: We visualize a few samples of each domain to describe the role of $L_{cam}$. For example, $x_{adv}$ is fed into $\eta_{\mathcal{T}_{\mathcal{A}}}$ to compute the probability. Once $\eta_{\mathcal{T}_{\mathcal{A}}}$ has learned which regions represent $\mathcal{A}$, its weights will highlight those regions (i.e., the underlying noise) in the attention map and forward it to the decoder. As a result, the attention map leads the decoder to behave in a meaningful way (i.e., purifying the underlying noise).
Figure 4: Entire training procedure under cycle-consistent learning. Note that $\mathcal{C}$ is represented in blue and $\mathcal{A}$ in red.

Discriminator The discriminator $\mathcal{D}_{\mathcal{C}}$ consists of three parts: an encoder $E_{\mathcal{D}_{\mathcal{C}}}$, an auxiliary classifier $\eta_{\mathcal{D}_{\mathcal{C}}}$, and a convolutional classifier. The discriminator $\mathcal{D}_{\mathcal{C}}$ is responsible for deciding whether given images are valid, using the auxiliary classifier $\eta_{\mathcal{D}_{\mathcal{C}}}$ and the convolutional classifier. Both classifiers update their losses toward distinguishing whether the given images are original or generated. While $\eta_{\mathcal{D}_{\mathcal{C}}}$ takes as input the encoded feature maps generated from $E_{\mathcal{D}_{\mathcal{C}}}$, the convolutional classifier takes as input the attention maps $M_{\mathcal{D}_{\mathcal{C}}}$ activated by the weights of $\eta_{\mathcal{D}_{\mathcal{C}}}$. $\eta_{\mathcal{D}_{\mathcal{C}}}$ classifies $x$ as an image belonging to $\mathcal{C}$ based on the probability, whereas $x_{\mathcal{A}2\mathcal{C}}$ is considered an image not belonging to $\mathcal{C}$. Since $\eta_{\mathcal{D}_{\mathcal{C}}}$ has learned the domain-specific features, the activated regions on $M_{\mathcal{D}_{\mathcal{C}}}$ can be viewed as guidance for improving the generator $\mathcal{T}$. We elaborate on how to achieve the mapping between the two different domains in the following section.

III-C Loss Functions

We introduce the objective function of CAP-GAN in detail. As described above, our loss functions are designed by leveraging both pixel-level and feature-level approaches. We use the CycleGAN training procedure for pixel-level consistency ($L_{GAN}$, $L_{identity}$, and $L_{cyc}$). For feature-level consistency, we use $L_{cam}$, which encourages the model to learn domain-specific features, and $L_{sem}$, which guides the generator to learn semantic information by distilling the knowledge from the pre-trained model $f$.

III-C1 Pixel-level Adaptation

Adversarial Loss

To get visually appealing results, we leverage the GAN [15] training scheme. We apply the Least Squares GAN (LSGAN) [25] objective function to ensure that purified images reside in the vicinity of the original data distribution. In addition, LSGAN shows more stable learning. $L_{GAN}$ can be formulated as below:

\begin{split}
L_{GAN}&=L^{\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}}_{GAN}+L^{\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}}_{GAN}\\
&=\mathbb{E}_{x\sim\mathcal{C}}[(\mathcal{D}_{\mathcal{C}}(x))^{2}]+\mathbb{E}_{x_{adv}\sim\mathcal{A}}[(1-\mathcal{D}_{\mathcal{C}}(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x_{adv})))^{2}]\\
&+\mathbb{E}_{x_{adv}\sim\mathcal{A}}[(\mathcal{D}_{\mathcal{A}}(x_{adv}))^{2}]+\mathbb{E}_{x\sim\mathcal{C}}[(1-\mathcal{D}_{\mathcal{A}}(\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x)))^{2}]
\end{split}
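
A minimal least-squares GAN sketch for the $\mathcal{A}\rightarrow\mathcal{C}$ direction is shown below (the $\mathcal{C}\rightarrow\mathcal{A}$ direction is symmetric); the real/fake target convention and the helper names are assumptions, and the equation above defines the exact form used in CAP-GAN.

```python
# Least-squares GAN loss sketch for the A->C direction. The 1/0 label
# convention is an assumption; see the equation above for the exact form.
import torch
import torch.nn as nn

mse = nn.MSELoss()

def d_loss_lsgan(D_C, x_clean, x_purified):
    """Discriminator D_C: score clean images toward 1 and purified (fake) toward 0."""
    real = D_C(x_clean)
    fake = D_C(x_purified.detach())   # do not backprop into the generator here
    return mse(real, torch.ones_like(real)) + mse(fake, torch.zeros_like(fake))

def g_loss_lsgan(D_C, x_purified):
    """Generator T_{A->C}: make D_C score the purified images as real."""
    fake = D_C(x_purified)
    return mse(fake, torch.ones_like(fake))
```
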
Identity Loss & Cycle Loss

The identity loss $L_{identity}$ helps achieve visual satisfaction and plays a role in preserving color composition [16]. Given an image $x$ from $\mathcal{C}$, $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x)$ should not change $x$, because the decoder $\mathcal{G}_{\mathcal{C}}$ is supposed to learn to decode images belonging to the domain $\mathcal{C}$. If the model does not consider $L_{identity}$, $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ will change non-essential points and affect both the adversarial and cycle-consistency losses in unexpected ways [16].

\begin{split}
L_{identity}&=L^{\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}}_{identity}+L^{\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}}_{identity}\\
&=\mathbb{E}_{x\sim\mathcal{C}}||\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x)-x||_{1}\\
&+\mathbb{E}_{x_{adv}\sim\mathcal{A}}||\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x_{adv})-x_{adv}||_{1}
\end{split}

The cycle-consistency loss $L_{cyc}$ is responsible for preserving semantic information. Under $L_{cyc}$, transformed images should be able to revert back to their source images. Such a constraint tightens the relationship between the two domains and ensures that transformed images retain more semantic information, such as structural features or local textures.

\begin{split}
L_{cyc}&=L^{\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}}_{cyc}+L^{\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}}_{cyc}\\
&=\mathbb{E}_{x_{adv}\sim\mathcal{A}}||\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x_{adv}))-x_{adv}||_{1}\\
&+\mathbb{E}_{x\sim\mathcal{C}}||\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x))-x||_{1}
\end{split}

$L_{identity}$ and $L_{cyc}$ contribute to mapping the two different distributions from the source domain to the target domain. We use the $L_{1}$ norm to compute each loss function because it shows better results than the $L_{2}$ norm in terms of both accuracy and image quality.
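
The two reconstruction terms can be written compactly as below; this is a sketch with placeholder generator handles (T_a2c, T_c2a) rather than the authors' code.

```python
# Sketch of the identity and cycle-consistency terms with the L1 norm,
# matching the equations above (generator names are placeholders).
import torch.nn.functional as F

def identity_loss(T_a2c, T_c2a, x_clean, x_adv):
    # T_{A->C} should leave clean images unchanged, and T_{C->A} adversarial ones.
    return F.l1_loss(T_a2c(x_clean), x_clean) + F.l1_loss(T_c2a(x_adv), x_adv)

def cycle_loss(T_a2c, T_c2a, x_clean, x_adv):
    # A -> C -> A and C -> A -> C round trips should recover the source images.
    return (F.l1_loss(T_c2a(T_a2c(x_adv)), x_adv)
            + F.l1_loss(T_a2c(T_c2a(x_clean)), x_clean))
```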

III-C2 Feature-level Adaptation

CAM Loss

Motivated by [18] and [17], we apply an attention module in the intermediate layer of the generator. The class activation map (CAM) was originally designed to explain how a model works internally. We borrow this idea to guide the generator to figure out which regions need to be improved. Instead of classification, we make the auxiliary classifier learn domain-specific features. By doing so, $\eta_{\mathcal{D}}$ can figure out whether given inputs include domain-specific features, and its weights are able to represent the domain properties. We then generate the attention map over the inputs using the weights of $\eta_{\mathcal{D}}$. According to the attention map generated by $\eta_{\mathcal{D}}$, $\mathcal{T}$ tries to figure out which regions are important. For computing $L^{\mathcal{D}}_{cam}$, we use the LSGAN training loss, as in $L_{GAN}$.

\begin{split}
L^{\mathcal{D}}_{cam}&=L^{\mathcal{D}_{\mathcal{A}}}_{cam}+L^{\mathcal{D}_{\mathcal{C}}}_{cam}\\
&=\mathbb{E}_{x_{adv}\sim\mathcal{A}}[(\eta_{\mathcal{D}_{\mathcal{A}}}(x_{adv}))^{2}]+\mathbb{E}_{x\sim\mathcal{C}}[(1-\eta_{\mathcal{D}_{\mathcal{A}}}(\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x)))^{2}]\\
&+\mathbb{E}_{x\sim\mathcal{C}}[(\eta_{\mathcal{D}_{\mathcal{C}}}(x))^{2}]+\mathbb{E}_{x_{adv}\sim\mathcal{A}}[(1-\eta_{\mathcal{D}_{\mathcal{C}}}(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x_{adv})))^{2}]\\
L^{\mathcal{T}}_{cam}&=L^{\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}}_{cam}+L^{\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}}_{cam}\\
&=-\mathbb{E}_{x_{adv}\sim\mathcal{A}}[\log(\eta_{\mathcal{T}_{\mathcal{A}}}(x_{adv}))]+\mathbb{E}_{x\sim\mathcal{C}}[\log(1-\eta_{\mathcal{T}_{\mathcal{A}}}(x))]\\
&-\mathbb{E}_{x\sim\mathcal{C}}[\log(\eta_{\mathcal{T}_{\mathcal{C}}}(x))]+\mathbb{E}_{x_{adv}\sim\mathcal{A}}[\log(1-\eta_{\mathcal{T}_{\mathcal{C}}}(\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x_{adv})))]\\
L_{cam}&=L^{\mathcal{D}}_{cam}+L^{\mathcal{T}}_{cam}
\end{split}

Besides the GAN-style training in $L_{cam}$, we additionally add a term that encourages the generator $\mathcal{T}$ to recognize the source domain, since it turns out that the discriminator easily overwhelms the generator during training. The auxiliary classifier $\eta_{\mathcal{T}}$ performs binary classification to decide where the images come from. For instance, the probability computed from $\eta_{\mathcal{T}_{\mathcal{A}}}$, the auxiliary classifier learning the domain $\mathcal{A}$, should be close to 1 if the inputs come from $\mathcal{A}$, as opposed to 0 if the inputs come from $\mathcal{C}$. We provide visual intuition about $L_{cam}$ in Figure 3.
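
A sketch of these two CAM terms is given below, assuming the auxiliary classifiers output raw scores; the real/fake target convention and the helper signatures are ours, while the equations above fix the exact form.

```python
# Sketch of L_cam: least-squares terms for the discriminator-side auxiliary
# classifiers and binary cross-entropy terms that make the generator-side
# auxiliary classifiers recognize their source domain.
import torch
import torch.nn.functional as F

def cam_loss_D(eta_DA_real, eta_DA_fake, eta_DC_real, eta_DC_fake):
    """eta_D*_real/fake: auxiliary-classifier scores on real vs. translated images."""
    ls = lambda s, t: ((s - t) ** 2).mean()
    return (ls(eta_DA_real, 1.0) + ls(eta_DA_fake, 0.0)
            + ls(eta_DC_real, 1.0) + ls(eta_DC_fake, 0.0))

def cam_loss_T(eta_TA_on_adv, eta_TA_on_clean, eta_TC_on_clean, eta_TC_on_adv):
    """Binary classification: eta_T_A should score domain-A inputs as 1 and domain-C as 0."""
    bce = F.binary_cross_entropy_with_logits
    return (bce(eta_TA_on_adv, torch.ones_like(eta_TA_on_adv))
            + bce(eta_TA_on_clean, torch.zeros_like(eta_TA_on_clean))
            + bce(eta_TC_on_clean, torch.ones_like(eta_TC_on_clean))
            + bce(eta_TC_on_adv, torch.zeros_like(eta_TC_on_adv)))
```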

Semantic Loss

CycleGAN is designed not for classification but for producing visually appealing images with pixel-level consistency, as in the image-to-image translation task [16, 17]. Even if we obtain visually similar results at the pixel level, low-level appearance consistency does not necessarily bring semantic consistency such as classification accuracy (see Figure 1) [26]. With this perspective, we design the generator to consider semantic consistency by distilling knowledge from the pre-trained model $f$. In other words, we impose semantic-level similarity between generated and original data in both domains.

We combine the knowledge of $f$ into $\mathcal{T}$ to provide a better understanding of feature-level consistency. That is, we conjecture that the well-trained model $f$ can convey useful information to $\mathcal{T}$, including the weakness of $f$ against the domain $\mathcal{A}$. Instead of using ground-truth labels, we use soft targets [27], where the class-wise scores are estimated with a softmax function, by projecting the images onto the pre-trained embedding space of $f$. We adopt the concept of cycle-consistent learning in the semantic loss landscape as well. We compute the logit values $z^{fake}_{\mathcal{C}}=f(x^{fake}_{\mathcal{C}})$ and $z^{fake}_{\mathcal{A}}=f(x^{fake}_{\mathcal{A}})$ under cycle-consistent learning. $x^{fake}_{\mathcal{C}}$ and $x^{fake}_{\mathcal{A}}$ represent the generated images with respect to the domains $\mathcal{C}$ and $\mathcal{A}$ respectively, i.e., $\{x_{\mathcal{A}2\mathcal{C}},x_{\mathcal{C}2\mathcal{C}},x_{\mathcal{C}2\mathcal{A}2\mathcal{C}}\}$ and $\{x_{\mathcal{C}2\mathcal{A}},x_{\mathcal{A}2\mathcal{A}},x_{\mathcal{A}2\mathcal{C}2\mathcal{A}}\}$. We match them to the soft targets $z^{real}_{\mathcal{C}}=f(x)$ and $z^{real}_{\mathcal{A}}=f(x_{adv})$, respectively. The cycle-consistent semantic term can be formulated as below:

\begin{split}
L_{sem}&=T^{2}\cdot\mathrm{KL}\big(p(z^{real}_{\mathcal{C}},T),\,p(z^{fake}_{\mathcal{C}},T)\big)\\
&+T^{2}\cdot\mathrm{KL}\big(p(z^{real}_{\mathcal{A}},T),\,p(z^{fake}_{\mathcal{A}},T)\big)
\end{split}

where $\mathrm{KL}$ is the Kullback-Leibler divergence, and $p$ is the class-wise score computed by a softmax function with a scaling temperature $T$, $p(z_{i},T)=\frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)}$. $T$ is in charge of smoothing the distribution over the classes when the logits $z$ are given. We discuss how $T$ affects the performance in Section IV-E.
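
The distillation term can be sketched as follows in PyTorch, with the pre-trained classifier f kept frozen; the function name and the way the per-domain terms are summed are assumptions.

```python
# Sketch of L_sem: temperature-scaled soft targets from the frozen pre-trained
# classifier f are matched with a KL divergence, following the equation above.
import torch
import torch.nn.functional as F

def semantic_loss(f, x_real, x_fake, T=10.0):
    """T^2 * KL(p(f(x_real)/T) || p(f(x_fake)/T)); f is assumed frozen."""
    with torch.no_grad():
        target = F.softmax(f(x_real) / T, dim=1)       # soft targets from z_real
    log_pred = F.log_softmax(f(x_fake) / T, dim=1)     # scores of generated images z_fake
    return (T ** 2) * F.kl_div(log_pred, target, reduction="batchmean")

# L_sem sums this term over both domains, e.g. clean-side fakes
# {x_A2C, x_C2C, x_C2A2C} against f(x), and adversarial-side fakes against f(x_adv).
```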

III-C3 Entire Objective Function

To utilize both pixel-level and feature-level adaptation, the CAP-GAN objective function is a weighted average of the loss functions. $\alpha$ is a balancing parameter that adjusts the contributions of $L_{pixel}$ and $L_{feature}$, where $L_{pixel}=L_{GAN}+L_{identity}+L_{cyc}$ and $L_{feature}=L_{cam}+L_{sem}$. The pixel-level loss is conducive to preserving structural shapes and texture features, whereas the feature-level loss encourages the model to decode images in a semantically meaningful way. We depict the entire training procedure in Figure 4.

$L_{CAP}=\alpha\cdot L_{pixel}+(1-\alpha)\cdot L_{feature}$

IV Experiments

We conduct experiments to compare the robustness with other state-of-the-art defense methods. Note that we measure the robustness by the classification accuracy under $l_{\infty}$-ball attacks, except for the $l_{2}$ CW attack with 1000 steps. We evaluate the robustness under both black-box and white-box settings with $\epsilon=8$. We then provide an ablation study in Section IV-E to show how the performance is affected by each hyper-parameter.

IV-A Experiment Details

We train the proposed model on the CIFAR-10 dataset with a single RTX TITAN 2080. We use a simple ResNet model as the victim model following [6]. We train the model for up to 200 epochs using the Adam optimizer ($\beta=(0.5,0.999)$) and set the batch size to 128. We use an initial learning rate of 1e-4 and adopt cosine annealing learning rate decay for stable training. To construct the adversarial domain $\mathcal{A}$, we use the FGSM ($\epsilon=8$) attack [4]. We set $\alpha=0.7$ and $T=10.0$ to train the model. For a fair comparison, we implement all the defenses using the same codebase.
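
A sketch of this training configuration is given below; whether the generators and discriminators share one schedule, and the exact parameter grouping, are assumptions.

```python
# Sketch of the optimizer/scheduler setup described above: Adam with
# beta=(0.5, 0.999), initial learning rate 1e-4, cosine annealing over 200
# epochs, batch size 128. The network handles are placeholders.
import itertools
import torch

def build_optimizers(T_a2c, T_c2a, D_c, D_a, epochs=200, lr=1e-4):
    g_params = itertools.chain(T_a2c.parameters(), T_c2a.parameters())
    d_params = itertools.chain(D_c.parameters(), D_a.parameters())
    opt_g = torch.optim.Adam(g_params, lr=lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(d_params, lr=lr, betas=(0.5, 0.999))
    sched_g = torch.optim.lr_scheduler.CosineAnnealingLR(opt_g, T_max=epochs)
    sched_d = torch.optim.lr_scheduler.CosineAnnealingLR(opt_d, T_max=epochs)
    return opt_g, opt_d, sched_g, sched_d
```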

IV-B Evaluation on Black-Box Attacks

A black-box attack is an attack in which the adversary has limited knowledge of the defense. In this section, we assume that the black-box adversary is not aware of the defense but has knowledge of the target model $f$. We evaluate the robustness of our defense against various attack strategies. We follow the black-box assumption introduced in [28] and use the same attack settings as [6]. Specifically, we provide the architecture of the target model $f$ to the adversary and train it with the same training data. With the fully trained surrogate model $f^{\prime}$, the adversary then generates transferable adversarial examples using gradient-based attacks such as FGSM, PGD$_{n}$, and CW ($\kappa=20$), where $n$ is the number of steps and $\kappa$ is the confidence. In addition, to endanger the defense under a harsher transfer attack, we apply the momentum iterative FGSM, called MI-FGSM [29], with 20 iterative steps.
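
For reference, a minimal MI-FGSM sketch on a surrogate model is shown below; the decay factor and step size are assumptions (the text above only fixes 20 iterations and $\epsilon$).

```python
# Minimal MI-FGSM sketch [29]: PGD-style iterations with an accumulated,
# L1-normalized gradient momentum (decay factor mu=1.0 assumed). The surrogate
# model and inputs in [0, 1] are placeholders.
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps=8 / 255, steps=20, mu=1.0):
    alpha = eps / steps
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Accumulate momentum with the gradient normalized by its L1 norm.
        norm = grad.abs().flatten(1).sum(dim=1).view(-1, 1, 1, 1) + 1e-12
        g = mu * g + grad / norm
        with torch.no_grad():
            x_adv = x_adv + alpha * g.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```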

As shown in Table I, CAP-GAN shows promising results against transfer-based attacks and outperforms other pre-processing based methods, including the adversarial training (AT) model. We trained the AT model using a PGD$_{7}$ adversary following [6]. Interestingly, FGSM works better than its iterative counterpart PGD$_{n}$ against APE, HGD, and CAP-GAN. This observation is in line with the tendency of gradient-masking defenses to make iterative attacks get stuck in a local minimum [22]. THM [30] achieves superb results against single-step attacks such as FGSM, whereas it is relatively vulnerable to iterative attacks such as MI-FGSM and PGD. Notably, HGD obtains performance similar to CAP-GAN.

Even though we use a surrogate model $f^{\prime}$ with the same training settings, it cannot be guaranteed that $x_{adv}$ generated from $f^{\prime}$ transfers to $f$ [20]. We thus measured the attack success rate, i.e., the rate of misclassification on the target model $f$ when the attack strategy is applied, to judge whether the attack has enough capacity to deceive the model. We observed that generating adversarial examples with $\epsilon=8$ is insufficient to assume a strong adversary. We conjecture that the surrogate model $f^{\prime}$ might not share a similar gradient direction with $f$, implying that crafting adversarial examples along the gradient direction of $f^{\prime}$ requires a larger magnitude of perturbation to achieve a high attack success rate [28]. Therefore, we evaluate the defenses under various magnitudes of perturbation, $\epsilon\in\{4,8,12,16,32\}$, to demonstrate clearer robustness against various attack strategies. The results are presented in Figure 5.

Method Accuracy(%)
Clean FGSM MI-FGSM20 PGD7/40 CW
JPEG [31] 74.76 59.53 54.86 67.1/66.03 71.06
THM[30] 89.96 83.25 45.81 71.17/30.63 -
APE[13] 84.15 67.52 61.54 73.21/72.32 80.3
HGD [8] 87.68 82.35 80.37 84.55/84.39 85.36
AT[6] 78.1 77.1 76.9 77.49/77.46 78.36
CAP-GAN 85.62 82.73 81.96 84.22/84.43 84.78
TABLE I: Black-box results for CIFAR-10: We train the surrogate model $f^{\prime}$ to generate transferable adversarial examples. Before being fed into the target classifier $f$, those examples are projected into $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$.
Figure 5: Black-box results for CIFAR-10: We vary the magnitude of the perturbation to examine the adversarial robustness. In all defenses, the degradation occurs as the magnitude of perturbation increases. We also provide the attack success rate (AT_success) to examine the capacity of the adversary.

IV-C Extensive Black-Box Attacks

Providing $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ to the adversary can shed light on the effectiveness of the proposed defense. Besides gradient-based attacks, we also apply score-based attacks, SPSA [20] and the Square attack [21], to launch strong black-box attacks. In SPSA and Square attack, the adversary is allowed to directly access the outputs of the defended model $f(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(\cdot))$. Since these gradient-free attacks do not rely on local gradient information, they provide a different aspect of the robustness. We set the batch size to 2048 for SPSA to make it more potent. We mount the Square attack with 1 and 5 restarts.

Apart from gradient-free attacks, we also consider an adaptive adversary who exploits the fundamental weakness of the proposed model. We instantiate the adaptive adversary as a BPDA-I attack, where the adversary crafts the corresponding adversarial examples using PGD$_{n}$ but backpropagates the gradients through an identity mapping. According to the experimental results in [22], many pre-processing based methods [7, 10, 30] are circumvented under the BPDA-I attack. Note that the BPDA attack is typically applied when the defense is non-differentiable, but we utilize this attack scheme to circumvent our defense because our model is designed to fulfill $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x_{adv})\approx x$ and $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x)\approx x$ simultaneously. As advocated by [22], the BPDA attack can be applied to verify the vulnerability of a model even if the model is differentiable. We thus assume that a strong black-box adversary exploits the internal property $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(x)\approx x$ by replacing the backward pass with the identity function to approximate the gradients over the inputs, i.e., leveraging the fact that $\nabla_{x}\mathcal{T}(x)\approx\nabla_{x}x=1$, where $x\in\{\mathcal{C},\mathcal{A}\}$. The experimental results are presented in Table II. Our proposed model outperforms the other methods across all attacks.
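
The sketch below shows one way to implement such a BPDA-I adversary with a straight-through backward pass; the purifier and classifier handles and the PGD hyper-parameters are placeholders.

```python
# Sketch of the BPDA-I adversary: the forward pass goes through the purifier
# T_{A->C}, but the backward pass treats it as the identity (straight-through),
# i.e. grad_x T(x) is approximated by 1.
import torch
import torch.nn.functional as F

class IdentityBackwardPurifier(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, purifier):
        with torch.no_grad():
            return purifier(x)

    @staticmethod
    def backward(ctx, grad_output):
        # BPDA with identity: pass the gradient straight through the purifier.
        return grad_output, None

def bpda_i_pgd(f, purifier, x, y, eps=8 / 255, alpha=2 / 255, steps=40):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = f(IdentityBackwardPurifier.apply(x_adv, purifier))
        loss = F.cross_entropy(logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```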

Method Accuracy(%)
SPSA Square (r=1/5) BPDA-I40
JPEG [31] 74.9 16.03/7.35 0.59
APE [13] 51.33 33.57/24.03 0.09
HGD [8] 31.26 23.88/14.26 71.78
CAP-GAN 79.28 53.07/43.6 78.27
TABLE II: Extensive black-box results for CIFAR-10: We provide the defense $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ to examine the robustness against the gradient-free and adaptive attacks. That is, the adversary can access the outputs of $f(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(\cdot))$.

IV-D Evaluation on White-Box Attacks

A white-box adversary has full knowledge of the target model and the defended model when launching adversarial attacks. We consider two kinds of white-box settings: the conventional white-box setting following [8, 13] and [7], where we provide the target model $f$ to the adversary, and the end-to-end attack, where we allow the adversary to directly compute the gradient with respect to the combined model $f(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(\cdot))$, following [32] and [22].

White-box Attack In Table III, the target model reaches nearly zero accuracy when we apply white-box iterative attacks even with $\epsilon=8$. The accuracy of most methods drops significantly compared to the black-box setting in Table I, whereas HGD and CAP-GAN show relatively consistent results. The intuition behind both models is the use of feature-level consistency when optimizing the denoising loss function. In contrast, models that try to align only the pixel-level distribution cannot achieve consistent robustness. This observation is in line with our assumption described in Section I, where pixel-level consistency cannot assure feature-level consistency. As observed by [8, 4], residual perturbations continuously incur an error amplification effect in several intermediate layers, resulting in misclassification at the prediction layer. Feature-level consistency additionally aligns the shift caused by the residual perturbations at the prediction layer, which helps achieve reasonable purification and leads to consistent performance.

Method Accuracy(%)
Clean FGSM MI-FGSM20 PGD7/40 CW
ResNet 92.8 28.81 0.94 0.56/0.01 0.28
JPEG [31] 74.76 47.37 34.33 53.07/50.7 68.23
THM[30] 89.96 66.25 20.01 53.82/10.74 -
APE[13] 84.15 49.53 30.76 54.94/55.32 72.07
HGD [8] 87.68 81.55 78.95 84.65/84.77 86.07
CAP-GAN 85.62 82.48 81.45 84.07/83.93 84.9
TABLE III: White-box results for CIFAR-10: We provide the trained weights of the target model $f$ to generate strong adversarial examples.

End-to-End Attack Since APE and HGD are differentiable, we pick these methods for the end-to-end attack to compare robustness. Table IV shows the robustness under the end-to-end white-box setting. As expected, gradient-based attacks are sufficient to circumvent APE and HGD. In contrast, adversarial training (AT) shows relatively consistent performance across all attack strategies. Overall, CAP-GAN surpasses the other differentiable pre-processing defenses but is less robust than AT in the end-to-end attack setting.

Method Accuracy(%)
Clean FGSM MI-FGSM20 PGD7/40
APE [13] 84.15 18.96 1.13 0.58/0.02
HGD [8] 87.68 26.6 0.63 0.67/0.01
AT [6] 78.1 58.46 49.54 48.09/45.22
CAP-GAN 85.62 53.11 19.24 25.17/6.58
TABLE IV: End-to-end attack results for CIFAR-10: We provide the defense $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ to examine the robustness against gradient-based attacks. That is, the adversary crafts the adversarial examples using $f(\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}(\cdot))$.

IV-E Ablation Study

Parameter Settings We perform a simple ablation study to explain how each parameter affects the performance. CAP-GAN has two parameters: $\alpha$ and $T$. $\alpha$ adjusts the contribution between pixel-level and feature-level consistency. In $L_{sem}$, $T$ is a scaling parameter that smooths the logits before feeding them into the softmax function. A higher value of $T$ yields a smoother logit space, generating softer target scores over the inputs. According to Table V, we experimentally found that CAP-GAN obtains the best result with $\alpha=0.7$ and $T=10.0$. Note that the accuracy under the BPDA-I attack drops as $\alpha$ increases, implying that $\nabla_{x}\mathcal{T}(x)\approx\nabla_{x}x$ gets closer to 1. This observation provides evidence for our assumption that the BPDA-I attack can act as an adaptive adversary against CAP-GAN.

Model Accuracy(%)
Clean MI-FGSM20 PGD7/40 BPDA-I40
$\alpha,T=0.7,1.0$ 81.53 78.47 80.25/80.24 71.35
$\alpha,T=0.7,5.0$ 84.81 80.97 83.39/83.29 79.19
$\alpha,T=0.7,10.0$ 85.62 81.96 84.22/84.43 78.27
$\alpha,T=0.5,10.0$ 83.96 80.94 82.67/82.67 82.71
$\alpha,T=0.7,10.0$ 85.62 81.96 84.22/84.43 78.27
$\alpha,T=0.9,10.0$ 85.3 81.43 83.76/83.76 63.19
TABLE V: Ablation study for balancing parameters: We performed black-box attacks to evaluate each setting.

Loss Functions We describe how each loss function contributes internally. Besides the pixel-level alignment with $L_{pixel}$, we explicitly add the feature-level terms $L_{feature}=L_{cam}+L_{sem}$ to tune the generator. $L_{cam}$ helps the model understand domain-specific features, and $L_{sem}$ conveys distilled knowledge from the pre-trained model to guide $\mathcal{T}$. Furthermore, we adopt the CycleGAN training scheme to enhance the expected result. As shown in Table VI, the feature-level terms lead to improvements in adversarial robustness across all attacks. We initially conjectured that $L_{cam}$ would enhance purification by encouraging a valid source-to-target mapping, but on its own it rather degrades the performance. When we combine it with the semantic term $L_{sem}$, however, the model significantly improves its capacity. We also found that simply combining the two loss terms does not yield the positive effects. At this point, cycle-consistent learning is an important factor in deriving the expected improvement. This is because $\mathcal{T}_{\mathcal{A}\rightarrow\mathcal{C}}$ takes inputs not only from the domain $\mathcal{A}$ but also from $\mathcal{T}_{\mathcal{C}\rightarrow\mathcal{A}}(x)=x_{\mathcal{C}2\mathcal{A}}$ to fulfill cycle-consistency. Cycle-consistent learning can regularize the entire training with more diverse inputs. Thus, the additional inputs convey meaningful hidden information to the model, encouraging it to understand the underlying relationship between the two different domains.

Model Accuracy(%)
Clean MI-FGSM20 PGD7/40 BPDA-I40
$L_{pixel}$ 90.5 57.03 77.7/77.67 0.34
$L_{pixel}$ + $L_{cam}$ 88.0 50.9 75.13/75.05 0.2
$L_{pixel}$ + $L_{sem}$ 83.53 78.79 81.64/81.75 38.71
CAP-GAN w/o Cyc 82.39 78.58 80.52/80.67 58.18
CAP-GAN 85.62 81.96 84.22/84.43 78.27
TABLE VI: Ablation study for loss terms: $L_{pixel}$ denotes the pixel-level loss terms, which consist of $L_{GAN}$, $L_{identity}$, and $L_{cyc}$. CAP-GAN w/o Cyc indicates the model without cycle-consistent learning at both the pixel level and the feature level.

V Conclusion

We propose CAP-GAN to purify adversarial examples by leveraging both pixel-level and feature-level adaptation. While previous approaches construct task-specific functions to obtain feature-level consistency implicitly, we explicitly introduce additional guidance terms to achieve feature-level consistency. Moreover, we apply cycle-consistent learning to regularize the model more effectively. CAP-GAN achieves adversarial robustness against various attack strategies and outperforms other pre-processing methods under both black-box and white-box settings.

Acknowledgement

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2021-2020-0-01489) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation), and by the Technology Innovation Program (or Industrial Strategic Technology Development Program, 2000682, Development of Automated Driving Systems and Evaluation) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [2] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
  • [3] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
  • [4] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [5] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [6] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
  • [7] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” arXiv preprint arXiv:1710.10766, 2017.
  • [8] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu, “Defense against adversarial attacks using high-level representation guided denoiser,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1778–1787.
  • [9] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He, “Feature denoising for improving adversarial robustness,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 501–509.
  • [10] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, “Countering adversarial images using input transformations,” arXiv preprint arXiv:1711.00117, 2017.
  • [11] P. Samangouei, M. Kabkab, and R. Chellappa, “Defense-gan: Protecting classifiers against adversarial attacks using generative models,” arXiv preprint arXiv:1805.06605, 2018.
  • [12] C. Xiao and C. Zheng, “One man’s trash is another man’s treasure: Resisting adversarial examples by adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 412–421.
  • [13] S. Shen, G. Jin, K. Gao, and Y. Zhang, “Ape-gan: Adversarial perturbation elimination with gan,” arXiv preprint arXiv:1707.05474, 2017.
  • [14] M. Naseer, S. Khan, M. Hayat, F. S. Khan, and F. Porikli, “A self-supervised approach for adversarial robustness,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [17] J. Kim, M. Kim, H. Kang, and K. Lee, “U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation,” arXiv preprint arXiv:1907.10830, 2019.
  • [18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
  • [19] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 ieee symposium on security and privacy (sp).   IEEE, 2017, pp. 39–57.
  • [20] J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli, “Adversarial risk and the dangers of evaluating against weak attacks,” arXiv preprint arXiv:1802.05666, 2018.
  • [21] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” in European Conference on Computer Vision.   Springer, 2020, pp. 484–501.
  • [22] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” arXiv preprint arXiv:1802.00420, 2018.
  • [23] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan, “Theoretically principled trade-off between robustness and accuracy,” arXiv preprint arXiv:1901.08573, 2019.
  • [24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [25] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802.
  • [26] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International conference on machine learning.   PMLR, 2018, pp. 1989–1998.
  • [27] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [28] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” arXiv preprint arXiv:1611.02770, 2016.
  • [29] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9185–9193.
  • [30] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, “Thermometer encoding: One hot way to resist adversarial examples,” in International Conference on Learning Representations, 2018.
  • [31] G. K. Dziugaite, Z. Ghahramani, and D. M. Roy, “A study of the effect of jpg compression on adversarial images,” arXiv preprint arXiv:1608.00853, 2016.
  • [32] V. Zantedeschi, M.-I. Nicolae, and A. Rawat, “Efficient defenses against adversarial attacks,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 39–49.