
AdvFlow: Inconspicuous Black-box Adversarial Attacks using Normalizing Flows

Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie
School of Computing and Information Systems
The University of Melbourne
Parkville, Victoria, Australia
[email protected]
Abstract

Deep learning classifiers are susceptible to well-crafted, imperceptible variations of their inputs, known as adversarial attacks. In this regard, the study of powerful attack models sheds light on the sources of vulnerability in these classifiers, hopefully leading to more robust ones. In this paper, we introduce AdvFlow: a novel black-box adversarial attack method on image classifiers that exploits the power of normalizing flows to model the density of adversarial examples around a given target image. We see that the proposed method generates adversaries that closely follow the clean data distribution, a property which makes their detection less likely. Also, our experimental results show competitive performance of the proposed approach with some of the existing attack methods on defended classifiers. The code is available at https://github.com/hmdolatabadi/AdvFlow.

1 Introduction

Deep neural networks (DNNs) have been successfully applied to a wide variety of machine learning tasks. For instance, trained neural networks can reach human-level accuracy in image classification [46]. However, Szegedy et al., [53] showed that such classifiers can be fooled by adding an imperceptible perturbation to the input image. Since then, there has been extensive research in this area, known as adversarial machine learning, aiming to design more powerful attacks and to devise more robust neural networks. Today, this area extends well beyond images, with classifiers of video [25], graphs [65], text [34], and other data types also coming under attack.

In this regard, the design of stronger adversarial attacks plays a crucial role in understanding the nature of possible real-world threats. The ultimate goal of such studies is to help neural networks become more robust against such adversaries. This line of research is extremely important as even the slightest flaw in some real-world applications of DNNs such as self-driving cars can have severe, irreparable consequences [12].

In general, adversarial attack approaches can be classified into two broad categories: white-box and black-box. In white-box adversarial attacks, the assumption is that the threat model has full access to the target DNN. This way, adversaries can leverage their knowledge about the target model to generate adversarial examples (for instance, by taking the gradient of the neural network). In contrast, black-box attacks assume no a priori knowledge of the internal structure of the target model. Instead, they can only query the model on some inputs and work with the labels or confidence scores returned for them [62]. Thus, black-box attacks make more realistic assumptions. Early black-box attacks mostly relied on the transferability of white-box adversarial examples to unseen models [42]. Recently, however, there has been more research on attacking black-box models directly.

In this paper, we introduce AdvFlow: a black-box adversarial attack that makes use of pre-trained normalizing flows to generate adversarial examples. In particular, we utilize flow-based models pre-trained on clean data to model the probability distribution of possible adversarial examples around a given image. Then, by exploiting the notion of search gradients from natural evolution strategies (NES) [59, 58], we solve the black-box optimization problem associated with adversarial example generation to adjust this distribution. At the end of this process, we obtain a distribution whose realizations are likely to be adversarial. Since this density is constructed on top of the original data distribution estimated by normalizing flows, the generated perturbations take on the structure of the data rather than looking like additive noise (see Figure 1). This property makes it harder for adversarial example detectors to distinguish AdvFlow examples from clean data, as such detectors often assume that adversaries come from a different distribution than the clean data. Moreover, we prove a lemma showing that adversarial perturbations generated by the proposed approach can be approximated by a normal distribution with dependent components. We then put our model to the test and show its effectiveness in generating adversarial examples with 1) lower detectability, 2) higher success rate, 3) fewer queries, and 4) higher transferability on defended models compared to the similar method of $\mathcal{N}$Attack [33].

In summary, we make the following contributions:

  • We introduce AdvFlow, a black-box adversarial attack that leverages the power of normalizing flows in modeling data distributions. To the best of our knowledge, this is the first work that explores the use of flow-based models in the design of adversarial attacks.

  • We prove a lemma about the adversarial perturbations generated by AdvFlow. As a result of this lemma, we deduce that AdvFlow can generate perturbations with dependent elements, while this is not the case for $\mathcal{N}$Attack [33].

  • We show the power of the proposed approach in generating adversarial examples that have a similar distribution to the data. As a result, our method is able to mislead adversarial example detectors, since they often assume adversaries come from a different distribution than the clean data. We also evaluate the performance of the proposed approach in attacking some of the most recent adversarial training defense techniques.

[Figure 1: five image panels omitted.]
Figure 1: Adversarial perturbations generated by AdvFlow take the structure of the original image into account, resulting in less detectable adversaries compared to $\mathcal{N}$Attack [33] (see Section 4.1). The classifier is a VGG19 [50] trained to detect smiles in CelebA [36] faces. (a) AdvFlow magnified difference, (b) AdvFlow adversarial example, (c) clean image, (d) $\mathcal{N}$Attack adversarial example, (e) $\mathcal{N}$Attack magnified difference.

2 Related Work

In this section, we review some of the most closely related work to our proposed approach. For a complete review of (black-box) adversarial attacks, we refer the interested reader to [62, 3].

Black-box Adversarial Attacks.

In one of the earliest black-box approaches, Chen et al., [5] used the idea of zeroth-order optimization and proposed a method called ZOO. In particular, ZOO uses queries to the target neural network to build a zeroth-order gradient estimator. It then utilizes the estimated gradient to minimize a Carlini and Wagner (C&W) loss [4] and find an adversarial image. Later, inspired by [58, 47], Ilyas et al., [22] tried to estimate the DNN gradient using a normally distributed search density. In particular, they estimate the gradient of the classifier $\mathcal{C}(\mathbf{x})$ with

\[\nabla_{\mathbf{x}}\mathcal{C}(\mathbf{x})\approx\mathbb{E}_{\mathcal{N}(\mathbf{z}|\mathbf{x},\sigma^{2}I)}\left[\mathcal{C}(\mathbf{z})\,\nabla_{\mathbf{x}}\log\mathcal{N}(\mathbf{z}|\mathbf{x},\sigma^{2}I)\right],\]

which only requires querying the black-box model $\mathcal{C}(\mathbf{x})$. Having this gradient estimate, Ilyas et al., [22] then take a projected gradient descent (PGD) step to minimize their objective for generating an adversarial example. This idea is further developed in the construction of $\mathcal{N}$Attack [33]. Specifically, instead of trying to minimize the adversarial example generation objective directly, they aim to fit a distribution around the clean data so that its realizations are likely to be adversarial (see Section 3.3 for more details). In another piece of work, Ilyas et al., [23] observe that the gradients used in adversarial example generation by PGD exhibit a high correlation both in time and across data. Thus, the number of queries needed to attack a black-box model can be reduced if one incorporates this prior knowledge about the gradients. To this end, Ilyas et al., [23] use a bandit-optimization technique to integrate these priors into their attack, resulting in a method called Bandits & Priors. Finally, the Simple Black-box Attack (SimBA) [16] is a straightforward, intuitive approach to constructing black-box adversarial examples. It is first argued that for any particular direction $\mathbf{q}$ and step size $\epsilon>0$, either $\mathbf{x}-\epsilon\mathbf{q}$ or $\mathbf{x}+\epsilon\mathbf{q}$ is going to decrease the probability assigned to the correct class label of the input image $\mathbf{x}$. Thus, we are likely to find an adversary by iteratively taking such steps. The vectors $\mathbf{q}$ are selected from a set of orthonormal candidate vectors $Q$. Guo et al., [16] use the Discrete Cosine Transform (DCT) to construct such a set, exploiting the observation that “random noise in low-frequency space is more likely to be adversarial” [15].
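To make the search-gradient estimator of Ilyas et al., [22] concrete, the following minimal NumPy sketch averages score-weighted losses over Gaussian samples drawn around $\mathbf{x}$. The black-box query function `loss_fn`, the sample budget, and the antithetic-sampling option are illustrative assumptions rather than the exact setup of [22].

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.1, n_samples=50, antithetic=True):
    """Estimate the gradient of a black-box loss at x using a Gaussian search
    distribution N(x, sigma^2 I); only function queries are needed."""
    grad = np.zeros_like(x)
    n_pairs = n_samples // 2 if antithetic else n_samples
    for _ in range(n_pairs):
        u = np.random.randn(*x.shape)            # z = x + sigma * u
        grad += loss_fn(x + sigma * u) * u       # score of N(z|x, sigma^2 I) equals u / sigma
        if antithetic:
            grad -= loss_fn(x - sigma * u) * u   # mirrored sample for variance reduction
    return grad / (n_samples * sigma)
```

With `loss_fn` set to the classifier's loss, such an estimate can then be plugged into a PGD step, as done by Ilyas et al., [22].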

Adversarial Attacks using Generative Models.

There has been prior work that utilizes the power of generative models (mostly generative adversarial networks (GANs)) to model adversarial perturbations and attack DNNs [2, 61, 56, 20]. These models mainly target white-box attacks: they require training their parameters to produce adversarial perturbations using a cost function that involves taking the gradient of a target network. To adapt to black-box settings, they replace this target network with either a distilled version of it [61] or a substitute source model [20]. However, as we will see in Section 3, the flow-based part of our model is only pre-trained on clean training data using the maximum likelihood objective of Eq. (5). Thus, AdvFlow can be adapted to any target classifier of the same dataset without the need to train it again. Moreover, while prior work is mainly concerned with generating adversarial perturbations (for example [55]), here we use the normalizing flow's output as the adversarial example directly. In this sense, our work is more similar to [51], which generates unrestricted adversarial examples using GANs in a white-box setting, and falls under functional adversarial attacks [31]. However, besides being black-box, AdvFlow restricts the output to lie in the vicinity of the original image.

3 Proposed Method

In this section, we propose our attack method. First, we define the problem of black-box adversarial attacks formally. Next, we go over normalizing flows and see how we can train a flow-based model. Then, we review the idea of Natural Evolution Strategies (NES) [59, 58] and $\mathcal{N}$Attack [33]. Afterward, we show how normalizing flows can be combined with NES in the context of black-box adversarial attacks, resulting in a method we call AdvFlow. Finally, we prove a lemma about the nature of the perturbations generated by the proposed approach and show that $\mathcal{N}$Attack cannot produce the adversarial perturbations generated by AdvFlow. Our results in Section 4 support this lemma: there, we see that AdvFlow can generate adversarial examples that are less detectable than the ones generated by $\mathcal{N}$Attack [33] due to its perturbation structure.

3.1 Problem Statement

Let $\mathcal{C}(\cdot):\mathcal{X}^{d}\rightarrow\mathcal{P}^{k}$ denote a DNN classifier. Assume that the classifier takes a $d$-dimensional input $\mathbf{x}\in\mathcal{X}^{d}$ and outputs a vector $\mathbf{p}\in\mathcal{P}^{k}$. Each element of the vector $\mathbf{p}$ indicates the probability of the input belonging to one of the $k$ classes that the classifier is trying to distinguish. Furthermore, let $y$ denote the correct class label of the data. In other words, if the $y$-th element of the classifier output $\mathbf{p}$ is larger than the rest, then the input has been correctly classified. Finally, let the well-known Carlini and Wagner (C&W) loss [4] be defined as¹

\[\mathcal{L}(\mathbf{x}^{\prime})=\max\big(0,\ \log\mathcal{C}(\mathbf{x}^{\prime})_{y}-\max_{c\neq y}\log\mathcal{C}(\mathbf{x}^{\prime})_{c}\big), \tag{1}\]

where $\mathcal{C}(\mathbf{x}^{\prime})_{y}$ indicates the $y$-th element of the classifier output. In the C&W objective, we always have $\mathcal{L}(\mathbf{x}^{\prime})\geq 0$. The minimum occurs when $\mathcal{C}(\mathbf{x}^{\prime})_{y}\leq\max_{c\neq y}\mathcal{C}(\mathbf{x}^{\prime})_{c}$, which is an indication that our classifier has been fooled. Thus, finding an adversarial example for the input data $\mathbf{x}$ can be written as [33]:

\[\mathbf{x}_{\mathrm{adv}}=\operatorname*{arg\,min}_{\mathbf{x}^{\prime}\in\mathcal{S}(\mathbf{x})}\mathcal{L}(\mathbf{x}^{\prime}). \tag{2}\]

Here, $\mathcal{S}(\mathbf{x})$ denotes a set that contains data appropriately similar to $\mathbf{x}$. For example, it is common to define

\[\mathcal{S}(\mathbf{x})=\left\{\mathbf{x}^{\prime}\in\mathcal{X}^{d}~\big|~\left\lVert\mathbf{x}^{\prime}-\mathbf{x}\right\rVert_{p}\leq\epsilon_{\max}\right\} \tag{3}\]

for image data. In this paper, we define $\mathcal{S}(\mathbf{x})$ as in Eq. (3), since we deal with the application of our attack to images.

¹ Note that although we define our objective function $\mathcal{L}(\mathbf{x}^{\prime})$ for un-targeted adversarial attacks, it can easily be modified for targeted attacks. To this end, it suffices to replace $\max_{c\neq y}\log\mathcal{C}(\mathbf{x}^{\prime})_{c}$ in Eq. (1) with $\log\mathcal{C}(\mathbf{x}^{\prime})_{t}$, where $t$ denotes the target class. In this paper, we only consider un-targeted attacks.
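As a minimal illustration of Eqs. (1)-(3), the sketch below computes the un-targeted C&W loss from a classifier's softmax output and projects a candidate onto the $\ell_{\infty}$ ball around $\mathbf{x}$; the function names and the value of $\epsilon_{\max}$ are illustrative choices, not part of the formal definitions.

```python
import numpy as np

def cw_loss(probs, y):
    """Un-targeted C&W loss of Eq. (1); probs is the classifier's softmax output."""
    logp = np.log(probs + 1e-12)
    best_other = np.max(np.delete(logp, y))      # max_{c != y} log C(x')_c
    return max(0.0, logp[y] - best_other)        # reaches zero once the classifier is fooled

def project_linf(x_adv, x, eps=8 / 255):
    """Project x_adv onto S(x) of Eq. (3) for the l_inf norm, keeping pixels in [0, 1]."""
    return np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)
```

An attack is considered successful as soon as the projected candidate drives `cw_loss` to zero.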

3.2 Flow-based Modeling

Normalizing Flows.

Normalizing flows (NF) [54, 7, 44] are a family of generative models that aim at modeling the probability distribution of a given dataset. To this end, they make use of the well-known change of variables formula. In particular, let $\mathbf{Z}\in\mathbb{R}^{d}$ denote a random vector with a straightforward, known distribution such as uniform or standard normal. The change of variables formula states that if we apply an invertible and differentiable transformation $\mathbf{f}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ on $\mathbf{Z}$ to obtain a new random vector $\mathbf{X}\in\mathbb{R}^{d}$, the relationship between their corresponding distributions can be written as:

\[p(\mathbf{x})=p(\mathbf{z})\left|\det\left(\dfrac{\partial\mathbf{f}}{\partial\mathbf{z}}\right)\right|^{-1}. \tag{4}\]

Here, $p(\mathbf{x})$ and $p(\mathbf{z})$ denote the probability distributions of $\mathbf{X}$ and $\mathbf{Z}$, respectively. Moreover, the multiplicative term on the right-hand side is the absolute value of the Jacobian determinant. This term accounts for the changes in the volume of $\mathbf{Z}$ due to applying the transformation $\mathbf{f}(\cdot)$. Flow-based methods model the transformation $\mathbf{f}(\cdot)$ using stacked layers of invertible neural networks (INN). They then apply this transformation to a base random vector $\mathbf{Z}$ to model the data density. In this paper, we assume that the base random vector has a standard normal distribution.
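As a self-contained numerical illustration of Eq. (4), the sketch below pushes a standard normal sample through a toy elementwise affine map $\mathbf{f}(\mathbf{z})=\mathbf{a}\odot\mathbf{z}+\mathbf{b}$ and evaluates the resulting log-density; the map and its parameters are purely illustrative and are not the INN architecture used later in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])   # toy invertible map f(z) = a * z + b

z = rng.standard_normal(2)                            # base sample, Z ~ N(0, I)
x = a * z + b                                         # transformed sample X = f(Z)

# Eq. (4): p(x) = p(z) |det(df/dz)|^{-1}, with df/dz = diag(a)
log_p_z = norm.logpdf(z).sum()
log_abs_det = np.log(np.abs(a)).sum()
log_p_x = log_p_z - log_abs_det
print(log_p_x)
```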

Maximum Likelihood Estimation.

To fit the parameters of INNs to the i.i.d. data observations $\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}$, NFs use the following maximum likelihood objective [44]:

\[\boldsymbol{\theta}^{*}=\operatorname*{arg\,max}_{\boldsymbol{\theta}}\ \dfrac{1}{n}\sum_{i=1}^{n}\log p_{\boldsymbol{\theta}}(\mathbf{x}_{i}). \tag{5}\]

Here, $\boldsymbol{\theta}$ denotes the parameter set of the model, and $p_{\boldsymbol{\theta}}$ is the density defined in Eq. (4). Note that INNs should be designed such that they allow for efficient computation of their Jacobian determinant. Otherwise, given the cubic complexity of determinant computation, this issue can severely hinder the application of NFs to high-dimensional data. For a more detailed review of normalizing flows, we refer the interested reader to [41, 28] and the references therein.
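A minimal PyTorch sketch of this maximum-likelihood objective is shown below for a single toy affine coupling layer on 2-D data; the architecture, data, and hyperparameters are illustrative placeholders and not the Real NVP variant [1, 8] used in our experiments.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling layer mapping data x to latent z with a cheap log-det Jacobian."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))          # predicts scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                                      # keep scales well behaved
        z = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
        return z, s.sum(dim=-1)                                # log |det(dz/dx)|

def log_prob(flow, x):
    """Change of variables (cf. Eq. (4)) with a standard normal base density."""
    z, log_det = flow(x)
    base = (-0.5 * z ** 2 - 0.5 * torch.log(torch.tensor(2 * torch.pi))).sum(dim=-1)
    return base + log_det

# Eq. (5): maximize the average log-likelihood of the (toy) training data.
flow = AffineCoupling(dim=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
data = torch.randn(512, 2) @ torch.tensor([[1.0, 0.8], [0.0, 0.6]])   # correlated toy data
for _ in range(200):
    optimizer.zero_grad()
    loss = -log_prob(flow, data).mean()                        # negative log-likelihood
    loss.backward()
    optimizer.step()
```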

Training the Flow-based Models.

We assume that we have access to some training data from the same domain to pre-train our flow-based model. However, at the time of adversarial example generation, we use unseen test data. Note that while in our experiments we use the same training data as the classifier itself, we are not obliged to do so; we observed that our results remain almost the same even if we separate the flow-based model's training data from what the classifier is trained on. We argue that using the same training data is valid since our flow-based models do not extract discriminative features, and they are only trained on clean data. This is in contrast to other generative approaches used for adversarial example generation [2, 61, 56, 20]. In Appendix C.5, we empirically show that not only is this statement accurate, but we can also get almost the same performance by using datasets similar to the true one.

3.3 Natural Evolution Strategies and $\mathcal{N}$Attack

Natural Evolution Strategies (NES).

Our goal is to solve the optimization problem of Eq. (2) in a black-box setting, meaning that we only have access to the inputs and outputs of the classifier $\mathcal{C}(\cdot)$. Natural Evolution Strategies (NES) use the idea of search gradients to optimize Eq. (2) [59, 58]. To this end, a so-called search distribution is first defined, and then the expected value of the original objective is optimized under this distribution.

In particular, let $p(\mathbf{x}^{\prime}|\boldsymbol{\psi})$ denote the search distribution with parameters $\boldsymbol{\psi}$. Then, in NES we aim to minimize

\[J(\boldsymbol{\psi})=\mathbb{E}_{p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}\left[\mathcal{L}(\mathbf{x}^{\prime})\right] \tag{6}\]

over $\boldsymbol{\psi}$ as a surrogate for $\mathcal{L}(\mathbf{x}^{\prime})$ [58]. To minimize Eq. (6) using gradient descent, one needs to compute the Jacobian of $J(\boldsymbol{\psi})$ with respect to $\boldsymbol{\psi}$. To this end, NES makes use of the “log-likelihood trick” [58] (see Appendix A.1):

\[\nabla_{\boldsymbol{\psi}}J(\boldsymbol{\psi})=\mathbb{E}_{p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}\left[\mathcal{L}(\mathbf{x}^{\prime})\,\nabla_{\boldsymbol{\psi}}\log p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\right]. \tag{7}\]

Finally, the parameters of the model are updated using a gradient descent step with learning rate $\alpha$:²

\[\boldsymbol{\psi}\leftarrow\boldsymbol{\psi}-\alpha\nabla_{\boldsymbol{\psi}}J(\boldsymbol{\psi}). \tag{8}\]

² Note that this update procedure is not precisely what NES stands for. It is rather a canonical gradient search algorithm, as Wierstra et al., [58] call it, which only makes use of a vanilla gradient [59] for evolution strategies. In fact, the natural term in natural evolution strategies refers to an update of the form $\boldsymbol{\psi}\leftarrow\boldsymbol{\psi}-\alpha\tilde{\nabla}_{\boldsymbol{\psi}}J(\boldsymbol{\psi})$, where $\tilde{\nabla}_{\boldsymbol{\psi}}J(\boldsymbol{\psi})=\mathbf{F}^{-1}\nabla_{\boldsymbol{\psi}}J(\boldsymbol{\psi})$ is called the natural gradient. Here, the matrix $\mathbf{F}$ is the Fisher information matrix of the search distribution $p(\mathbf{x}^{\prime}|\boldsymbol{\psi})$. However, since NES in the adversarial learning literature [22, 23, 33] refers to Eq. (8), we use the same convention here.

$\mathcal{N}$Attack.

To find an adversarial example for an input $\mathbf{x}$, $\mathcal{N}$Attack [33] tries to find a distribution $p(\mathbf{x}^{\prime}|\boldsymbol{\psi})$ over the set of legitimate adversaries $\mathcal{S}(\mathbf{x})$ in Eq. (3). Therefore, it models $\mathbf{x}^{\prime}$ as

\[\mathbf{x}^{\prime}=\mathrm{proj}_{\mathcal{S}}\big(\tfrac{1}{2}(\tanh(\mathbf{z})+1)\big), \tag{9}\]

where $\mathbf{z}\sim\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)$ follows an isotropic normal distribution with mean $\boldsymbol{\mu}$ and standard deviation $\sigma$. Moreover, $\mathrm{proj}_{\mathcal{S}}(\cdot)$ projects its input back into the set of legitimate adversaries $\mathcal{S}(\mathbf{x})$. Li et al., [33] define their model parameters as $\boldsymbol{\psi}=\{\boldsymbol{\mu},\sigma\}$. Then, they find $\sigma$ using grid search and update $\boldsymbol{\mu}$ by the rule of Eq. (8), exploiting NES.
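To make the recipe concrete, the sketch below performs a single NES update of $\boldsymbol{\mu}$ under the $\mathcal{N}$Attack parameterization of Eq. (9); the black-box query `loss_fn`, the $\ell_{\infty}$ projection, and all hyperparameter values are illustrative assumptions (the original method also normalizes the sampled losses).

```python
import numpy as np

def nattack_step(loss_fn, x, mu, sigma=0.1, lr=0.02, pop=20, eps=8 / 255):
    """One NES update of mu; x and mu share the image shape, loss_fn is the C&W loss."""
    grad = np.zeros_like(mu)
    for _ in range(pop):
        u = np.random.randn(*mu.shape)
        z = mu + sigma * u                                        # z ~ N(mu, sigma^2 I)
        x_adv = 0.5 * (np.tanh(z) + 1.0)                          # map to [0, 1], Eq. (9)
        x_adv = np.clip(np.clip(x_adv, x - eps, x + eps), 0, 1)   # proj onto S(x)
        grad += loss_fn(x_adv) * u / sigma                        # log-likelihood trick, Eq. (7)
    return mu - lr * grad / pop                                   # gradient step, Eq. (8)
```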

3.4 Our Approach: AdvFlow

Recently, there has been some effort to distinguish adversarial examples from clean data. The primary assumption of these methods is often that the adversaries come from a different distribution than the data itself; for instance, see [37, 32, 64]. Thus, to come up with more powerful adversarial attacks, it seems reasonable to construct adversaries that have a similar distribution to the clean data. To this end, we propose AdvFlow: a black-box adversarial attack that seeks to build inconspicuous adversaries by leveraging the power of normalizing flows (NF) in exact likelihood modeling of the data [8].

Let $\mathbf{f}(\cdot)$ denote a pre-trained, invertible and differentiable NF model trained on the clean training data. To reach our goal of decreasing the attack's detectability, we propose using this pre-trained, fixed NF transformation to model the adversaries. In analogy with Eq. (9), we assume that our adversarial example comes from a distribution modeled by

\[\mathbf{x}^{\prime}=\mathrm{proj}_{\mathcal{S}}\big(\mathbf{f}(\mathbf{z})\big),\qquad\mathbf{z}\sim\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I), \tag{10}\]

where $\mathrm{proj}_{\mathcal{S}}(\cdot)$ is a projection rule that keeps the generated examples in the set of legitimate adversaries $\mathcal{S}(\mathbf{x})$. By the change of variables formula of Eq. (4), we know that $\mathbf{f}(\mathbf{z})$ in Eq. (10) is distributed similarly to the clean data. The only difference is that the base density is transformed by an affine mapping, i.e., from $\mathcal{N}(\mathbf{z}|\mathbf{0},I)$ to $\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)$. This small adjustment can result in an overall distribution whose samples are likely to be adversarial.

Putting the rule of the lazy statistician [57] together with our attack definition in Eq. (10), we can write down the objective function of Eq. (6) as

\[J(\boldsymbol{\mu},\sigma)=\mathbb{E}_{p(\mathbf{x}^{\prime}|\boldsymbol{\mu},\sigma)}\left[\mathcal{L}(\mathbf{x}^{\prime})\right]=\mathbb{E}_{\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)}\left[\mathcal{L}\Big(\mathrm{proj}_{\mathcal{S}}\big(\mathbf{f}(\mathbf{z})\big)\Big)\right]. \tag{11}\]

As in $\mathcal{N}$Attack [33], we consider $\sigma$ to be a hyperparameter (indeed, we could also optimize $\sigma$ alongside $\boldsymbol{\mu}$ to enhance the attack strength; however, since $\mathcal{N}$Attack [33] only optimizes $\boldsymbol{\mu}$, we stick with the same setting). Thus, we only need to minimize $J(\boldsymbol{\mu},\sigma)$ with respect to $\boldsymbol{\mu}$. Using the “log-likelihood trick” of Eq. (7), we can derive the Jacobian of $J(\boldsymbol{\mu},\sigma)$ as

\[\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma)=\mathbb{E}_{\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)}\left[\mathcal{L}\Big(\mathrm{proj}_{\mathcal{S}}\big(\mathbf{f}(\mathbf{z})\big)\Big)\nabla_{\boldsymbol{\mu}}\log\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)\right]. \tag{12}\]

This expectation can then be estimated by drawing samples from $\mathcal{N}(\mathbf{z}|\boldsymbol{\mu},\sigma^{2}I)$ and forming their sample average. Next, we update the parameter $\boldsymbol{\mu}$ by performing a gradient descent step

\[\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\alpha\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma). \tag{13}\]

In the end, we generate our adversarial example by sampling from Eq. (10).
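The following condensed PyTorch sketch ties Eqs. (10)-(13) together. It assumes a pre-trained flow object exposing `forward` (latent to image) and `inverse` (image to latent) methods and a black-box `loss_fn` that returns the C&W loss of Eq. (1); these names, the loss normalization, and all hyperparameter values are illustrative rather than a verbatim copy of Algorithm 1 in Appendix D.1.

```python
import torch

def advflow_attack(flow, loss_fn, x, eps=8 / 255, sigma=0.1, lr=0.02,
                   pop=20, max_queries=10_000):
    """Black-box AdvFlow sketch: NES over the mean of a Gaussian in the flow's latent space."""
    mu = flow.inverse(x).detach()               # start the search at the clean image's latent code
    queries = 0
    while queries < max_queries:
        u = torch.randn(pop, *mu.shape)
        z = mu.unsqueeze(0) + sigma * u         # z_i ~ N(mu, sigma^2 I)
        x_adv = flow.forward(z)                 # Eq. (10): candidate adversaries from the flow
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
        losses = torch.tensor([loss_fn(xa) for xa in x_adv])    # black-box C&W loss queries
        queries += pop
        if losses.min() <= 0:                   # C&W loss of zero means the classifier is fooled
            return x_adv[losses.argmin()]
        fit = (losses - losses.mean()) / (losses.std() + 1e-8)       # normalized fitness values
        grad = (fit.view(-1, *[1] * mu.dim()) * u).mean(0) / sigma   # Eqs. (7)/(12)
        mu = mu - lr * grad                     # Eq. (13)
    return None                                 # query budget exhausted without success
```

Note that in our experiments we check whether an adversarial point has been reached only every 200 queries, which is why the query medians reported in Table 3 are multiples of 200.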

Practical Considerations.

To help our model start its search from an appropriate point, we first transform the clean data to its latent space representation. Then, we aim to find a small additive latent space perturbation in the form of a normal distribution. Moreover, as suggested in [33], instead of working with $\mathcal{L}\big(\mathrm{proj}_{\mathcal{S}}\big(\mathbf{f}(\mathbf{z})\big)\big)$ directly, we normalize the loss values so that they have zero mean and unit variance, which helps AdvFlow converge faster. Finally, among different flow-based models, it is preferable to choose those that have a straightforward inverse, such as [8, 27, 11, 10]. This way, we can efficiently go back and forth between the original data and their base distribution representation. Algorithm 1 in Appendix D.1 summarizes our black-box attack method. Other variations of AdvFlow can also be found in Appendix D, including our solution for high-resolution images and an investigation of un-trained AdvFlow.

3.4.1 AdvFlow Interpretation

We can interpret AdvFlow from two different perspectives.

First, there exists a probabilistic view: we use the flow-based model transformation of the original data and then adjust it using an affine transformation on its base distribution. The amount of this change is determined by the objective of minimizing the expected C&W cost of Eq. (1). Thus, if the optimization is successful, we end up with a distribution whose samples are likely to be adversarial. Meanwhile, since this distribution is initialized with that of the clean data, it closely resembles the clean data density.

Second, we can think of AdvFlow as a search over the latent space of flow-based models. We map the clean image to the latent space and then search in the vicinity of that point to find an adversarial example. This search is guided by the objective of Eq. (11). Since our approach exploits a fully invertible, pre-trained flow-based model, we would expect to get an adversarial example that resembles the original image in structure and looks less noisy. This gives our model the flexibility to produce perturbations that take the structure of the clean data into account (see Figure 1).

3.4.2 Uniqueness of AdvFlow Perturbations

In this section, we present a lemma about the nature of the perturbations generated by AdvFlow and $\mathcal{N}$Attack [33]. As a direct result of this lemma, we can easily deduce that the adversaries generated by AdvFlow can be approximated by a normal distribution whose components are dependent. However, this is not the case for $\mathcal{N}$Attack, as its perturbations always have independent elements. In this sense, we can rigorously conclude that the AdvFlow perturbations are unique and cannot be generated by $\mathcal{N}$Attack. Thus, we cannot expect $\mathcal{N}$Attack to generate perturbations that look like the original data. This result also generalizes to many other attack methods, as they often use an additive, independent perturbation. Proofs can be found in Appendix A.2.

Lemma 3.1.

Let $\mathbf{f}(\mathbf{x})$ be an invertible, differentiable function. For a small perturbation $\boldsymbol{\delta}_{z}$ we have

\[\boldsymbol{\delta}=\mathbf{f}\big(\mathbf{f}^{-1}(\mathbf{x})+\boldsymbol{\delta}_{z}\big)-\mathbf{x}\approx\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\boldsymbol{\delta}_{z}.\]
Corollary 3.1.1.

The adversarial perturbations generated by AdvFlow have dependent components. In contrast, $\mathcal{N}$Attack perturbation components are independent.
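A first-order sketch of the argument behind Lemma 3.1 and its corollary (the full proofs are in Appendix A.2): writing $\mathbf{z}=\mathbf{f}^{-1}(\mathbf{x})$, so that $\mathbf{f}(\mathbf{z})=\mathbf{x}$, and Taylor-expanding $\mathbf{f}$ around $\mathbf{z}$ gives

\[\boldsymbol{\delta}=\mathbf{f}(\mathbf{z}+\boldsymbol{\delta}_{z})-\mathbf{f}(\mathbf{z})\approx\nabla\mathbf{f}(\mathbf{z})\,\boldsymbol{\delta}_{z}=\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\boldsymbol{\delta}_{z},\]

where the last equality follows from the inverse function theorem. Since the latent perturbation $\boldsymbol{\delta}_{z}$ is Gaussian with covariance $\sigma^{2}I$ in AdvFlow, $\boldsymbol{\delta}$ is approximately Gaussian with covariance $\sigma^{2}\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-\top}$, which is generally non-diagonal. In contrast, the $\mathcal{N}$Attack perturbation is an elementwise transformation of independent Gaussian coordinates, so its components remain independent.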

4 Experimental Results

In this section, we present our experimental results. First, we see how the adversarial examples generated by the proposed model can successfully mislead adversarial example detectors. Then, we show the attack success rate and the number of queries required to attack both vanilla and defended models. Finally, we examine the transferability of the generated attacks between defended classifiers. To see the details of the experiments, please refer to Appendix B. Also, more simulation results can be found in Appendices C and D.

For each dataset, we pre-train a flow-based model and fix it across the experiments. To this end, we use a modified version of Real NVP [8] as introduced in [1], the details of which can be found in Appendix B.1. Once trained, we then try to attack target classifiers in a black-box setting using AdvFlow (Algorithm 1).

4.1 Detectability

One approach to defending pre-trained classifiers is to employ adversarial example detectors. In this setting, a detector is trained and placed on top of the classifier, and every input is first checked to determine whether it is adversarial before being fed to the un-defended classifier. One common assumption among such detectors is that the adversaries come from a different distribution than the clean data [37, 32]. Thus, the performance of these detectors is a suitable measure for quantifying the success of our model in generating adversarial examples that have the same distribution as the original data. To this end, we choose the LID [37], Mahalanobis [32], and Res-Flow [64] adversarial attack detectors to assess the performance of the proposed approach. For a fair comparison, we compare our results with $\mathcal{N}$Attack [33], which also approaches black-box adversarial attacks from a distributional perspective. As an ablation study, we also consider an un-trained version of AdvFlow where the weights of the NF model are set randomly. This way, we can observe more precisely the effect of the clean data distribution in misleading adversarial example detectors. We first generate a set of adversarial examples alongside some noisy ones using the test set. Then, we use 10% of the adversarial, noisy, and clean image data to train the adversarial attack detectors. Details of our experiments in this section can be found in Appendix B.2.

We report the area under the receiver operating characteristic curve (AUROC) and the detection accuracy for each case in Table 1. As seen, in almost all cases the selected adversarial detectors struggle to detect the attacks generated by AdvFlow in contrast to $\mathcal{N}$Attack. These results support our earlier statement that the distribution of the attacks is more similar to that of the data, hence the failure of adversarial example detectors. Also, we see that pre-training the AdvFlow using clean data is crucial in fooling adversarial example detectors.

Finally, Figure 2 shows the relative change in the base distribution of the flow-based model for the adversarial examples of Table 1. Interestingly, we see that AdvFlow adversaries are distinctively closer to the clean data compared to $\mathcal{N}$Attack [33]. These results highlight the need to reconsider the underlying assumption that adversaries come from a different distribution than the clean data. They can also motivate training classifiers that learn data distributions, as our results reveal this is not currently the case.

Table 1: Area under the receiver operating characteristic curve (AUROC) and accuracy of detecting adversarial examples generated by $\mathcal{N}$Attack [33] and AdvFlow (un. for un-trained and tr. for pre-trained NF) using the LID [37], Mahalanobis [32], and Res-Flow [64] adversarial attack detectors. In each case, the classifier has a ResNet-34 [18] architecture.

| Data | Detector | AUROC (%) ↑: $\mathcal{N}$Attack | AdvFlow (un.) | AdvFlow (tr.) | Detection Acc. (%) ↑: $\mathcal{N}$Attack | AdvFlow (un.) | AdvFlow (tr.) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | LID [37] | 78.69 | 84.39 | $\mathbf{57.59}$ | 72.12 | 77.11 | $\mathbf{55.74}$ |
| CIFAR-10 | Mahalanobis [32] | 97.95 | 99.50 | $\mathbf{66.85}$ | 95.59 | 97.46 | $\mathbf{62.21}$ |
| CIFAR-10 | Res-Flow [64] | 97.90 | 99.40 | $\mathbf{67.03}$ | 94.55 | 97.21 | $\mathbf{62.60}$ |
| SVHN | LID [37] | $\mathbf{57.70}$ | 58.92 | 61.11 | $\mathbf{55.60}$ | 56.43 | 58.21 |
| SVHN | Mahalanobis [32] | 73.17 | 74.67 | $\mathbf{64.72}$ | 68.20 | 69.46 | $\mathbf{60.88}$ |
| SVHN | Res-Flow [64] | 69.70 | 74.86 | $\mathbf{64.68}$ | 64.53 | 68.41 | $\mathbf{61.13}$ |
[Figure 2: two plots omitted.]
Figure 2: Relative change in the base distribution of the flow-based model for adversarial examples generated by AdvFlow and $\mathcal{N}$Attack for the CIFAR-10 [30] (left) and SVHN [40] (right) classifiers of Table 1.

4.2 Success Rate and Number of Queries

Next, we investigate the performance of the proposed model in attacking vanilla and defended image classifiers. It was previously shown that adversarially trained models are harder for $\mathcal{N}$Attack to break than any other defense [33]. Thus, we select some of the most recent defense techniques that are built upon adversarial training [38]. This selection also helps us quantify attack transferability in our next experiment. In particular, we select Free [48] and Fast [60] adversarial training, alongside adversarial training with auxiliary rotations [19], as the defense mechanisms that our classifiers employ. For a brief explanation of each of these methods, please refer to Appendix B.3. We then train target classifiers on the CIFAR-10 [30] and SVHN [40] datasets. The architecture we use here is the well-known Wide-ResNet-32 [63] with width 10. We then attack these classifiers by generating adversarial examples on the test set. We compare our proposed model with bandits with time- and data-dependent priors [23], $\mathcal{N}$Attack [33], and SimBA [16].⁴ To simulate a realistic environment, we set the maximum number of queries to 10,000. Moreover, for $\mathcal{N}$Attack and AdvFlow we use a population size of 20. More details on the defense methods as well as the attack hyperparameters can be found in Appendices B.3 and B.4.

⁴ Note that SimBA [16] is originally designed for efficient $\ell_{2}$ attacks, and it may not use the entire 10,000-query quota for small images. Nevertheless, we include SimBA [16] in the paper as per one of the reviewers' suggestions.

Tables 2 and 3 show the success rate as well as the average and median number of queries required to successfully attack a vanilla/defended classifier. Also, Figure 3 in Appendix C shows the attack success rate for AdvFlow and $\mathcal{N}$Attack [33] versus the maximum number of queries for defended models. As can be seen, AdvFlow improves upon the performance of $\mathcal{N}$Attack [33] on all of the defended models in terms of both the number of queries and the attack success rate. Also, it should be noted that although our performance on vanilla classifiers is worse than that of $\mathcal{N}$Attack [33], we still generate adversaries that are not easily detectable by adversarial example detectors and that come from a distribution similar to the clean data.

Table 2: Attack success rate of black-box adversarial attacks on CIFAR-10 [30] and SVHN [40] Wide-ResNet-32 [63] classifiers. All attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$.

| Data | Defense | Acc. (%) | Success Rate (%) ↑: Bandits [23] | $\mathcal{N}$Attack [33] | SimBA [16] | AdvFlow |
|---|---|---|---|---|---|---|
| CIFAR-10 | Vanilla [63] | 91.77 | 98.81 | $\mathbf{100}$ | 99.99 | 99.42 |
| CIFAR-10 | FreeAdv [48] | 81.29 | 37.12 | 38.97 | 35.52 | $\mathbf{41.21}$ |
| CIFAR-10 | FastAdv [60] | 86.33 | 36.60 | 36.90 | 35.07 | $\mathbf{40.22}$ |
| CIFAR-10 | RotNetAdv [19] | 86.58 | 37.73 | 38.04 | 35.63 | $\mathbf{40.67}$ |
| SVHN | Vanilla [63] | 96.45 | 87.84 | $\mathbf{98.76}$ | 97.26 | 90.31 |
| SVHN | FreeAdv [48] | 86.47 | 49.64 | 50.28 | 46.28 | $\mathbf{50.76}$ |
| SVHN | FastAdv [60] | 93.90 | 40.43 | 35.42 | 36.19 | $\mathbf{41.49}$ |
| SVHN | RotNetAdv [19] | 90.33 | 43.47 | 41.49 | 39.01 | $\mathbf{44.22}$ |
Table 3: Average (median) number of queries needed to generate an adversarial example for the CIFAR-10 [30] and SVHN [40] Wide-ResNet-32 [63] classifiers of Table 2. For a fair comparison, we first find the samples where all attack methods are successful, and then compute the average (median) number of queries over these samples. Note that for $\mathcal{N}$Attack and AdvFlow we check whether we have arrived at an adversarial point every 200 queries, and hence the medians are multiples of 200.

Query Average (Median) on Mutually Successful Attacks ↓

| Data | Defense | Bandits [23] | $\mathcal{N}$Attack [33] | SimBA [16] | AdvFlow |
|---|---|---|---|---|---|
| CIFAR-10 | Vanilla [63] | 552.69 (182) | $\mathbf{237.58}$ (200) | 237.70 ($\mathbf{126}$) | 949.31 (400) |
| CIFAR-10 | FreeAdv [48] | 1062.7 (354) | 874.91 (400) | 463.09 (244) | $\mathbf{421.63}$ ($\mathbf{200}$) |
| CIFAR-10 | FastAdv [60] | 1065.92 (358) | 973.05 (400) | $\mathbf{428.81}$ (234) | 436.8 ($\mathbf{200}$) |
| CIFAR-10 | RotNetAdv [19] | 1085.43 (408) | 941.67 (400) | 471.99 (259) | $\mathbf{424.95}$ ($\mathbf{200}$) |
| SVHN | Vanilla [63] | 1750.65 (1128) | 408.75 (200) | $\mathbf{202.07}$ ($\mathbf{107}$) | 1572.24 (600) |
| SVHN | FreeAdv [48] | 819.98 (250) | 903.12 (400) | $\mathbf{365.42}$ (216) | 692.73 ($\mathbf{200}$) |
| SVHN | FastAdv [60] | 755.23 (284) | 1243.38 (600) | $\mathbf{307.73}$ (216) | 526.37 ($\mathbf{200}$) |
| SVHN | RotNetAdv [19] | 663.07 (202) | 756.48 (400) | $\mathbf{319.93}$ ($\mathbf{186}$) | 480.02 (200) |

4.3 Transferability

Finally, we examine the transferability of the generated attacks for each of the classifiers in Table 2. In other words, we generate attacks using a substitute classifier and then try to attack another target model. The results of this experiment are shown in Figure 4 of Appendix C. As seen, the attacks generated by AdvFlow transfer to other defended models more easily than to the vanilla one. This observation precisely matches our intuition about the mechanics of AdvFlow. More specifically, we know that in AdvFlow the model learns a distribution that is more expressive than the one used by $\mathcal{N}$Attack. Also, we saw in Section 3.4.2 that the perturbations generated by AdvFlow have dependent elements, in contrast to $\mathcal{N}$Attack. As a result, AdvFlow learns to attack classifiers using higher-level features (Figure 1). Thus, since vanilla classifiers use different features for classification than the defended ones, AdvFlow attacks are less transferable from defended models to vanilla ones. In contrast, the expressiveness of AdvFlow enables the attacks to transfer more successfully between adversarially trained classifiers, and from vanilla to defended ones.

5 Conclusion and Future Directions

In this paper, we introduced AdvFlow: a novel adversarial attack model that utilizes the capacity of normalizing flows in representing data distributions. We saw that the adversarial perturbations generated by the proposed approach can be approximated by normal distributions with dependent components, which $\mathcal{N}$Attack [33] cannot generate. As a result, AdvFlow adversaries are less conspicuous to adversarial example detectors than their $\mathcal{N}$Attack [33] counterparts. This success is due to AdvFlow being pre-trained on the data distribution, resulting in adversaries that look like the clean data. We also saw that the proposed method improves upon the performance of bandits [23], $\mathcal{N}$Attack [33], and SimBA [16] on adversarially trained classifiers, in terms of both attack success rate and the number of queries.

Flow-based modeling is an active area of research, and numerous extensions of the current work could be investigated as normalizing flow models grow in range and power. For example, while $\mathcal{N}$Attack [33] and other similar approaches [22, 23] are specifically designed for image data, the current work can potentially be extended to other forms of data such as graphs [35, 49]. Also, since normalizing flows can effectively model probability distributions, finding the distribution of well-known perturbations may help increase classifier robustness against adversarial examples. We hope that this work can provide a stepping stone to exploiting such powerful models for adversarial machine learning.

Broader Impact

In this paper, we introduce a novel adversarial attack algorithm called AdvFlow. It uses pre-trained normalizing flows to generate adversarial examples. This study is crucial as it indicates the vulnerability of deep neural network (DNN) classifiers to adversarial attacks.

More precisely, our study reveals that the common assumption made by adversarial example detectors (such as the Mahalanobis detector [32]) that adversaries come from a different distribution than the data may not be accurate. In particular, we show that we can generate adversaries that come from a distribution close to that of the data, yet still mislead the classifier's decision. Thus, we emphasize that adversarial example detectors need to adjust their assumptions about the distribution of adversaries before being deployed in real-world situations.

Furthermore, since our adversarial examples are closely related to the data distribution, our method shows that DNN classifiers are not learning to classify the data based on its underlying distribution; otherwise, they would have resisted the attacks generated by AdvFlow. Thus, this work may draw the attention of the machine learning community toward training DNN classifiers in a distributional sense.

All in all, we pinpoint a failure of DNN classifiers for the rest of the community, so that they can become familiar with the limitations of the status quo. This study, and similar ones, could raise awareness among researchers about the real-world pitfalls of DNN classifiers, with the aim of consolidating them against such threats in the future.

Acknowledgments and Disclosure of Funding

We would like to thank the reviewers for their valuable feedback on our work, helping us to improve the final manuscript. We also would like to thank the authors and maintainers of PyTorch [43], NumPy [17], and Matplotlib [21].

This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.

References

  • Ardizzone et al., [2019] Ardizzone, L., Lüth, C., Kruse, J., Rother, C., and Köthe, U. (2019). Guided image generation with conditional invertible neural networks. CoRR, abs/1907.02392.
  • Baluja and Fischer, [2018] Baluja, S. and Fischer, I. (2018). Learning to attack: Adversarial transformation networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 2687–2695.
  • Bhambri et al., [2019] Bhambri, S., Muku, S., Tulasi, A., and Buduru, A. B. (2019). A survey of black-box adversarial attacks on computer vision models. CoRR, abs/1912.01667.
  • Carlini and Wagner, [2017] Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57.
  • Chen et al., [2017] Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., and Hsieh, C.-J. (2017). ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26.
  • Coleman et al., [2017] Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. (2017). DAWNBench: An end-to-end deep learning benchmark and competition. NeurIPS ML Systems Workshop.
  • Dinh et al., [2015] Dinh, L., Krueger, D., and Bengio, Y. (2015). NICE: non-linear independent components estimation. In Workshop Track Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Dinh et al., [2017] Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real NVP. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
  • [9] Dolatabadi, H. M., Erfani, S. M., and Leckie, C. (2020a). Black-box adversarial example generation with normalizing flows. In Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 37th International Conference on Machine Learning (ICML).
  • [10] Dolatabadi, H. M., Erfani, S. M., and Leckie, C. (2020b). Invertible generative modeling using linear rational splines. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4236–4246.
  • Durkan et al., [2019] Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural spline flows. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 7511–7522.
  • Eykholt et al., [2018] Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. (2018). Robust physical-world attacks on deep learning visual classification. In Proceeding of the 2018 IEEE Conference on Computer Vision and Pattern Recognition CVPR, pages 1625–1634.
  • Gidaris et al., [2018] Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
  • Goodfellow et al., [2015] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • [15] Guo, C., Frank, J. S., and Weinberger, K. Q. (2019a). Low frequency adversarial perturbation. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), page 411.
  • [16] Guo, C., Gardner, J. R., You, Y., Wilson, A. G., and Weinberger, K. Q. (2019b). Simple black-box adversarial attacks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 2484–2493.
  • Harris et al., [2020] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del R’ıo, J. F., Wiebe, M., Peterson, P., G’erard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362.
  • He et al., [2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
  • Hendrycks et al., [2019] Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. (2019). Using self-supervised learning can improve model robustness and uncertainty. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 15637–15648.
  • Huang and Zhang, [2020] Huang, Z. and Zhang, T. (2020). Black-box adversarial attack with transferable model-based embedding. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Hunter, [2007] Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.
  • Ilyas et al., [2018] Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. (2018). Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 2142–2151.
  • Ilyas et al., [2019] Ilyas, A., Engstrom, L., and Madry, A. (2019). Prior convictions: Black-box adversarial attacks with bandits and priors. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
  • Jacobsen et al., [2018] Jacobsen, J., Smeulders, A. W. M., and Oyallon, E. (2018). i-RevNet: Deep invertible networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
  • Jiang et al., [2019] Jiang, L., Ma, X., Chen, S., Bailey, J., and Jiang, Y. (2019). Black-box adversarial attacks on video recognition models. In Proceedings of the 27th ACM International Conference on Multimedia, pages 864–872.
  • Kingma and Ba, [2015] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Kingma and Dhariwal, [2018] Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 10236–10245.
  • Kobyzev et al., [2020] Kobyzev, I., Prince, S., and Brubaker, M. A. (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Kolter and Madry, [2018] Kolter, Z. and Madry, A. (2018). Adversarial robustness: Theory and practice. https://adversarial-ml-tutorial.org/. Tutorial in the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS).
  • Krizhevsky and Hinton, [2009] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto.
  • Laidlaw and Feizi, [2019] Laidlaw, C. and Feizi, S. (2019). Functional adversarial attacks. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 10408–10418.
  • Lee et al., [2018] Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 7167–7177.
  • Li et al., [2019] Li, Y., Li, L., Wang, L., Zhang, T., and Gong, B. (2019). NATTACK: learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3866–3876.
  • Liang et al., [2018] Liang, B., Li, H., Su, M., Bian, P., Li, X., and Shi, W. (2018). Deep text classification can be fooled. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages 4208–4215.
  • Liu et al., [2019] Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. (2019). Graph normalizing flows. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 13556–13566.
  • Liu et al., [2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 3730–3738.
  • Ma et al., [2018] Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S. N. R., Schoenebeck, G., Song, D., Houle, M. E., and Bailey, J. (2018). Characterizing adversarial subspaces using local intrinsic dimensionality. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
  • Madry et al., [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
  • Moon et al., [2019] Moon, S., An, G., and Song, H. O. (2019). Parsimonious black-box adversarial attacks via efficient combinatorial optimization. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 4636–4645.
  • Netzer et al., [2011] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning.
  • Papamakarios et al., [2019] Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2019). Normalizing flows for probabilistic modeling and inference. CoRR, abs/1912.02762.
  • Papernot et al., [2016] Papernot, N., McDaniel, P. D., and Goodfellow, I. J. (2016). Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277.
  • Paszke et al., [2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop.
  • Rezende and Mohamed, [2015] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1530–1538.
  • Rudin et al., [1964] Rudin, W. et al. (1964). Principles of Mathematical Analysis, volume 3. McGraw-Hill New York.
  • Russakovsky et al., [2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
  • Salimans et al., [2017] Salimans, T., Ho, J., Chen, X., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864.
  • Shafahi et al., [2019] Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson, J. P., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. (2019). Adversarial training for free! In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 3353–3364.
  • Shi* et al., [2020] Shi*, C., Xu*, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. (2020). Graphaf: a flow-based autoregressive model for molecular graph generation. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Simonyan and Zisserman, [2015] Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Song et al., [2018] Song, Y., Shu, R., Kushman, N., and Ermon, S. (2018). Constructing unrestricted adversarial examples with generative models. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 8322–8333.
  • Szegedy et al., [2016] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.
  • Szegedy et al., [2014] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
  • Tabak and Turner, [2013] Tabak, E. G. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164.
  • Tu et al., [2019] Tu, C., Ting, P., Chen, P., Liu, S., Zhang, H., Yi, J., Hsieh, C., and Cheng, S. (2019). AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 742–749.
  • Wang and Yu, [2019] Wang, H. and Yu, C. (2019). A direct approach to robust deep learning using adversarial networks. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
  • Wasserman, [2013] Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
  • Wierstra et al., [2014] Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., and Schmidhuber, J. (2014). Natural evolution strategies. Journal of Machine Learning Research (JMLR), 15(1):949–980.
  • Wierstra et al., [2008] Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In Proceedings of the 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 3381–3387.
  • Wong et al., [2020] Wong, E., Rice, L., and Kolter, J. Z. (2020). Fast is better than free: Revisiting adversarial training. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Xiao et al., [2018] Xiao, C., Li, B., Zhu, J., He, W., Liu, M., and Song, D. (2018). Generating adversarial examples with adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages 3905–3911.
  • Yuan et al., [2019] Yuan, X., He, P., Zhu, Q., and Li, X. (2019). Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9):2805–2824.
  • Zagoruyko and Komodakis, [2016] Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC).
  • Zisselman and Tamar, [2020] Zisselman, E. and Tamar, A. (2020). Deep residual flow for out of distribution detection. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13991–14000.
  • Zügner et al., [2018] Zügner, D., Akbarnejad, A., and Günnemann, S. (2018). Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 2847–2856.

Supplementary Materials

This supplementary document includes the following content to support the material presented in the paper:

  • In Section A, we present the “log-likelihood trick” and the proofs of our lemma and corollary.

  • In Section B, we give the implementation details of our algorithm and experiments. Besides introducing the flow-based model architecture, we present a detailed explanation of classifier architectures and defense mechanisms used to evaluate our method. The set of hyperparameters used in each defense and attack model is also given.

  • In Section C, we present an extended version of our simulation results. Moreover, we investigate the effect of the training data on our algorithm’s performance.

  • In Section D, we present important extensions of the current work, including our approach to high-resolution images and un-trained AdvFlow.

Appendix A Mathematical Details

A.1 The Log-likelihood Trick

Here we provide the complete proof of the “log-likelihood trick” as presented in [58]:

\begin{align*}
\nabla_{\boldsymbol{\psi}}J(\boldsymbol{\psi}) &= \nabla_{\boldsymbol{\psi}}\mathbb{E}_{p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}\left[\mathcal{L}(\mathbf{x}^{\prime})\right]\\
&= \nabla_{\boldsymbol{\psi}}\int\mathcal{L}(\mathbf{x}^{\prime})\,p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\,\mathrm{d}\mathbf{x}^{\prime}\\
&= \int\mathcal{L}(\mathbf{x}^{\prime})\,\nabla_{\boldsymbol{\psi}}p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\,\mathrm{d}\mathbf{x}^{\prime}\\
&= \int\mathcal{L}(\mathbf{x}^{\prime})\,\frac{\nabla_{\boldsymbol{\psi}}p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}{p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}\,p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\,\mathrm{d}\mathbf{x}^{\prime}\\
&= \int\mathcal{L}(\mathbf{x}^{\prime})\,\nabla_{\boldsymbol{\psi}}\log\big(p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\big)\,p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\,\mathrm{d}\mathbf{x}^{\prime}\\
&= \mathbb{E}_{p(\mathbf{x}^{\prime}|\boldsymbol{\psi})}\left[\mathcal{L}(\mathbf{x}^{\prime})\,\nabla_{\boldsymbol{\psi}}\log\big(p(\mathbf{x}^{\prime}|\boldsymbol{\psi})\big)\right].
\end{align*}
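This identity is what lets NES-style methods estimate gradients from loss evaluations alone. As a quick sanity check, the following NumPy sketch (our own illustration, not part of the released code) estimates the gradient for a Gaussian search distribution $p(\mathbf{x}^{\prime}|\boldsymbol{\psi})=\mathcal{N}(\boldsymbol{\mu},\sigma^{2}I)$ and compares it with the analytic answer for a simple quadratic loss:

```python
import numpy as np

def nes_gradient(loss_fn, mu, sigma=0.1, n_samples=10_000, rng=None):
    """Monte-Carlo estimate of grad_mu E_{x'~N(mu, sigma^2 I)}[loss(x')]
    via the log-likelihood trick, using only loss evaluations."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal((n_samples, mu.size))   # eps ~ N(0, I)
    samples = mu + sigma * eps                        # x' = mu + sigma * eps
    losses = np.apply_along_axis(loss_fn, 1, samples)
    # grad_mu log N(x'|mu, sigma^2 I) = (x' - mu) / sigma^2 = eps / sigma
    return (losses[:, None] * eps).mean(axis=0) / sigma

# Sanity check: E[||x'||^2] = ||mu||^2 + d * sigma^2, so the true gradient
# with respect to mu is 2 * mu.
mu = np.array([1.0, -2.0, 0.5])
estimate = nes_gradient(lambda x: np.sum(x ** 2), mu, sigma=0.1)
print(estimate, "vs analytic", 2 * mu)
```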

A.2 Proofs

Lemma A.1.

Let 𝐟(𝐱){\mathbf{f}(\mathbf{x})} be an invertible, differentiable function. For a small perturbation 𝛅z{\boldsymbol{\delta}_{z}} we have

\boldsymbol{\delta}=\mathbf{f}\big(\mathbf{f}^{-1}(\mathbf{x})+\boldsymbol{\delta}_{z}\big)-\mathbf{x}\approx\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\boldsymbol{\delta}_{z}.
Proof.

By the first-order Taylor series for 𝐟():dd{\mathbf{f}\left(\cdot\right):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}} we have

\mathbf{f}(\mathbf{z}+\boldsymbol{\delta}_{z})\approx\mathbf{f}(\mathbf{z})+\nabla\mathbf{f}(\mathbf{z})\,\boldsymbol{\delta}_{z},

where 𝐟(𝐳){\nabla\mathbf{f}(\mathbf{z})} is the d×d{d\times d} Jacobian matrix of the function 𝐟(){\mathbf{f}(\cdot)}. Now, by substituting 𝐳=𝐟1(𝐱){\mathbf{z}=\mathbf{f}^{-1}(\mathbf{x})}, and using the inverse function theorem [45], we can write

\mathbf{f}\big(\mathbf{f}^{-1}(\mathbf{x})+\boldsymbol{\delta}_{z}\big)\approx\mathbf{f}\big(\mathbf{f}^{-1}(\mathbf{x})\big)+\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\boldsymbol{\delta}_{z},

which, since $\mathbf{f}\big(\mathbf{f}^{-1}(\mathbf{x})\big)=\mathbf{x}$, gives us the result immediately. ∎
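As a quick numerical illustration of the lemma (our own sketch, not part of the paper's code), take the invertible map used by $\mathcal{N}$Attack, $\mathbf{f}(\mathbf{z})=\frac{1}{2}\big(\tanh(\mathbf{z})+1\big)$, and compare the exact perturbation with its first-order approximation:

```python
import numpy as np

f = lambda z: 0.5 * (np.tanh(z) + 1.0)        # N-Attack mapping to (0, 1)
f_inv = lambda x: np.arctanh(2.0 * x - 1.0)   # its inverse

x = np.array([0.2, 0.5, 0.9])                 # a "clean" point in (0, 1)
delta_z = 1e-3 * np.array([1.0, -2.0, 0.5])   # small latent perturbation

exact = f(f_inv(x) + delta_z) - x
# (grad f^{-1}(x))^{-1} = grad f(z) at z = f^{-1}(x); for this particular map
# it is the diagonal matrix diag(2 * x * (1 - x)).
approx = 2.0 * x * (1.0 - x) * delta_z

print(exact, approx)   # the two agree up to O(||delta_z||^2)
```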

Corollary A.1.1.

The adversarial perturbations generated by AdvFlow have dependent components. In contrast, 𝒩{\mathcal{N}}Attack perturbation components are independent.

Proof.

By Lemma A.1 we know that

\boldsymbol{\delta}\approx\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}\boldsymbol{\delta}_{z},

where $\boldsymbol{\delta}_{z}$ is distributed according to an isotropic normal distribution. For $\mathcal{N}$Attack, we have $\mathbf{f}(\mathbf{z})=\frac{1}{2}\big(\tanh(\mathbf{z})+1\big)$. Thus, it can be shown that $\big(\nabla\mathbf{f}^{-1}(\mathbf{x})\big)^{-1}=\mathrm{diag}\big(2\mathbf{x}\odot(1-\mathbf{x})\big)$, which is diagonal. As a result, the random vector $\boldsymbol{\delta}$ will still be normally distributed with a diagonal covariance matrix, and hence, have independent components. In contrast, for an effective flow-based model, $\nabla\mathbf{f}^{-1}(\mathbf{x})$ is not diagonal in general; otherwise, the flow would reduce to an element-wise transformation that cannot couple different data dimensions. For example, in Real NVP [8], which we use, this Jacobian is a product of lower and upper triangular matrices and is not diagonal. Hence, for a normalizing flow model $\mathbf{f}(\cdot)$, $\nabla\mathbf{f}^{-1}(\mathbf{x})$ is non-diagonal, which makes the random vector $\boldsymbol{\delta}$ approximately normal with correlated (dependent) components. ∎

Appendix B Implementation Details

In this section, we present the implementation details of our algorithm and experiments. Note that all of the experiments were run using a single NVIDIA Tesla V100-SXM2-16GB GPU.

B.1 Normalizing Flows

For the flow-based models used in AdvFlow, we implement Real NVP [8] using the framework of Ardizzone et al., [1] (https://github.com/VLL-HD/FrEIA). In particular, let $\mathbf{z}=[\mathbf{z}_{1},\mathbf{z}_{2}]$ denote the input to one layer of a normalizing flow transformation. If we denote the output of this layer by $\mathbf{x}=[\mathbf{x}_{1},\mathbf{x}_{2}]$, then the Real NVP transformation between the input and output of this particular layer can be written as:

\begin{align*}
\mathbf{x}_{1} &= \mathbf{z}_{1}\odot\exp\big(\mathbf{s}_{1}(\mathbf{z}_{2})\big)+\mathbf{t}_{1}(\mathbf{z}_{2})\\
\mathbf{x}_{2} &= \mathbf{z}_{2}\odot\exp\big(\mathbf{s}_{2}(\mathbf{x}_{1})\big)+\mathbf{t}_{2}(\mathbf{x}_{1}),
\end{align*}

where $\odot$ denotes element-wise multiplication. The functions $\mathbf{s}_{1,2}(\cdot)$ and $\mathbf{t}_{1,2}(\cdot)$ are called the scaling and translation functions. Since the invertibility of the transformation does not depend on these functions, they are implemented using ordinary neural networks. To improve the stability of the transformation, Ardizzone et al., [1] suggest soft-clamping the outputs of the scaling networks $\mathbf{s}_{1,2}(\cdot)$ before passing them to the exponential function. This soft-clamp is implemented as

s_{\rm clamp}=\frac{2\alpha}{\pi}\arctan\Big(\frac{s}{\alpha}\Big),

where $\alpha$ is a hyperparameter that controls the amount of softening. In our experiments, we set $\alpha=1.5$. Moreover, at the end of each transformation layer, we permute the output so that subsequent layers operate on different partitions of the data as $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$. The permutation pattern is set at random at the beginning of training and kept fixed thereafter.
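For concreteness, the following PyTorch sketch implements one such affine coupling layer with the soft-clamped scales described above. It is a simplified illustration under our own naming conventions, not the FrEIA implementation we actually use:

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real NVP coupling layer with soft-clamped scales (simplified sketch)."""

    def __init__(self, dim, hidden=128, alpha=1.5):
        super().__init__()
        self.alpha = alpha
        half = dim // 2          # assumes an even input dimension
        # scaling (s) and translation (t) sub-networks; invertibility does not
        # depend on their form, so plain MLPs suffice here.
        def mlp():
            return nn.Sequential(nn.Linear(half, hidden), nn.LeakyReLU(0.1),
                                 nn.Linear(hidden, half))
        self.s1, self.t1, self.s2, self.t2 = mlp(), mlp(), mlp(), mlp()

    def _clamp(self, s):
        # s_clamp = (2 * alpha / pi) * arctan(s / alpha)
        return (2.0 * self.alpha / math.pi) * torch.atan(s / self.alpha)

    def forward(self, z):
        z1, z2 = z.chunk(2, dim=1)
        x1 = z1 * torch.exp(self._clamp(self.s1(z2))) + self.t1(z2)
        x2 = z2 * torch.exp(self._clamp(self.s2(x1))) + self.t2(x1)
        return torch.cat([x1, x2], dim=1)

    def inverse(self, x):
        x1, x2 = x.chunk(2, dim=1)
        z2 = (x2 - self.t2(x1)) * torch.exp(-self._clamp(self.s2(x1)))
        z1 = (x1 - self.t1(z2)) * torch.exp(-self._clamp(self.s1(z2)))
        return torch.cat([z1, z2], dim=1)

layer = AffineCoupling(dim=8)
z = torch.randn(4, 8)
assert torch.allclose(layer.inverse(layer(z)), z, atol=1e-5)  # invertibility check
```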

After passing the data through a few high-resolution transformations, we downsample it using i-RevNet downsamplers [24]. Specifically, each $2\times 2$ spatial block of the high-resolution input is rearranged into separate channels of the low-resolution output, so the operation is invertible and no information is lost.
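This invertible downsampling is essentially a space-to-depth rearrangement. A minimal sketch (our own; the exact channel ordering of the i-RevNet downsampler may differ) using PyTorch's pixel_unshuffle is:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)            # e.g. a CIFAR-10 image batch

# Invertible downsampling: each 2x2 spatial block becomes 4 channels,
# so (3, 32, 32) -> (12, 16, 16) with no loss of information.
low = F.pixel_unshuffle(x, downscale_factor=2)
assert low.shape == (1, 12, 16, 16)

# The rearrangement is exactly invertible.
assert torch.equal(F.pixel_shuffle(low, upscale_factor=2), x)
```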

To help the normalizing flow model learn useful features, we use a fixed $1\times 1$ convolution at the beginning of each low-resolution layer. This adjustment is made in the same spirit as Glow [27]; however, instead of making the $1\times 1$ convolutions trainable, we initialize them at the beginning of training and keep them fixed afterward.

Finally, we use a multi-scale structure [8] to reduce the computational complexity of the flow-based model. Specifically, we pass the input through several layers of invertible transformations whose $\mathbf{s}_{1,2}(\cdot)$ and $\mathbf{t}_{1,2}(\cdot)$ are convolutional neural networks. Then, we send three-quarters of the data directly to the final output, while the rest goes through further rounds of mappings that use fully-connected networks. This reduces the computational burden of flow-based models, which otherwise keep the data dimension fixed throughout.

For training, we use the Adam [26] optimizer with weight decay $10^{-5}$. The learning rate follows an exponential schedule, starting from $10^{-4}$ and ending at $10^{-6}$. Also, to dequantize the image pixels, we add small Gaussian noise with $\sigma=0.02$ to the images. Table 4 summarizes the hyperparameters and architecture of the flow-based model used in AdvFlow.

Table 4: Hyperparameter and architecture details for the normalizing flow part of AdvFlow.
Optimizer Adam
Scheduler Exponential
Initial lr $10^{-4}$
Final lr $10^{-6}$
Batch Size 64
Epochs 350
Added Noise Std. 0.02
Multi-scale Levels 2
Each Level Network Type CNN-FC
High-res Transformation Blocks 4
Low-res Transformation Blocks 6
FC Transformation Blocks 6
$\alpha$ (clamping hyperparameter) 1.5
CNN Layers Hidden Channels 128
FC Layers Internal Width 128
Activation Function Leaky ReLU
Leaky Slope 0.1

B.2 Adversarial Example Detectors

In this section, we provide the details of the LID [37], Mahalanobis [32], and ResFlow [64] adversarial example detectors. All of these methods use logistic regression as their final classifier, and they construct their training and evaluation sets in the same way; the only difference is how each one extracts its features, which we review below. The training set used for the logistic regression consists of three types of data: clean, noisy, and adversarial. We take the test portion of each target dataset and add slight noise to it to create the noisy data. The clean and noisy data serve as the positive samples of the logistic regression. For the adversarial portion, we use a nominated adversarial attack method to generate adversarial examples, which serve as the negative samples. After constructing the entire dataset, $10\%$ of it is used as the logistic regression training set, and the rest for evaluation. Also, the hyperparameters of the detectors are set using nested cross-validation.
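Schematically, the detector training pipeline can be sketched as follows. This is our own simplification with hypothetical feature arrays; the actual implementations follow the cited repositories:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def build_detector(feat_clean, feat_noisy, feat_adv, train_frac=0.1, seed=0):
    """feat_* are (n, d) arrays of per-example detector features (e.g. LID or
    Mahalanobis scores per layer). Clean + noisy examples are positives,
    adversarial examples are negatives."""
    X = np.concatenate([feat_clean, feat_noisy, feat_adv])
    y = np.concatenate([np.ones(len(feat_clean) + len(feat_noisy)),
                        np.zeros(len(feat_adv))])
    # 10% of the data trains the logistic regression, the rest evaluates it.
    X_tr, X_ev, y_tr, y_ev = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf, clf.score(X_ev, y_ev)
```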

LID Detectors.

Ma et al., [37] use the concept of Local Intrinsic Dimensionality (LID) to characterize adversarial subspaces. The argument is that for a data point residing on the data submanifold, its adversarially perturbed counterpart is likely to lie outside this submanifold. As such, Ma et al., [37] argue that the intrinsic dimensionality of adversarial examples in a local neighborhood is higher than that of clean or noisy data (see Figure 1 of [37]), making LID a useful measure for distinguishing adversarial examples from clean data. Ma et al., [37] then estimate the LID of mini-batches of data using extreme value theory. To this end, they extract features of the input images using a DNN classifier and compute the LID score for these features across the training and evaluation sets. After extracting these scores for all the data, they train and evaluate the logistic regression classifier as described above. Here, we use the PyTorch implementation of LID detectors given by Lee et al., [32].
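For reference, the maximum-likelihood LID estimate for a query point with sorted distances $r_{1}\le\dots\le r_{k}$ to its $k$ nearest neighbours in a mini-batch is $\widehat{\mathrm{LID}}=-\big(\tfrac{1}{k}\sum_{i=1}^{k}\log\tfrac{r_{i}}{r_{k}}\big)^{-1}$. A minimal sketch of this estimator (ours, not the repository code) is:

```python
import numpy as np

def lid_mle(query, batch, k=20):
    """Maximum-likelihood LID estimate of `query` w.r.t. a reference mini-batch."""
    dists = np.linalg.norm(batch - query, axis=1)
    dists = np.sort(dists)
    dists = dists[dists > 0][:k]                 # drop the point itself if present
    return -1.0 / np.mean(np.log(dists / dists[-1]))

rng = np.random.default_rng(0)
batch = rng.standard_normal((256, 64))           # e.g. features from one DNN layer
print(lid_mle(batch[0], batch, k=20))
```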

Mahalanobis Detectors.

Lee et al., [32] propose an adversarial example detector based on a Mahalanobis distance-based confidence score. To this end, the authors extract features from different hidden layers of a nominated DNN classifier. Assuming that these features are distributed according to class-conditional Gaussian densities, the detector estimates the mean and covariance matrix associated with each feature layer across the training set. These densities are then used to train the logistic regression classifier based on the Mahalanobis distance confidence score between a given image feature and its closest class-conditional distribution. In this paper, we use the official implementation of the Mahalanobis adversarial example detector available online (https://github.com/pokaxpoka/deep_Mahalanobis_detector). For more information about these detectors, see [32].
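A minimal sketch of the score computation (ours, assuming class-conditional Gaussian features with a shared covariance, as in [32]) is:

```python
import numpy as np

def fit_gaussians(features, labels, n_classes):
    """Class means and the inverse of a single shared covariance for one layer."""
    means = np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)

def mahalanobis_score(x, means, prec):
    """Negative squared Mahalanobis distance to the closest class mean."""
    diffs = x - means                                     # (n_classes, d)
    d2 = np.einsum('cd,de,ce->c', diffs, prec, diffs)     # per-class distances
    return -d2.min()
```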

ResFlow Detectors.

Zisselman and Tamar, [64] generalize the Mahalanobis detectors [32] using normalizing flows. They first argue that modeling the activation distributions as Gaussian densities may not be accurate. To find a better, non-Gaussian fit, Zisselman and Tamar, [64] exploit flow-based models to construct an architecture they call Residual Flow (ResFlow). The same procedure as in the Mahalanobis detectors is then used to extract features for training the logistic regression detectors. We use the official PyTorch implementation of ResFlow available online (https://github.com/EvZissel/Residual-Flow). Note that in the original paper, ResFlows are only used for out-of-distribution detection; for our purposes, we extend their implementation to adversarial example detection following the Mahalanobis detector pipeline.

B.3 Defense Methods

In this section, we briefly review the defense techniques used in our experiments and then present the set of parameters used to train each classifier. Note that we utilize these methods only to evaluate our attack models; our results should not be regarded as a thorough case study of, or comparison between, the nominated defense methods.

B.3.1 Review of Defense Methods

Adversarial Training.

Adversarial training [38] is a method to train robust classifiers. To achieve robustness, this method tries to incorporate adversarial examples into the training process. In particular, adversarial training aims at minimizing the following objective function for classifier 𝒞()\mathcal{C}(\cdot) with parameters 𝜽\boldsymbol{\theta}:

\min_{\boldsymbol{\theta}}\sum_{i}\max_{\left\lVert\boldsymbol{\delta}\right\rVert\leq\epsilon}\ell\big(\mathcal{C}(\mathbf{x}_{i}+\boldsymbol{\delta}),y_{i}\big). (14)

Here, 𝐱i\mathbf{x}_{i} and yiy_{i} are the training examples and their associated correct labels. Also, ()\ell(\cdot) is an appropriate cost function for classifiers, such as the standard cross-entropy loss. The inner maximization objective in Eq. (14) is the cost function used to generate adversarial examples. Thus, we can interpret Eq. (14) as training a model that can predict the labels correctly, even in the presence of additive perturbations. However, finding the exact solution to the inner optimization problem is not straightforward, and in most real-world cases cannot be done efficiently [29]. To circumvent this problem, Madry et al., [38] proposed approximately solving it by using a Projected Gradient Descent algorithm. This method is widely known as adversarial training.
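A minimal PyTorch sketch of the inner maximization and the resulting training step is given below. It is our own simplified illustration of Eq. (14); `model`, `optimizer`, and the data tensors are assumed to exist, and pixel values are assumed to lie in $[0,1]$:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate inner maximization of Eq. (14) with l_inf PGD."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return torch.clamp(x + delta.detach(), 0, 1)

def adv_train_step(model, optimizer, x, y):
    """One outer minimization step on adversarially perturbed inputs."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```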

Adversarial Training for Free.

The main disadvantage of adversarial training, as proposed in [38], is that solving the inner optimization problem makes the algorithm much slower than standard classifier training, since each inner maximization step requires back-propagating through the DNN. To address this issue, Shafahi et al., [48] exploit the Fast Gradient Sign Method (FGSM) [14] with step size $\epsilon$ to compute an approximate solution to the inner maximization objective and then update the DNN parameters. This procedure is repeated $m$ times on the same minibatch, and the total number of epochs is divided by $m$ to account for the repeated minibatch training. We use the PyTorch code available on the official repository of free adversarial training (https://github.com/mahyarnajibi/FreeAdversarialTraining) to train our classifiers.
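The idea can be sketched as follows (our own simplification, not the repository code; `model`, `optimizer`, and `loader` are assumed to exist): each minibatch is replayed $m$ times, and every backward pass is shared between the perturbation update and the weight update.

```python
import torch
import torch.nn.functional as F

def free_adv_epoch(model, optimizer, loader, eps=8/255, m=8):
    """One epoch of 'free' adversarial training with minibatch replay."""
    delta = None
    for x, y in loader:
        if delta is None or delta.shape != x.shape:
            delta = torch.zeros_like(x)          # persistent perturbation
        for _ in range(m):                       # replay the same minibatch m times
            delta.requires_grad_(True)
            loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
            optimizer.zero_grad()
            loss.backward()                      # gradients for both delta and weights
            with torch.no_grad():
                delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps)
            optimizer.step()
```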

Fast Adversarial Training.

To make adversarial training even faster, Wong et al., [60] proposed a method called “fast” adversarial training. In this approach, they combine FGSM adversarial training with random initialization of the perturbation to train robust DNN classifiers. To speed up the algorithm further, Wong et al., [60] also utilize several fast training techniques (such as a cyclic learning rate and mixed-precision arithmetic) from the DAWNBench competition [6]. In this paper, we replace the FGSM adversarial training with PGD, but we still use the cyclic learning rate and mixed-precision arithmetic. For this method, we used the official PyTorch code available online (https://github.com/anonymous-sushi-armadillo).

Adversarial Training with Auxiliary Rotations.

Gidaris et al., [13] showed that convolutional neural networks can learn useful image features in an unsupervised fashion by predicting the amount of rotation applied to a given image, and observed that these features can improve classification performance. Motivated by these observations, Hendrycks et al., [19] suggest exploiting this self-supervised feature learning to improve the robustness of classifiers against adversaries. Specifically, they propose training a so-called “head” alongside the original classifier. This auxiliary head takes the penultimate features of the classifier and predicts the amount of rotation applied to an image from four possible angles ($0^{\circ}$, $90^{\circ}$, $180^{\circ}$, or $270^{\circ}$). It was shown that this simple addition can improve the performance of adversarially trained classifiers. To train our models, we make use of the PyTorch code for adversarial training with auxiliary rotations available online (https://github.com/hendrycks/ss-ood).
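A sketch of the auxiliary rotation loss is given below. It is our own illustration; `backbone` is assumed to return penultimate features and `rot_head` to be a linear layer with four outputs:

```python
import torch
import torch.nn.functional as F

def rotation_loss(backbone, rot_head, x):
    """Self-supervised auxiliary loss: predict which of the four rotations
    (0, 90, 180, 270 degrees) was applied to each image."""
    rotated, targets = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(x, k, dims=(2, 3)))
        targets.append(torch.full((x.size(0),), k, dtype=torch.long))
    rotated = torch.cat(rotated)
    targets = torch.cat(targets)
    logits = rot_head(backbone(rotated))                 # 4-way rotation prediction
    return F.cross_entropy(logits, targets)
```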

B.3.2 Hyperparameters of Defense Methods

Tables 5 and 6 summarize the hyperparameters used for training our defended classifiers.

Table 5: Hyperparameters of defense methods for training CIFAR-10 [30] classifiers. Numbers 1 and 2 correspond to Wide-ResNet-32 [63] and ResNet-50 [18] architectures, respectively.
Classifier Free-1 Free-2 Fast-1 Fast-2 RotNet-1 RotNet-2
Optimizer SGD SGD SGD SGD SGD SGD
lr 0.1 0.005 0.21 0.21 0.1 0.1
Momentum 0.9 0.9 0.9 0.9 0.9 0.9
Weight Decay 0.0002 0.0002 0.0005 0.0005 0.0005 0.0005
Nesterov N N N N Y Y
Batch Size 128 64 64 128 128 128
Epochs 125 100 100 100 100 100
Inner Optimization FGSM FGSM PGD PGD PGD PGD
$\epsilon$ 8/255 8/255 8/255 8/255 8/255 8/255
Step Size 8/255 8/255 2/255 2/255 2/255 2/255
Number of Steps (Repeats) 8 8 5 5 10 10
Table 6: Hyperparameters of defense methods for training SVHN [40] classifiers. Numbers 1 and 2 correspond to Wide-ResNet-32 [63] and ResNet-50 [18] architectures, respectively.
Classifier Free-1 Free-2 Fast-1 Fast-2 RotNet-1 RotNet-2
Optimizer SGD SGD SGD SGD SGD SGD
lr 0.0001 0.01 0.21 0.21 0.1 0.1
Momentum 0.9 0.9 0.9 0.9 0.9 0.9
Weight Decay 0.0002 0.0002 0.0005 0.0005 0.0005 0.0005
Nesterov N N N N Y Y
Batch Size 128 128 64 128 128 128
Epochs 100 100 100 100 100 100
Inner Optimization FGSM FGSM PGD PGD PGD PGD
$\epsilon$ 8/255 8/255 8/255 8/255 8/255 8/255
Step Size 8/255 8/255 2/255 2/255 2/255 2/255
Number of Steps (Repeats) 8 8 5 5 10 10

B.4 Hyperparameters of Attack Methods

In this part, we present the set of hyperparameters used for each attack method. For $\mathcal{N}$Attack [33] and AdvFlow, we tune the hyperparameters on a development set so that they result in the best performance on an un-defended CIFAR-10 classifier. In the case of bandits with time and data-dependent priors [23], we use two sets of hyperparameters: for the vanilla classifiers we use the hyperparameters set in [23], while for defended classifiers we use those set in [39]. For SimBA [16], we used the hyperparameters set in the official repository, changing only the stride from 7 to 6 to allow for the correct computation of block reordering. Once set, we keep the hyperparameters fixed throughout the rest of the experiments. Tables 7-10 summarize the hyperparameters used for each attack method in our experiments.

Table 7: Hyperparameters of bandits with time and data-dependent priors [23].
Hyperparameter Vanilla Defended
OCO learning rate 100 0.1
Image learning rate 0.01 0.01
Bandit exploration 0.1 0.1
Finite difference probe 0.1 0.1
Tile size $(6\mathrm{px})^{2}$ $(4\mathrm{px})^{2}$
Table 8: Hyperparameters of 𝒩\mathcal{N}Attack [33].
Hyperparameter Value
$\sigma$ (noise std.) 0.1
Sample size 20
Learning rate 0.02
Maximum iteration 500
Table 9: Hyperparameters of SimBA [16].
Hyperparameter Value
$\epsilon$ 0.2
Freq. Dimensionality 14
Order Strided
Stride 6
Table 10: Hyperparameters of AdvFlow (ours).
Hyperparameter Value
$\sigma$ (noise std.) 0.1
Sample size 20
Learning rate 0.02
Maximum iteration 500

Appendix C Extended Experimental Results

In this section, we present an extended version of our experimental results.

C.1 Table of Attack Success Rate and Number of Queries

Table 11 presents the attack success rate, as well as the average and median number of queries, for AdvFlow alongside bandits [23], $\mathcal{N}$Attack [33], and SimBA [16]. In each case, we also show the clean data accuracy and the success rate of the white-box PGD-100 attack for reference. Details of classifier training and defense mechanisms can be found in Appendix B.3. As can be seen, when it comes to attacking defended models, AdvFlow can outperform the baselines in both the number of queries and the attack success rate.

C.2 Success Rate vs. Number of Queries

Figure 3 shows the success rate of AdvFlow and 𝒩\mathcal{N}Attack [33] as a function of the maximum number of queries for defended models. As can be seen, given a fixed number of queries, AdvFlow can generate more successful attacks.

C.3 Confusion Matrices of Transferability

Figure 4 shows the transferability rate of the generated attacks to various classifiers. Each entry shows the success rate of adversarial examples generated for the row-wise classifier when used to attack the column-wise classifier. There are a few points worth mentioning regarding these results:

  • AdvFlow attacks transfer better between defended models than from vanilla to defended models. We argue that this is because AdvFlow learns higher-level perturbations to attack DNNs; since vanilla classifiers rely on different features than defended ones, perturbations crafted against vanilla models are less adaptable to attacking defended classifiers. In contrast, since $\mathcal{N}$Attack acts at the pixel level, it is less susceptible to this issue.

  • Generally, adversarial examples generated by AdvFlow are more transferable between different architectures than those of $\mathcal{N}$Attack. The same argument as in the previous point applies here.

  • Transferability of black-box attacks is not as important as in the white-box setting. In the black-box case, since no assumption is made about the model architecture, we can simply generate new adversarial examples for a new target classifier. For white-box attacks, however, transferability is tied to their success in attacking previously unseen target networks. Thus, a high rate of transferability is essential if a white-box attack is to be deployed in real-world situations, where we often have no access to the internals of a classifier.

C.4 Samples of Adversarial Examples

Figure 5 shows samples of adversarial examples generated by AdvFlow and $\mathcal{N}$Attack [33], intended to attack a vanilla Wide-ResNet-32 [63]. As the images show, AdvFlow generates adversarial perturbations that often take the shape of the original data, a property that makes AdvFlow adversaries harder for adversarial example detectors to catch. In contrast, the perturbations generated by $\mathcal{N}$Attack clearly come from a different distribution than the data itself, and as a result they can be detected easily by adversarial example detectors.

Table 11: Attack success rate, average and median of the number of queries to generate an adversarial example for CIFAR-10 [30] and SVHN [40]. For a fair comparison, we first find the samples where all the attack methods are successful, and then compute the average (median) of queries for these samples. Note that for $\mathcal{N}$Attack and AdvFlow we check whether we arrived at an adversarial point every 200 queries, and hence, the medians are multiples of 200. Clean data accuracy and PGD-100 attack success rate are also shown for reference. All attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$.
Arch Data Attack PGD-100 Bandits [23] / $\mathcal{N}$Attack [33] / SimBA [16] / AdvFlow (ours)
Defense Clean Acc.(%) Success Rate(%) \uparrow Success Rate(%) \uparrow Avg. of Queries \downarrow Med. of Queries \downarrow
WideResNet32 [63] CIFAR-10 Vanilla 91.77 100 98.81 / \mathbf{100} / 99.99 / 99.42 552.69 / \mathbf{237.58} / 237.70 / 949.31 182 / 200 / \mathbf{126} / 400
FreeAdv [48] 81.29 47.52 37.12 / 38.97 / 35.52 / \mathbf{41.21} 1062.70 / 874.91 / 463.09 / \mathbf{421.63} 354 / 400 / 244 / \mathbf{200}
FastAdv [60] 86.33 46.37 36.60 / 36.90 / 35.07 / \mathbf{40.22} 1065.92 / 973.05 / \mathbf{428.81} / 436.80 358 / 400 / 234 / \mathbf{200}
RotNetAdv [19] 86.58 46.59 37.73 / 38.04 / 35.63 / \mathbf{40.67} 1085.43 / 941.67 / 471.99 / \mathbf{424.95} 408 / 400 / 259 / \mathbf{200}
SVHN Vanilla 96.45 99.81 87.84 / \mathbf{98.76} / 97.26 / 90.31 1750.65 / 408.75 / \mathbf{202.07} / 1572.24 1128 / 200 / \mathbf{107} / 600
FreeAdv [48] 86.47 57.22 49.64 / 50.28 / 46.28 / \mathbf{50.76} 819.98 / 903.12 / \mathbf{365.42} / 692.73 250 / 400 / 216 / \mathbf{200}
FastAdv [60] 93.90 46.76 40.43 / 35.42 / 36.19 / \mathbf{41.49} 755.23 / 1243.38 / \mathbf{307.73} / 526.37 284 / 600 / 216 / \mathbf{200}
RotNetAdv [19] 90.33 48.67 43.47 / 41.49 / 39.01 / \mathbf{44.22} 663.07 / 756.48 / \mathbf{319.93} / 480.02 202 / 400 / \mathbf{186} / 200
ResNet50 [18] CIFAR-10 Vanilla 91.75 100 96.75 / 99.85 / \mathbf{99.96} / 99.37 795.28 / \mathbf{252.13} / 286.05 / 1051.18 280 / 200 / \mathbf{163} / 600
FreeAdv [48] 75.17 54.54 45.64 / 46.49 / 43.14 / \mathbf{49.46} 842.56 / 836.81 / 383.56 / \mathbf{371.81} 248 / 400 / 206 / \mathbf{200}
FastAdv [60] 79.09 53.45 45.20 / 45.19 / 43.57 / \mathbf{49.08} 891.54 / 901.44 / 374.58 / \mathbf{359.21} 248 / 400 / \mathbf{184} / 200
RotNetAdv [19] 76.39 52.04 45.80 / 46.41 / 42.65 / \mathbf{50.10} 826.60 / 774.24 / 376.30 / \mathbf{292.74} 232 / 400 / \mathbf{184} / 200
SVHN Vanilla 96.23 99.38 92.63 / \mathbf{96.73} / 93.14 / 83.67 1338.30 / 487.32 / \mathbf{250.02} / 1749.48 852 / 200 / \mathbf{126} / 800
FreeAdv [48] 87.67 46.50 42.27 / 43.99 / 39.83 / \mathbf{44.66} 793.30 / 703.76 / \mathbf{327.30} / 565.2 \mathbf{198} / 400 / 207 / 200
FastAdv [60] 92.67 50.25 43.26 / 36.99 / 38.98 / \mathbf{45.11} 739.40 / 1255.24 / \mathbf{286.71} / 436.83 294 / 600 / 202 / \mathbf{200}
RotNetAdv [19] 90.15 48.30 43.17 / 40.37 / 39.00 / \mathbf{43.96} 660.81 / 891.44 / \mathbf{312.47} / 497.74 \mathbf{190} / 400 / 195 / 200
Figure 3: Success rate vs. maximum number of queries to attack CIFAR-10 [30] and SVHN [40] classifiers with Wide-ResNet-32 [63] and ResNet-50 [18] architectures. Panels: CIFAR-10 (Wide-ResNet-32), SVHN (Wide-ResNet-32), CIFAR-10 (ResNet-50), SVHN (ResNet-50).
Figure 4: Confusion matrix of transferability for adversarial attacks generated by AdvFlow (ours) and $\mathcal{N}$Attack [33]. Each entry shows the success rate of adversarial examples originally generated for the row-wise classifier when used to attack the column-wise model. The numbers 1 and 2 in the name of each classifier indicate a Wide-ResNet-32 [63] or ResNet-50 [18] architecture, respectively. Panels: CIFAR-10 ($\mathcal{N}$Attack), CIFAR-10 (AdvFlow), SVHN ($\mathcal{N}$Attack), SVHN (AdvFlow).
Figure 5: Magnified difference and adversarial examples generated by AdvFlow (ours) and $\mathcal{N}$Attack [33] alongside the clean data for CIFAR-10 [30] and SVHN [40]. As can be seen, the adversaries generated by AdvFlow are better disguised in the data, while those of $\mathcal{N}$Attack [33] look noisy (better viewed in digital format).

C.5 Training Data and Its Effects

In the closing paragraph of Section 3.1, we argued that although we use the same training data for our flow-based models as for the classifier, this does not meaningfully affect the performance of the generated adversarial examples. This claim is plausible since we never explicitly use this data to generate adversarial examples, nor to learn the features by which the classifier distinguishes them. To support our claim, we replicate the experiments of Table 2 for the case of CIFAR-10 [30]. This time, however, we select 9000 of the test images and pre-train our normalizing flow on them. We then evaluate the performance of AdvFlow on the remaining 1000 test images, and report the attack success rate and the average number of queries in Table 12 for this new scenario (Scenario 2) vs. the original case (Scenario 1).

As can be seen, the performance does not change much in general. The small differences between the two cases come from the fact that in Scenario 1 we had 50000 training images to train our flow-based model, while in Scenario 2 we only trained our model on 9000 images. Also, here we evaluate AdvFlow on 1000 test images, in contrast to the whole 10000 test images used in Scenario 1. Finally, it should be noted that we get an even more balanced relative performance in the fair setting where we split the training data 50/50 between the classifier and the flow-based model. However, since the performance of the classifier drops in this case, we only report the unfair setting here.

Table 12: Attack success rate and average (median) of the number of queries to generate an adversarial example for CIFAR-10 [30]. Scenario 1 corresponds to the case where we use the whole CIFAR-10 training data to train our normalizing flow. Scenario 2 indicates the experiment in which we train our flow-based model on 9000 images from the CIFAR-10 test data. The architecture of the classifier in all of the cases is Wide-ResNet-32 [63]. Also, all attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$.
Data Success Rate(%) \uparrow Avg. (Med.) of Queries \downarrow
Defense Scenario 1 Scenario 2 Scenario 1 Scenario 2
CIFAR-10 Vanilla 99.42 98.91 950.07 (400) 949.78 (400)
FreeAdv 41.21 40.22 923.58 (200) 962.31 (200)
FastAdv 40.22 40.93 963.77 (200) 1114.68 (200)
RotNetAdv 40.67 40.32 880.86 (200) 876.57 (400)

More interestingly, we observe that if we train our flow-based model on a dataset similar to the original one, we still obtain an acceptable relative performance. More specifically, we train our flow-based model on the CIFAR-100 [30] dataset instead of CIFAR-10 [30], and then run AdvFlow on the CIFAR-10 [30] test data. Despite being visually similar, the two datasets differ in their classes and samples per class.

Table 13 shows the performance of AdvFlow in this case, where the flow is pre-trained on CIFAR-100 instead of CIFAR-10. As the results indicate, we achieve competitive performance despite the model being trained on a slightly different dataset. Furthermore, a few adversarial examples from this model are shown in Figure 6. We see that the perturbations still more or less take the shape of the data.

Table 13: Attack success rate and average (median) of the number of queries to generate an adversarial example for CIFAR-10 [30] test data. The train data row shows the data that is used for training the normalizing flow part of AdvFlow. The architecture of the classifier in all of the cases is Wide-ResNet-32 [63]. Also, all attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$.
Test Success Rate(%) \uparrow Avg. (Med.) of Queries \downarrow
Def/Train Data CIFAR-10 CIFAR-100 CIFAR-10 CIFAR-100
CIFAR-10 Vanilla 99.42 98.72 950.07 (400) 1198.03 (600)
FreeAdv 41.21 39.95 923.58 (200) 955.05 (200)
FastAdv 40.22 38.83 963.77 (200) 1017.66 (200)
RotNetAdv 40.67 39.28 880.86 (200) 910.55 (200)
Figure 6: Magnified difference and adversarial examples generated by AdvFlow alongside the clean data using flow-based models trained on CIFAR-100 [30] (left) and CIFAR-10 [30] (right).

Appendix D AdvFlow and Its Variations

D.1 AdvFlow

Algorithm 1 summarizes the main adversarial attack approach introduced in this paper.

Algorithm 1 AdvFlow for inconspicuous black-box adversarial attacks

Input: Clean data 𝐱{\mathbf{x}}, true label y{y}, pre-trained flow-based model 𝐟(){\mathbf{f}(\cdot)}.
Output: Adversarial example 𝐱.{\mathbf{x}^{\prime}}.
Parameters: noise variance σ2{\sigma^{2}}, learning rate α{\alpha}, population size np{n_{p}}, maximum number of queries Q{Q}.

1:Initialize 𝝁{\boldsymbol{\mu}} randomly.
2:Compute 𝐳clean=𝐟1(𝐱){\mathbf{z}_{clean}=\mathbf{f}^{-1}(\mathbf{x})}.
3:for q=1,2,,Q/npq=1,2,\ldots,\lfloor Q/n_{p}\rfloor do
4:     Draw np{n_{p}} samples from 𝜹z=𝝁+σϵ{\boldsymbol{\delta}_{z}}=\boldsymbol{\mu}+\sigma\boldsymbol{\epsilon} where ϵ𝒩(ϵ|𝟎,I)\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0},I).
5:     Set 𝐳k=𝐳clean+𝜹zk{\mathbf{z}_{k}=\mathbf{z}_{clean}+{\boldsymbol{\delta}_{z}}_{k}} for all k=1,,np{k=1,\ldots,n_{p}}.
6:     Calculate k=(proj𝒮(𝐟(𝐳k))){\mathcal{L}_{k}=\mathcal{L}\big{(}\mathrm{proj}_{\mathcal{S}}\big{(}\mathbf{f}(\mathbf{z}_{k})\big{)}\big{)}} for all k=1,,np{k=1,\ldots,n_{p}}.
7:     Normalize ^k=(kmean(𝓛))/std(𝓛){\hat{\mathcal{L}}_{k}=\big{(}\mathcal{L}_{k}-\mathrm{mean}(\boldsymbol{\mathcal{L}})\big{)}/\mathrm{std}(\boldsymbol{\mathcal{L}})}.
8:     Compute 𝝁J(𝝁,σ)=1npk=1np^kϵk{\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma)=\tfrac{1}{n_{p}}\sum_{k=1}^{n_{p}}\hat{\mathcal{L}}_{k}\boldsymbol{\epsilon}_{k}}.
9:     Update 𝝁𝝁α𝝁J(𝝁,σ){\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\alpha\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma)}.
10:end for
11:Output 𝐱=proj𝒮(𝐟(𝐳clean+𝝁)){\mathbf{x}^{\prime}}=\mathrm{proj}_{\mathcal{S}}\big{(}\mathbf{f}(\mathbf{z}_{clean}+\boldsymbol{\mu})\big{)}.
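For readers who prefer code, the following PyTorch sketch mirrors Algorithm 1. It is a simplified illustration under hypothetical names: `flow` is assumed to expose `forward`/`inverse` maps, `loss_fn` is the black-box C&W-style objective returning a scalar, and `proj` clips a candidate into the allowed $\ell_\infty$ ball around $\mathbf{x}$ and the valid pixel range:

```python
import torch

def advflow_attack(x, flow, loss_fn, proj, sigma=0.1, lr=0.02,
                   n_pop=20, max_queries=10_000):
    """Sketch of Algorithm 1: NES over the latent mean of a pre-trained flow."""
    mu = 0.01 * torch.randn_like(x)                 # 1: initialize mu (scale is our choice)
    z_clean = flow.inverse(x)                       # 2: map the clean image to latent space
    for _ in range(max_queries // n_pop):
        eps = torch.randn((n_pop,) + x.shape)       # 4: eps ~ N(0, I)
        delta_z = mu + sigma * eps
        cand = [proj(flow.forward(z_clean + d)) for d in delta_z]   # 5-6: candidates
        losses = torch.tensor([loss_fn(c) for c in cand])           # 6: black-box queries
        l_hat = (losses - losses.mean()) / (losses.std() + 1e-12)   # 7: normalize (eps added for safety)
        grad = (l_hat.view(-1, *([1] * x.dim())) * eps).mean(0)     # 8: NES gradient estimate
        mu = mu - lr * grad                         # 9: update the latent mean
    return proj(flow.forward(z_clean + mu))         # 11: output the adversarial example
```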

D.2 Greedy AdvFlow [9]

We can modify Algorithm 1 so that it stops as soon as it reaches an adversarial data point. To this end, we only need to check whether we have generated a sample for which the C&W cost of Eq. (1) is zero. Besides, instead of using all of the generated samples to update the mean of the latent Gaussian $\boldsymbol{\delta}_{z}$, we can select the top-$K$ samples for which the C&W loss is lowest and update the mean by averaging these latent-space points. Applying these changes, we get a new algorithm coined Greedy AdvFlow, given in Algorithm 2.

Algorithm 2 Greedy AdvFlow for inconspicuous black-box adversarial attacks

Input: Clean data 𝐱{\mathbf{x}}, true label y{y}, pre-trained flow-based model 𝐟(){\mathbf{f}(\cdot)}.
Output: Adversarial example 𝐱.{\mathbf{x}^{\prime}}.
Parameters: noise variance σ2{\sigma^{2}}, voting population K{K}, population size np{n_{p}}, maximum number of queries Q{Q}.

1:Initialize 𝝁{\boldsymbol{\mu}} randomly.
2:Compute 𝐳clean=𝐟1(𝐱){\mathbf{z}_{clean}=\mathbf{f}^{-1}(\mathbf{x})}.
3:for q=1,2,,Q/npq=1,2,\ldots,\lfloor Q/n_{p}\rfloor do
4:     Draw np{n_{p}} samples from 𝜹z=𝝁+σϵ{\boldsymbol{\delta}_{z}}=\boldsymbol{\mu}+\sigma\boldsymbol{\epsilon} where ϵ𝒩(ϵ|𝟎,I)\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0},I).
5:     Set 𝐳k=𝐳clean+𝜹zk{\mathbf{z}_{k}=\mathbf{z}_{clean}+{\boldsymbol{\delta}_{z}}_{k}} for all k=1,,np{k=1,\ldots,n_{p}}.
6:     Calculate k=(proj𝒮(𝐟(𝐳k))){\mathcal{L}_{k}=\mathcal{L}\big{(}\mathrm{proj}_{\mathcal{S}}\big{(}\mathbf{f}(\mathbf{z}_{k})\big{)}\big{)}} for all k=1,,np{k=1,\ldots,n_{p}}.
7:     if any k\mathcal{L}_{k} becomes 0 then:
8:         Output the 𝐱=proj𝒮(𝐟(𝐳k)){\mathbf{x}^{\prime}}=\mathrm{proj}_{\mathcal{S}}\big{(}\mathbf{f}(\mathbf{z}_{k})\big{)} for which k=0\mathcal{L}_{k}=0 as the adversarial example.
9:         break
10:     end if
11:     Find the top-KK samples 𝐳k{\mathbf{z}_{k}} with the lowest score k{\mathcal{L}_{k}}.
12:     Update 𝝁1KktopK𝜹zk{\boldsymbol{\mu}\leftarrow\tfrac{1}{K}\sum_{k\in\mathrm{top-}K}{\boldsymbol{\delta}_{z}}_{k}}.
13:end for

Table 14 compares the proposed method with AdvFlow. As can be seen, the greedy variant reduces the number of required queries and, in several cases, improves the success rate.

Table 14: Attack success rate and average (median) of the number of queries to generate an adversarial example for CIFAR-10 [30] and SVHN [40]. The architecture of the classifier in all of the cases is Wide-ResNet-32 [63]. Also, all attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$. The hyperparameters used for Greedy AdvFlow are the same as in Table 10. In each iteration, we select the top-4 data samples to update the mean.
Data Success Rate(%) \uparrow Avg. (Med.) of Queries \downarrow
Defense Greedy AdvFlow AdvFlow Greedy AdvFlow AdvFlow
CIFAR-10 Vanilla 99.12 99.42 991.98 (460) 950.07 (400)
FreeAdv 41.06 41.21 842.37 (180) 923.58 (200)
FastAdv 40.06 40.22 904.78 (200) 963.77 (200)
RotNetAdv 40.50 40.67 821.80 (180) 880.86 (200)
SVHN Vanilla 92.40 90.42 1305.06 (540) 1582.87 (800)
FreeAdv 52.57 50.63 816.54 (200) 1095.68 (200)
FastAdv 43.03 41.39 781.06 (240) 1046.45 (400)
RotNetAdv 45.43 44.37 653.82 (160) 923.59 (200)

D.3 AdvFlow for High-resolution Images

Despite their ease of use in generating low-resolution images, high-resolution image generation with normalizing flows is computationally demanding. This issue is even more pronounced for images with high variability, such as the ImageNet [46] dataset, which may require many invertible transformations to model. To cope with this problem, we propose an adjustment to our AdvFlow algorithm. Instead of generating the image in the high-dimensional space, we first map it to a low-dimensional space using bilinear interpolation. Then, we perform the AdvFlow algorithm to generate the set of candidate examples. Next, we compute the adversarial perturbations in the low-dimensional space and map them back to their high-dimensional representation using bilinear upsampling. These perturbations are then added to the original target image, and the rest of the algorithm continues as before. Figure 7 shows the block diagram of the proposed solution for high-resolution data. Moreover, the updated AdvFlow procedure is summarized in Algorithm 3. Changes are highlighted in red.

Figure 7: AdvFlow adjustment for high-resolution images (blocks: Downsampler → NF → Upsampler → addition). Instead of working with the high-resolution image directly, we downsample it; then, after generating candidate low-resolution perturbations, we map them back to high resolution using a bilinear upsampler.
Algorithm 3 AdvFlow for high-resolution black-box attack

Input: Clean data 𝐱{\mathbf{x}}, true label y{y}, pre-trained flow-based model 𝐟(){\mathbf{f}(\cdot)}.
Output: Adversarial example 𝐱.{\mathbf{x}^{\prime}}.
Parameters: noise variance σ2{\sigma^{2}}, learning rate α{\alpha}, population size np{n_{p}}, maximum number of queries Q{Q}.

1:Initialize 𝝁{\boldsymbol{\mu}} randomly.
2:Downsample 𝐱\mathbf{x} and save it as 𝐱low\mathbf{x}_{\mathrm{low}}.
3:Compute 𝐳clean=𝐟1(𝐱low){\mathbf{z}_{clean}=\mathbf{f}^{-1}(\mathbf{x}_{\mathrm{low}})}.
4:for q=1,2,,Q/npq=1,2,\ldots,\lfloor Q/n_{p}\rfloor do
5:     Draw np{n_{p}} samples from 𝜹z=𝝁+σϵ{\boldsymbol{\delta}_{z}}=\boldsymbol{\mu}+\sigma\boldsymbol{\epsilon} where ϵ𝒩(ϵ|𝟎,I)\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0},I).
6:     Set 𝐳k=𝐳clean+𝜹zk{\mathbf{z}_{k}=\mathbf{z}_{clean}+{\boldsymbol{\delta}_{z}}_{k}} for all k=1,,np{k=1,\ldots,n_{p}}.
7:     Compute and upsample 𝜸k=𝐟(𝐳k)𝐱low{\boldsymbol{\gamma}_{k}=\mathbf{f}(\mathbf{z}_{k})-\mathbf{x}_{\mathrm{low}}} for all k=1,,np{k=1,\ldots,n_{p}}.
8:     Calculate k=(proj𝒮(𝜸k+𝐱)){\mathcal{L}_{k}=\mathcal{L}\big{(}\mathrm{proj}_{\mathcal{S}}\big{(}\boldsymbol{\gamma}_{k}+\mathbf{x}\big{)}\big{)}} for all k=1,,np{k=1,\ldots,n_{p}}.
9:     Normalize ^k=(kmean(𝓛))/std(𝓛){\hat{\mathcal{L}}_{k}=\big{(}\mathcal{L}_{k}-\mathrm{mean}(\boldsymbol{\mathcal{L}})\big{)}/\mathrm{std}(\boldsymbol{\mathcal{L}})}.
10:     Compute 𝝁J(𝝁,σ)=1npk=1np^kϵk{\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma)=\tfrac{1}{n_{p}}\sum_{k=1}^{n_{p}}\hat{\mathcal{L}}_{k}\boldsymbol{\epsilon}_{k}}.
11:     Update 𝝁𝝁α𝝁J(𝝁,σ){\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\alpha\nabla_{\boldsymbol{\mu}}J(\boldsymbol{\mu},\sigma)}.
12:end for
13:Upsample 𝜸=𝐟(𝐳clean+𝝁)𝐱low{\boldsymbol{\gamma}=\mathbf{f}(\mathbf{z}_{clean}+\boldsymbol{\mu})}-\mathbf{x}_{\mathrm{low}}.
14:Output 𝐱=proj𝒮(𝜸+𝐱){\mathbf{x}^{\prime}}=\mathrm{proj}_{\mathcal{S}}\big{(}\boldsymbol{\gamma}+\mathbf{x}\big{)}.
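The only new ingredients relative to Algorithm 1 are the bilinear resampling steps. In PyTorch they amount to the following sketch (ours, assuming a $64\times 64$ flow resolution):

```python
import torch.nn.functional as F

def downsample(x, size=64):
    # x: (N, C, H, W) high-resolution images -> low-resolution flow input
    return F.interpolate(x, size=(size, size), mode='bilinear', align_corners=False)

def upsample(gamma_low, high_res_shape):
    # map a low-resolution perturbation back to the original resolution
    return F.interpolate(gamma_low, size=high_res_shape[-2:],
                         mode='bilinear', align_corners=False)
```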

To test the performance of the proposed approach, we pre-train a flow-based model on a $64\times 64$ version of ImageNet [46]. The normalizing flow architecture and training hyperparameters are as shown in Table 4. Furthermore, we compare our model against bandits with time and data-dependent priors [23] and $\mathcal{N}$Attack [33]. For bandits, we use the hyperparameters tuned for ImageNet [46] in the original paper [23]. Also, for $\mathcal{N}$Attack [33] and AdvFlow, we observe that the hyperparameters in Tables 8 and 10 work best for the vanilla architectures on this dataset, so we keep them as before.

We use the nominated black-box methods to attack a classifier in fewer than 10,000 queries. We use the pre-trained Inception-v3 [52], ResNet-50 [18], and VGG-16 [50] classifiers available in torchvision as our vanilla target models. Also, a defended ResNet-50 [18] model trained by fast adversarial training [60] with FGSM ($\epsilon=4/255$) is used for evaluation. This model is available on the official repository of fast adversarial training (https://github.com/anonymous-sushi-armadillo).

Table 15 shows our experimental results on the ImageNet [46] dataset. As can be seen, we obtain similar results to the CIFAR-10 [30] and SVHN [40] experiments of Table 2: for the vanilla architectures, we perform slightly worse than $\mathcal{N}$Attack [33], while for the defended model we improve upon it considerably. It should also be noted that in all of these cases we generate adversaries that look like the original data and come from the same distribution, a desirable property when confronting adversarial example detectors. Figure 8 depicts a few adversarial examples generated by AdvFlow compared to $\mathcal{N}$Attack [33] for a vanilla Inception-v3 [52] DNN classifier. As seen, the AdvFlow perturbations tend to take the shape of the data, reducing the possibility of changing the underlying data distribution. In contrast, $\mathcal{N}$Attack perturbations are pixel-level, independent additive noise that causes the adversarial example distribution to differ from that of the data.

Table 15: Attack success rate and average (median) of the number of queries needed to generate an adversarial example for ImageNet [46]. For a fair comparison, we first find the samples where all the attack methods are successful, and then compute the average (median) of queries for these samples. Note that for $\mathcal{N}$Attack and AdvFlow we check whether we arrived at an adversarial point every 200 queries, hence the medians are multiples of 200. All attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$. * The accuracy is computed with respect to the 1000 test data used for attack evaluation.
Data Attack Bandits [23] / $\mathcal{N}$Attack [33] / SimBA [16] / AdvFlow (ours)
Arch. Acc*(%) Success Rate(%) \uparrow Avg. (Med.) of Queries \downarrow
ImageNet Inception-v3 99.20 87.80 / \mathbf{95.06} / 80.95 / 87.50 1034.41 (430) / \mathbf{680.52} (\mathbf{400}) / 1481.12 (1142) / 1516.34 (800)
VGG16 92.50 95.46 / \mathbf{99.57} / 97.95 / 97.51 541.64 (\mathbf{166}) / \mathbf{395.61} (200) / 608.42 (486) / 1239.03 (600)
Van. ResNet 95.00 95.79 / \mathbf{99.47} / 98.42 / 95.58 948.90 (\mathbf{364}) / \mathbf{604.31} (400) / 701.92 (494) / 1501.13 (800)
Def. ResNet 71.50 50.77 / 33.99 / 47.55 / \mathbf{57.20} 914.58 (404) / 2170.82 (1200) / 969.91 (696) / \mathbf{381.97} (\mathbf{200})
Figure 8: Magnified difference and adversarial examples generated by AdvFlow (ours) and $\mathcal{N}$Attack [33] alongside the clean data for the ImageNet [46] dataset. The target network architecture is Inception-v3 [52]. Columns: AdvFlow (ours), Clean Image, $\mathcal{N}$Attack.

D.4 AdvFlow for People in a Hurry!

Alternatively, one can use the plain structure of AdvFlow for black-box adversarial attacks: we only need to initialize the normalizing flow randomly. In this case, however, the perturbations become noise-like, as in $\mathcal{N}$Attack [33], since the flow-based model has not been trained. Using this approach, we can surpass the performance of the baselines on vanilla DNNs. In other words, giving away a little bit of performance is the price we pay in the pre-trained case to force the perturbations to have a data-like structure, so that the adversaries follow a distribution similar to the data. Table 16 summarizes the performance of randomly initialized AdvFlow in contrast to bandits with time and data-dependent priors [23] and $\mathcal{N}$Attack [33] for vanilla ImageNet [46] classifiers. Also, Figure 9 shows a few adversarial examples generated by randomly initialized AdvFlow in this case.

Table 16: Attack success rate and average (median) of the number of queries needed to generate an adversarial example for ImageNet [46]. For a fair comparison, we first find the samples where all the attack methods are successful, and then compute the average (median) of queries for these samples. Note that for $\mathcal{N}$Attack and AdvFlow we check whether we arrived at an adversarial point every 200 queries, hence the medians are multiples of 200. All attacks are with respect to the $\ell_{\infty}$ norm with $\epsilon_{\max}=8/255$. * The accuracy is computed with respect to the 1000 test data used for attack evaluation.
Attack Bandits [23] / $\mathcal{N}$Attack [33] / SimBA [16] / AdvFlow (Random Init.)
Arch. Acc*(%) Success Rate(%) \uparrow Avg. (Med.) of Queries \downarrow
Inception-v3 99.20 87.80 / 95.06 / 80.95 / \mathbf{97.78} 1081.13 (452) / 745.10 (400) / 1537.71 (1173) / \mathbf{375.00} (\mathbf{200})
VGG16 92.50 95.46 / \mathbf{99.57} / 97.95 / 99.35 586.14 (174) / 418.33 (\mathbf{200}) / 637.96 (503) / \mathbf{299.89} (\mathbf{200})
Van. ResNet 95.00 95.79 / \mathbf{99.47} / 99.16 / 98.42 1053.44 (415) / 672.53 (400) / 758.32 (523) / \mathbf{370.63} (\mathbf{200})
Figure 9: Magnified difference and adversarial examples generated by AdvFlow in trained and un-trained scenarios for the ImageNet [46] dataset. The target network architecture is Inception-v3 [52]. Columns: Pre-trained AdvFlow, Clean Image, Random AdvFlow.