
Noise Modulation: Let Your Model Interpret Itself

Haoyang Li    Xinggang Wang
Huazhong University of Science and Technology
Wuhan, Hubei, China
{lihaoyang,xgwang}@hust.edu.cn
Abstract

Given the great success of Deep Neural Networks (DNNs) and their black-box nature, the interpretability of these models has become an important issue. Most previous research focuses on post-hoc interpretation of a trained model. Recently, however, adversarial training has shown that it is possible for a model to acquire an interpretable input-gradient through training. However, adversarial training lacks efficiency for interpretability. To resolve this problem, we construct an approximation of the adversarial perturbations and discover a connection between adversarial training and amplitude modulation. Based on a digital analogy, we propose noise modulation as an efficient and model-agnostic alternative to train a model that interprets itself with input-gradients. Experimental results show that noise modulation effectively increases the interpretability of input-gradients in a model-agnostic manner.

1 Introduction

Figure 1: Visualization of the input-gradients, i.e. the loss gradient with respect to input pixels. The gradients are scaled to $[0,1]$ for visualization. The first column shows the original inputs. The second, third and fourth columns show the input-gradients of the same model obtained with standard training, noise modulation and adversarial training, respectively. We propose an efficient and model-agnostic alternative that recovers the interpretable input-gradients brought by adversarial training.

Deep Neural Networks (DNNs) have demonstrated huge potential in solving multiple visual recognition problems, e.g., image classification [16], object detection [34] and segmentation [29]. But their black-box nature hinders their further application in scenarios where decisions can result in danger, such as medical prognosis, autonomous driving and verification [24]. Given that these successful models are now being deployed in everyday products, it is important to explain their predictions to ensure reliability, robustness and fairness [3].

There are three major approaches to resolving this lack of interpretability [3]. The first is to use an inherently interpretable model, e.g. a linear model, which results in limited predictive power. The second is to design a model that generates predictions and explanations together, which is challenging due to the lack of ground-truth explanations. The third is to give post-hoc interpretations for a trained model, which is the prevailing approach [3].

The study of interpretability also led to the discovery of adversarial examples [28]. Based on this discovery, adversarial attacks, i.e. generating adversarial examples to fool the targeted model, have evolved into a striking threat against systems equipped with these models [31]. To defend against these attacks, many methods have been proposed and subsequently breached in the past few years [4] [13] [5]. It is adversarial training [11] [19], i.e. augmenting the training set with adversarial examples, that has survived sanity checks and adaptive attacks [4] and remains the leading defense method today.

When [30] examined models obtained through adversarial training, it was surprising to find that these models have clear and interpretable input-gradients, i.e. the loss gradient with respect to the inputs, as further confirmed by [33]. This observation suggests that adversarial training actually gives a regular model the ability to generate interpretable input-gradients; in other words, it is possible to train a regular model that interprets itself with input-gradients [9].

But the side effects of adversarial training are overwhelming: as reported in the literature, it requires more data [22], larger models [32] and much more computation to achieve an accuracy that is comparable to, yet still lower than, that of standardly trained models [30]. In short, it lacks efficiency in terms of interpretability. This leads to the question we study in this paper: Is there an efficient alternative to train a model that interprets itself with input-gradients?

To address this question, we start by looking for an efficient approximation of the adversarial perturbations used for adversarial training. This approximation surprisingly links to the technique of amplitude modulation. In communication, amplitude modulation refers to having the amplitude of the carrier wave vary in proportion to that of the message signal. Based on a digital analogy, we propose noise modulation as an efficient and model-agnostic alternative to train a model with interpretable input-gradients.

Our contributions are summarized as follows:

  • We discover a connection between adversarial training and amplitude modulation through an approximation of adversarial perturbations. It offers a new perspective to understand the effects of adversarial training on input-gradients.

  • We propose an efficient and model-agnostic alternative, namely noise modulation, to train a model that interprets itself with clear and human-aligned input-gradients.

2 Motivation & Approach

2.1 How Adversarial Training Works

Adversarial training was first proposed in [11], but the most popular and effective formulation is that of [19]. The core idea of adversarial training is to add adversarial examples to the training set; [19] even substitutes the whole training set with adversarial examples.

Given a dataset $D=\{(x,y)\,|\,x\in\mathbb{R}^{N},y\in\mathbb{N}\}$ containing pairs of input $x$ and label $y$, a model $f$ parametrized by $\theta$ and a loss function $L$, adversarial training refers to solving the following minimax optimization problem, following the formulation of [19]:

\min_{\theta}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\delta\in B(\varepsilon)}L(f(\theta;x+\delta),y)\right]. \quad (1)

At each step, the inner maximization first searches for an optimal perturbation $\delta^{*}$ in a given attack space $B(\varepsilon)$, e.g. an $l_{\infty}$-norm ball $\{\delta\,|\,\delta\in\mathbb{R}^{N},\varepsilon\in\mathbb{R},\|\delta\|_{\infty}\leq\varepsilon\}$, such that the loss function is maximized; the outer minimization then optimizes the parameters $\theta$ to minimize the loss function just like standard training, but on the perturbed examples $x+\delta^{*}$.

The inner maximization is commonly solved using the Projected Gradient Descent (PGD) attack and its variants. It is an iterative method: at each iteration $t\in[T]$ ($T$ iterations in total), the perturbation is moved in the direction of the input-gradient by a certain step and projected back onto the feasible space, i.e.

\delta^{t+1}=P_{B(\varepsilon)}\left(\delta^{t}+\alpha\,\text{sign}(\nabla_{x+\delta^{t}}L(f(\theta;x+\delta^{t}),y))\right),\quad\delta^{0}\sim\mathcal{U}(-\varepsilon,\varepsilon), \quad (2)

where $P_{B(\varepsilon)}$ denotes projecting the perturbation back onto the attack space and $\alpha\in\mathbb{R}$ denotes the step size at each iteration. After $T$ iterations, the adversarial example $x^{*}=x+\delta^{*}$, perturbed with the optimal perturbation $\delta^{*}$, is fed to the model for training. This only changes the training strategy; at inference, the model still makes its predictions on clean inputs.
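To make this procedure concrete, the following is a minimal PyTorch sketch of the inner maximization in Equation 2. The function name pgd_perturb, the hyper-parameter values and the use of cross-entropy are illustrative assumptions rather than the exact implementation of [19].

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """l_inf PGD (Eq. 2): ascend along the sign of the input-gradient, then project.

    A minimal sketch; hyper-parameter values are illustrative only.
    """
    delta = torch.empty_like(x).uniform_(-eps, eps)   # delta^0 ~ U(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]    # one back-propagation per step
        with torch.no_grad():
            delta = delta + alpha * grad.sign()       # gradient-ascent step
            delta = delta.clamp(-eps, eps)            # project back onto B(eps)
            delta = (x + delta).clamp(0, 1) - x       # keep x + delta a valid image
    return delta.detach()

# Adversarial training then minimizes the loss on the perturbed examples:
#   loss = F.cross_entropy(model(x + pgd_perturb(model, x, y)), y)
```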

The heavy computation of adversarial training mainly stems from the computation of input-gradients at each iteration of PGD. For adversarial training with $T$ iterations of PGD, $T$ more back-propagation passes are computed than in standard training. The other side effects of adversarial training, e.g. requiring larger models and more data, come from the adversarial nature of these perturbations. The perturbations are designed to move the original examples towards the decision boundary, making the resulting adversarial examples much harder for the model to fit [30]. The increased difficulty requires a larger model given the same amount of data [32], which further increases the computational overhead, as the cost of back-propagation also grows with the scale of the model.

2.2 Approximation of Adversarial Perturbations

Figure 2: The process of noise modulation. For each input, a noise is sampled and then multiplied with the input twice. The demodulated input is rescaled to keep the constant component unchanged and then fed for training. We picture the change of the inputs in the spatial and spectral domains (the spectrum is scaled to $[0,1]$ for better visualization). The discriminative features are hidden in the amplitude after modulation, but then recovered by demodulation.

In this section, we derive an efficient approximation of the adversarial perturbations. The first step, as initially proposed by [11], is to substitute the iterative multi-step PGD attack with its one-step approximation, i.e.

\delta^{*}\sim P_{B(\varepsilon)}\left(\delta^{0}+\alpha\,\text{sign}(\nabla_{x+\delta^{0}}L(f(\theta;x+\delta^{0}),y))\right). \quad (3)

The perturbation now becomes a projected linear combination of a uniform noise and the sign of the input-gradient. Our second step is to find a way to approximate the input-gradient. It has been observed that adversarial training makes the input-gradients similar to the corresponding inputs [6]. This can be verified by visualizing the input-gradients of an adversarially trained model, as shown in the fourth column of Figure 1.

This similarity drives us to hypothesize that the sign of the input-gradient can be roughly approximated by the multiplication of the corresponding input and some noise $\delta^{\prime}\in\mathbb{R}^{N}$ (the sign of the input-gradient is, of course, not exactly the same as the corresponding input), i.e. $\text{sign}(\nabla_{x}L)\sim\delta^{\prime}\cdot x$. Substituting this back into Equation 3, the perturbation becomes

\delta^{*}\sim P_{B(\varepsilon)}(\delta+\alpha\cdot\delta^{\prime}\cdot x). \quad (4)

The approximation of Equation 4 estimates the adversarial perturbation with a modified input. This approximately holds after adversarial training, as reported in [2]. If we take this approximation back into adversarial training, the example $x^{*}$ fed for training becomes

x^{*}=x+\delta^{*}\sim P_{x+B(\varepsilon)}\left(x\cdot(1+\alpha\cdot\delta^{\prime})+\delta\right). \quad (5)

For our purpose, the perturbations are not intended to be adversarial; therefore, we further remove the components that are unrelated to the input. The final approximation is

x^{*}\sim x\cdot(1+\alpha\cdot\delta^{\prime}). \quad (6)

This formulation reminds us of the technique of amplitude modulation. In communication, amplitude modulation refers to having the amplitude of a carrier signal, generally of high frequency, vary in proportion to that of the message signal, i.e. the signal that contains the information we wish to transmit, before transmission. At the receiver, the modulated signal is first demodulated with a carrier of the same frequency and then passed through a low-pass filter to recover the original signal. From this perspective, adversarial training works as a dynamic modulator that tunes the model to filter out the informative frequency components by itself.

This motivates us to rethink the formulation from the perspective of modulation and leads to the noise modulation method we propose in the next section.

2.3 Noise Modulation

For a digital signal such as an image, we can also modulate it over a digital carrier of the same size and then demodulate the modulated signal using the same carrier. If we feed this demodulated signal for training, the model will have to learn to filter out the informative frequency components related to its purpose. Based on this digital analogy of amplitude modulation, we propose modulational training as formulated below.

Definition 1

Modulational Training. Given a dataset $D=\{(x,y)\,|\,x\in\mathbb{R}^{N},y\in\mathbb{N}\}$, a model $f$ parametrized by $\theta$, a loss function $L$ measuring the distance between the predictions $f(x)$ and the true label $y$, and a carrier $c\in\mathbb{R}^{N},c\neq 0$, modulational training refers to solving the following optimization problem:

\min_{\theta}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[L\left(f\left(\theta;\,x\cdot\frac{N\cdot c^{2}}{C_{0}}\right),y\right)\right],\quad C_{0}=\sum_{n=0}^{N-1}c_{n}^{2}. \quad (7)

When $c=\mathbf{1}$, modulational training reduces to standard training.

This seems quite different from Equation 6, but we will show that they are essentially the same. For the square of a non-zero carrier, $c^{2}=\{c_{n}^{2}\},c_{n}\in\mathbb{R}$, the $k$-th component of its Discrete Fourier Transform (DFT) $C=\{C_{k}\},C_{k}\in\mathbb{C}$ is

C_{k}=\sum_{n=0}^{N-1}c_{n}^{2}\,\mathrm{e}^{-j\frac{2\pi kn}{N}}. \quad (8)

The constant component $C_{0}=\sum_{n=0}^{N-1}c_{n}^{2}$ is positive as long as $c\neq 0$. Thus the square of this non-zero carrier can be rewritten as a combination of a constant $\lambda=\frac{C_{0}}{N}$ and some noise $\delta^{\prime\prime}\in\mathbb{R}^{N}$, i.e.

c^{2}=\lambda+\delta^{\prime\prime}\implies x\cdot\frac{N\cdot c^{2}}{C_{0}}=x\cdot\left(1+\delta^{\prime\prime}\cdot\frac{N}{C_{0}}\right). \quad (9)

This formulation links back to our approximation of adversarial training (Equation 6). It indicates that it is possible to recover the original signal by filtering out the extra noise. Since filtering out a certain frequency is a linear operation, in fact a convolution, it is learnable by DNNs, and even more naturally by Convolutional Neural Networks (CNNs).
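As a quick sanity check of Equation 9, the following NumPy sketch verifies numerically that the demodulated and rescaled input $x\cdot\frac{N\cdot c^{2}}{C_{0}}$ coincides with $x\cdot(1+\delta^{\prime\prime}\cdot\frac{N}{C_{0}})$, where $\delta^{\prime\prime}=c^{2}-C_{0}/N$ is the zero-mean residual of the squared carrier; the carrier and input used here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32 * 32                              # flattened input dimension (illustrative)
x = rng.random(N)                        # an arbitrary input signal in [0, 1]
c = 0.5 + 0.5 * rng.standard_normal(N)   # an arbitrary non-zero carrier

C0 = np.sum(c ** 2)          # DC component of the DFT of c^2, i.e. Eq. 8 at k = 0
lam = C0 / N                 # the constant lambda in Eq. 9
delta2 = c ** 2 - lam        # the zero-mean residual noise delta''

lhs = x * (N * c ** 2) / C0              # demodulated-and-rescaled input of Eq. 7
rhs = x * (1 + delta2 * N / C0)          # right-hand side of Eq. 9
assert np.allclose(lhs, rhs)             # the two formulations coincide
```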

We still do not know which carrier can completely recover the interpretable input-gradients brought by adversarial training. But as we strive for efficiency, we propose to use noise as the carrier and name this special case of modulational training noise modulation.

Definition 2

Noise Modulation. Given a noise $\delta\in\mathbb{R}^{N}$, e.g. standard Gaussian noise $\delta\sim\mathcal{N}(0,\mathbf{I}_{N})$, noise modulation refers to solving the following optimization problem (the remaining notation is the same as in Definition 1):

\min_{\theta}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[L\left(f\left(\theta;\,x\cdot\frac{N\cdot c^{2}}{C_{0}}\right),y\right)\right],\quad c=\beta+(1-\beta)\cdot\delta,\quad C_{0}=\sum_{n=0}^{N-1}c_{n}^{2}, \quad (10)

where $\beta\in[0,1]$ is a hyper-parameter that controls the ratio of the constant component in the carrier $c$. When $\beta=1$, noise modulation becomes standard training.

The process of noise modulation is illustrated in Figure 2. For each input, a noise is sampled and then multiplied with the input twice. The demodulated input is rescaled to keep the constant component unchanged and then fed for training. The modulated input hides the discriminative features in the amplitude, and these features are then brought back by demodulation. At inference, the inputs are not processed, just as in adversarial training.

As shown in Definition 2, noise modulation requires no extra back-propagation passes compared with standard training. Since the noise is sampled independently of the model, it is model-agnostic by design and can be further accelerated by preprocessing the dataset before training. Its computation grows with the dimension of the data rather than the scale of the model. Theoretically, it is much more efficient than adversarial training.
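To make the procedure concrete, below is a minimal PyTorch sketch of the noise-modulation transform in Definition 2, applied to a batch of images before the forward pass. The function name and the choice of an independent carrier per sample are our own illustrative assumptions.

```python
import torch

def noise_modulate(x, beta=0.5):
    """Modulate, demodulate and rescale a batch of inputs as in Eq. 10.

    x: tensor of shape (B, C, H, W); beta: ratio of the constant component.
    A minimal sketch; only training inputs are transformed, inference is untouched.
    """
    delta = torch.randn_like(x)                 # standard Gaussian noise
    c = beta + (1 - beta) * delta               # carrier c = beta + (1 - beta) * delta
    n = x[0].numel()                            # N: dimension of a single input
    c0 = (c ** 2).flatten(1).sum(dim=1)         # C_0 per sample
    scale = n / c0.view(-1, 1, 1, 1)            # keep the constant component unchanged
    return x * c * c * scale                    # x * N * c^2 / C_0

# During training, each mini-batch is transformed before the forward pass:
#   logits = model(noise_modulate(images, beta=0.5))
# No extra back-propagation is introduced; the overhead is one element-wise pass over the data.
```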

3 Experiments

Figure 3: Experiment results on CIFAR-10 (top) and MNIST (bottom). Each color marks a model. The dotted frame marks the inputs. Noise modulation improves the interpretability of input-gradients significantly.

In this section, we first define a metric to measure the visual interpretability of input-gradients so that we can validate the effectiveness of noise modulation both qualitatively and quantitatively. We then show the trade-off between interpretability and accuracy, and the influence of different choices of noise used in noise modulation.

All of the following experiments are conducted using PyTorch [20] with a random seed fixed at 0. Hyper-parameters are kept at their default values unless otherwise mentioned.

3.1 Evaluating Input-gradients

By interpretable input-gradients, we mean that the input-gradient of a model serves as an attribution map telling us how each dimension of an input influences the model's decision. As demonstrated by the existence of adversarial examples [28], moving the input in the direction of its input-gradient increases the loss. This indicates that input-gradients contain information crucial to the model's decision, while not necessarily making sense to humans.

Qualitatively, the input-gradient is more interpretable if it is more visually aligned with the input [9]. Given that numerically the input-gradient is much smaller than the input, we use the absolute cosine similarity over their signs to define the Visual Interpretability of Input-gradients (VII) as follows.

Definition 3

Visual Interpretability of Input-gradients (VII). Given a differentiable model $f$ and a loss function $L$, e.g. cross-entropy, measuring the distance between its prediction and the ground truth, its visual interpretability of input-gradients over a test dataset $\mathcal{D}=\{(x,y)\,|\,x\in\mathbb{R}^{N},y\in\mathbb{N}\}$, denoted $\text{VII}(L,f,\mathcal{D})$, is defined as

\text{VII}(L,f,\mathcal{D})=\mathbb{E}_{(x,y)\sim\mathcal{D}}\frac{|\left<d_{x},g_{x}\right>|}{\|d_{x}\|\cdot\|g_{x}\|},\quad d_{x}=\text{sign}(x-\bar{x}),\quad g_{x}=\text{sign}(\nabla_{x}L(f(x),y)), \quad (11)

where $\bar{x}$ denotes the average of $x$ over the whole dataset.

A higher VII indicates that, under the given loss, the model has input-gradients that are more interpretably aligned with the inputs over the test set. In the following sections, we use cross-entropy as the loss function to calculate this metric. The input-gradients are scaled to $[0,1]$ for visualization.
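A minimal PyTorch sketch of computing this metric over a test loader is shown below; the batching scheme, the pre-computed dataset mean and the function name are our own assumptions.

```python
import torch
import torch.nn.functional as F

def vii(model, loader, x_mean):
    """Visual Interpretability of Input-gradients (Eq. 11), averaged over a test set.

    x_mean: the average input over the dataset (same shape as a single input).
    A minimal sketch; assumes a classification model and cross-entropy loss.
    """
    model.eval()
    scores = []
    for x, y in loader:
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        g = torch.autograd.grad(loss, x)[0]           # input-gradient
        d = (x - x_mean).sign().flatten(1)            # d_x = sign(x - x_bar)
        s = g.sign().flatten(1)                       # g_x = sign of the input-gradient
        cos = (d * s).sum(1) / (d.norm(dim=1) * s.norm(dim=1) + 1e-12)
        scores.append(cos.abs())                      # absolute cosine similarity per sample
    return torch.cat(scores).mean().item()
```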

3.2 Effectiveness of Noise Modulation

Figure 4: Trade-off between the visual interpretability of input-gradients and accuracy. The dotted frame marks the inputs. The effect of noise modulation plateaus and fluctuates when $\beta$ is smaller than $0.6$. When $\beta=0.8$, the input-gradients become only just visually interpretable.
Figure 5: The influence of different noises. The dotted frame marks the inputs. The corresponding noises are scaled to $[0,1]$ and visualized on the right with corresponding colors and markers. All of these noises yield interpretable input-gradients. Gaussian noise has the best overall performance, as expected. Rayleigh noise maintains most of the accuracy at a greater cost to interpretability. Laplacian noise achieves the best interpretability, as well as the lowest accuracy.

In this section, we demonstrate the effectiveness of noise modulation on input-gradients. We evaluate our method on MNIST [18] and CIFAR-10 [10]. We use two simple models, a three-layer Multi-Layer Perceptron (MLP) and a six-layer Fully Convolutional Neural Network (FCNN), on both datasets. In addition, we evaluate LeNet [17] on MNIST, and VGG [25] and ResNet [12] on CIFAR-10.

Since the focus of this paper is NOT the state-of-the-art performance of these models, for fair comparison we use a basic setting without fine-tuning each model. All models are trained from scratch with a mini-batch size of 64 for 50 epochs. The optimizer is Adam [15] with a fixed learning rate of 0.001.

All inputs are first scaled to $[0,1]$ and preprocessed with normalization; no extra data augmentation is used. For noise modulation, the ratio of the constant component $\beta$ is set to $0.5$ and the noise used as the carrier is standard Gaussian noise. The model with the highest validation accuracy during training is saved for comparison.
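As a reference for this setup, the following sketch shows how one of the configurations (ResNet-18 on CIFAR-10) might be trained with noise modulation; the torchvision model, the normalization statistics and the ordering of normalization before modulation are our own assumptions, and model selection on a validation split is omitted for brevity.

```python
import torch
import torchvision
import torch.nn.functional as F
from torchvision import transforms

def noise_modulate(x, beta=0.5):
    # Same transform as sketched in Section 2.3, repeated for self-containment.
    c = beta + (1 - beta) * torch.randn_like(x)
    c0 = (c ** 2).flatten(1).sum(dim=1).view(-1, 1, 1, 1)
    return x * c * c * (x[0].numel() / c0)

# Inputs scaled to [0, 1] by ToTensor and normalized; no extra augmentation.
norm = transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
train_set = torchvision.datasets.CIFAR10(
    "./data", train=True, download=True,
    transform=transforms.Compose([transforms.ToTensor(), norm]))
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)    # one of the evaluated models
opt = torch.optim.Adam(model.parameters(), lr=0.001)   # fixed learning rate

for epoch in range(50):                                # 50 epochs, as in the text
    for x, y in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(noise_modulate(x, beta=0.5)), y)
        loss.backward()
        opt.step()
```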

The experiment results are presented in Figure 3. We observe that standard models already produce basically interpretable input-gradients on MNIST, but on CIFAR-10 they produce input-gradients with irregular noise that makes no sense to humans. Noise modulation significantly increases the interpretability of input-gradients at a small cost in accuracy, except for the MLP on MNIST. The standardly trained MLP on MNIST already has an input-gradient with most of its non-zero components centered around the digits; noise modulation focuses its gradients around some critical points on the digits but leaves a noisier background that reduces the VII metric.

We also observe that a larger model is more capable of producing interpretable input-gradients, although a simple FCNN can also produce input-gradients that are reasonable to humans. It is clear that noise modulation is effective at increasing the interpretability of input-gradients. More results can be found in Figure 6 at the end of this paper.

3.3 Trade Interpretability for Accuracy

While interpretability increases, noise modulation still brings a drop in accuracy. The constant component in Definition 2 is designed to deal with this problem.

On the same ResNet-18 used in Section 3.2, we vary the ratio of the constant component $\beta$ in the carrier and check its influence on accuracy and interpretability. Theoretically, a larger $\beta$ implies a noisier input-gradient and a higher accuracy, since noise modulation becomes standard training when $\beta=1$.

The experiment results are presented in Figure 4. As expected, a larger $\beta$ yields higher accuracy and lower visual interpretability of input-gradients. To human perception, the input-gradients become roughly interpretable when $\beta$ is smaller than $0.8$. Reducing $\beta$ further increases interpretability until it plateaus and fluctuates once the constant component of the carrier is smaller than $0.5$ and the noise dominates the carrier.

Given these experiment results, we recommend setting the constant component ratio $\beta$ to $0.8$ for a basically interpretable input-gradient, and to $0.5$ to obtain the best interpretability at some cost in accuracy.

3.4 Choice of Noises

The choice of noise is another hyper-parameter in noise modulation. Intuitively, a Gaussian noise spanning the whole spectrum is the proper choice, as we intend to have the model learn to filter out the informative features by itself.

Besides standard Gaussian noise, we test five other noises on the same ResNet-18 as in Section 3.2, with every condition fixed except the type of noise. These noises are Uniform, Laplacian, Gamma, Exponential and Rayleigh noise. The Uniform noise is sampled from $[0,1]^{N}$. The distribution of the Laplacian noise peaks at 0 and its exponential decay is set to $1$. The shape of the Gamma distribution is set to $1$, as is its scale. For the Exponential and Rayleigh noise, the scale is also set to $1$.
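For reference, the carriers compared here could be sampled as in the sketch below; the use of NumPy's samplers and the exact parameterization are our own assumptions, chosen to be consistent with the scale and shape values stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (3, 32, 32)   # one CIFAR-10 input (illustrative)

noises = {
    "gaussian":    rng.standard_normal(shape),     # standard Gaussian, the default carrier
    "uniform":     rng.uniform(0.0, 1.0, shape),   # sampled from [0, 1]^N
    "laplacian":   rng.laplace(0.0, 1.0, shape),   # peaks at 0, exponential decay 1
    "gamma":       rng.gamma(1.0, 1.0, shape),     # shape 1, scale 1
    "exponential": rng.exponential(1.0, shape),    # scale 1
    "rayleigh":    rng.rayleigh(1.0, shape),       # scale 1
}

# Each noise delta is then used as the carrier via c = beta + (1 - beta) * delta (Eq. 10).
```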

The experiment results are presented in Figure 5. All of these noises yield much more interpretable input-gradients. As expected, Gaussian noise has the best overall performance. Rayleigh noise brings the smallest accuracy loss, but also the smallest gain in interpretability. Laplacian noise brings the largest gain in interpretability, but also sacrifices the most accuracy.

4 Related Work

Adversarial Training and Interpretability. Adversarial training was initially proposed by [11] as a defense against adversarial attacks, but the most popular and effective variant is the PGD adversarial training proposed by [19]. The discovery that adversarial training results in more interpretable input-gradients was first mentioned in [30]. [33] gave an empirical interpretation of adversarially trained models and further confirmed the increased interpretability brought by adversarial training. More theoretical analysis of the mechanism of adversarial training can be found in [2]. [9] [14] discussed the connection between adversarial robustness and interpretability. [8] tried to improve the interpretability of models using adversarial training. [6] [21] [7] tried to utilize the similarity between input-gradients and inputs to increase adversarial robustness. To our knowledge, we make the first step towards exploiting this similarity to increase the interpretability of models without adversarial training.

Input-gradients and Interpretability. The existence of tiny adversarial perturbations easily found through gradient ascent [28] indicates that input-gradients contain information critical to the model's predictions, but the vanilla input-gradients of a standard model generally give no interpretable information to humans [27]. Many interpretation methods process the vanilla gradients with extra operations to produce an interpretable heatmap, including multiplying the gradient with the input (Gradient*Input [24]), integrating the gradients (Integrated Gradients [27]), averaging the gradients of perturbed inputs (SmoothGrad [26]), and combining gradients with features (GradCAM [23]). All of these heatmaps, except for GradCAM and the vanilla gradients, have been shown by [1] to produce results that are largely independent of both the data and the model parameters. However, the input-gradient has a higher resolution than GradCAM; once made interpretable, it gives a more precise explanation of the model's decision.

Figure 6: Visualization of 80 more input-gradients with corresponding inputs for the model trained with noise modulation in Section 3.2.

5 Conclusion

In this paper, we connect adversarial training with amplitude modulation through an approximation of the adversarial perturbations. This offers a new perspective for understanding the effects of adversarial training on input-gradients. Based on a digital analogy of amplitude modulation, we propose noise modulation as an efficient and model-agnostic alternative to train a model that interprets itself with clear and human-aligned input-gradients. We confirm the effectiveness of noise modulation on input-gradients, characterize the trade-off between interpretability and accuracy, and analyze the effects of different choices of noise. We believe this work will serve as a baseline towards efficient training strategies that grant models interpretability without hurting their ability.

References

  • [1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian J. Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, pages 9525–9536, 2018.
  • [2] Zeyuan Allen-Zhu and Yuanzhi Li. Feature purification: How adversarial training performs robust deep learning. CoRR, abs/2005.10190, 2020.
  • [3] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Gradient-Based Attribution Methods, pages 169–191. Springer International Publishing, Cham, 2019.
  • [4] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 274–283. PMLR, 2018.
  • [5] Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In AISec@CCS, pages 3–14. ACM, 2017.
  • [6] Alvin Chan, Yi Tay, Yew-Soon Ong, and Jie Fu. Jacobian adversarially regularized networks for robustness. In ICLR. OpenReview.net, 2020.
  • [7] Alvin Chan, Yi Tay, and Yew-Soon Ong. What it thinks is important is important: Robustness transfers through input gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 332–341, 2020.
  • [8] Yinpeng Dong, Hang Su, Jun Zhu, and Fan Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
  • [9] Christian Etmann, Sebastian Lunz, Peter Maass, and Carola Schönlieb. On the connection between adversarial robustness and saliency map interpretability. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 1823–1832. PMLR, 2019.
  • [10] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR (Poster), 2015.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [13] Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defense: Ensembles of weak defenses are not strong. In WOOT. USENIX Association, 2017.
  • [14] Beomsu Kim, Junghoon Seo, and Taegyun Jeon. Bridging adversarial robustness and gradient interpretability. arXiv preprint arXiv:1903.11626, 2019.
  • [15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [18] Yann LeCun, Corinna Cortes, and CJ Burges. The MNIST database of handwritten digits, 2010.
  • [19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR (Poster). OpenReview.net, 2018.
  • [20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • [21] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, pages 1660–1669. AAAI Press, 2018.
  • [22] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS, pages 5019–5031, 2018.
  • [23] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626. IEEE Computer Society, 2017.
  • [24] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016.
  • [25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [26] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
  • [27] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017.
  • [28] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR (Poster), 2014.
  • [29] Georgios Takos. A survey on deep learning methods for semantic image segmentation in real-time. CoRR, abs/2009.12942, 2020.
  • [30] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR (Poster). OpenReview.net, 2019.
  • [31] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan L. Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, pages 1378–1387. IEEE Computer Society, 2017.
  • [32] Cihang Xie and Alan L. Yuille. Intriguing properties of adversarial training at scale. In ICLR. OpenReview.net, 2020.
  • [33] Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7502–7511. PMLR, 2019.
  • [34] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. CoRR, abs/1905.05055, 2019.