

Paper ID 3772
Department of Electrical Engineering, Yale University, New Haven, CT, USA
{abhishek.moitra, youngeun.kim, priya.panda}@yale.edu

Adversarial Detection without Model Information

Abhishek Moitra, Youngeun Kim, and Priyadarshini Panda
(Anonymous ECCV submission)
Abstract

Prior state-of-the-art adversarial detection works are classifier model dependent, i.e., they require classifier model outputs and parameters for training the detector or during adversarial detection. This makes their detection approach classifier model specific. Furthermore, classifier model outputs and parameters might not always be accessible. To this end, we propose a classifier model independent adversarial detection method that uses a simple energy function to distinguish between adversarial and natural inputs. We train a standalone detector, independent of the classifier model, with layer-wise energy separation (LES) training to increase the separation between natural and adversarial energies. With this, we perform energy distribution-based adversarial detection. Our method achieves performance comparable to state-of-the-art detection works (ROC-AUC > 0.9) across a wide range of gradient-based, score-based and Gaussian noise attacks on the CIFAR10, CIFAR100 and TinyImagenet datasets. Furthermore, compared to prior works, our detection approach is light-weight, requires less training data (40% of the original dataset) and is transferable across different datasets. For reproducibility, we provide the layer-wise energy separation training code at https://github.com/Intelligent-Computing-Lab-Yale/Energy-Separation-Training

Keywords:
Adversarial Detection, Privacy-preserving, Energy-efficient Neural Networks

1 Introduction

Deep Neural Networks (DNNs) are vulnerable to adversarial attacks [2, 24], where small, crafted perturbations are added to natural images to fool a classifier model into high-confidence misclassifications. Prior adversarial defense works have focused on improving the prediction of classifier models on adversarial inputs. These works have used techniques such as adversarial training [21], input transformation [13], and randomization [28], among others. Recently, adversarial detection has emerged as a strong defense strategy against adversarial attacks. Here, a “detector” network is trained to identify adversarial and natural inputs in a system [22, 32, 30].

Figure 1: (a) In our proposed approach, we assume that the classifier model's ($M_C$) information (outputs and parameters) is inaccessible for creating the adversarial detector or during adversarial detection. However, we do require access to the dataset during detector training. (b) The detector $\mathcal{D}$ is trained on natural ($x_{nat}$) and adversarial ($x_{adv,t}$) data using layer-wise energy separation (LES) training. The $x_{adv,t}$ data is generated using $x_{nat}$ and model $S_T$. (c) The LES-trained detector can identify adversaries generated using a different model $S_V$. Note, $S_T$ and $S_V$ can have different network architectures from the classifier model $M_C$.

Prior adversarial detection works heavily rely on the classifier model's outputs and parameters for detection. However, this reliance on the classifier model has two major drawbacks: 1) it reduces the generalizability of adversarial detectors, meaning that detector networks are specific to each classifier model and need to be retrained if the classifier model changes; 2) classifier model parameters and outputs might be inaccessible [26, 10, 31]. To this end, we propose an adversarial detection approach with the following assumption: the outputs and parameters of an adversarially vulnerable classifier model $M_C$ are inaccessible for adversarial detection. In other words, no classifier model information is required for creating the detector or during adversarial detection. Fig. 1a compares the assumptions in this work with prior adversarial detection works. Note, although our work does not require classifier model information, we do require access to the data for detection.

In this work, we train a small, standalone detector network $\mathcal{D}$ without any information about $M_C$. First, we propose a simple energy function that maps a high-dimensional input feature space to a one-dimensional output (called the energy). As adversaries are created by adding noise to natural data, we find that the energies corresponding to natural and adversarial data are different. However, this difference alone is not sufficient for reliable adversarial detection. To this end, we employ layer-wise energy separation (LES) training with an energy distance-based loss function to maximize the separation between natural and adversarial energies.

Table 1: Table showing the key differences between prior adversarial detection works and our proposed method (✓: addressed / ✕: not addressed; G: gradient-based, S: score-based, GN: Gaussian noise attacks). Note, A → B signifies that a detector trained on A-type attacks can successfully detect B-type attacks.
Work                                                       Transferability Across Attacks   Transferability Across Datasets   Dataset Access   Classifier Model Access Required?
Metzen et al. [22], Yin et al. [32], Sterneck et al. [25]  G → G                            ✕                                 Full Access      Intermediate activations
Moitra et al. [23], Xu et al. [30], Grosse et al. [12]     G → G                            ✕                                 Full Access      Model training
Huang et al. [15]                                          G → G                            ✕                                 Full Access      Model outputs
Ours                                                       G → G, G → S, G → GN             ✓                                 Partial Access   No access

In Fig. 1b, it can be seen that the detector $\mathcal{D}$ is trained on natural ($x_{nat}$) and adversarial ($x_{adv,t}$) data. Here, $x_{adv,t}$ is created using a model $S_T$ that is different from $M_C$. Interestingly, we find that the LES-trained detector can identify attacks generated from a model $S_V$ that is different from both $S_T$ and $M_C$. We perform extensive analysis on a wide range of gradient-based [21, 4, 9], score-based [3, 7] and decision-based [6] attacks on the CIFAR10, CIFAR100 and TinyImagenet datasets. Through our experiments, we discover certain interesting features of our detection approach that have not been shown in prior works [11, 22, 32, 23]. ➊ First, our approach is agnostic to the model used for generating attacks. For example, a detector trained using $S_T$ = VGG16 can identify attacks generated using $S_V$ = ResNet18. ➋ Compared to prior works that have shown transferability only from gradient-based to gradient-based attacks, our detection approach can transfer across different types of adversarial attacks. For example, a detector trained on gradient-based attacks can detect certain score-based attacks such as Square [3] and Auto-PGD [7]. ➌ Additionally, through our ablation studies, we find that the error amplification effect discussed in [20] aids in improving the energy separation of our detector. Consequently, we find that a higher Lipschitz constant [5] enables better adversarial detection.

As mentioned earlier, our detection approach requires access to the data. To this end, we evaluate the performance of our detection approach with reduced access to the data. ➊ We find that LES training can achieve high performance even with limited data accessibility (e.g., with access to only 40% of the dataset). ➋ We also find that the detector is transferable across different datasets. For instance, a detector constructed using CIFAR100 data can detect adversarial attacks on the CIFAR10 and TinyImagenet datasets. However, a small amount (about 200 data samples) of the CIFAR10 and TinyImagenet datasets is still required for the transfer. Table 1 compares the salient features of our detection work with prior state-of-the-art adversarial detectors.

In summary, the contributions of our work are:

  • We propose a simple and light-weight energy metric to distinguish between clean and adversarial inputs. To perform adversarial detection, we maximize the energy distance between them using an energy distance-based objective function and a layer-wise energy separation (LES) training approach. Our adversarial detection approach does not require any access to the classifier model parameters or outputs.

  • We perform extensive experiments on benchmark datasets like CIFAR10, CIFAR100, and TinyImagenet with state-of-the-art gradient-based [21, 18], score-based [7] and decision-based [6] adversarial attacks. We find our approach yields state-of-the-art detection across different adversarial attacks and datasets.

  • Through our experiments, we find that our detection approach is agnostic to the model used for attack generation. Additionally, we show that a detector trained on gradient-based attacks can detect certain score-based attacks and Gaussian noise attacks.

  • Through extensive experiments, we show that the LES training can achieve high performance with limited access to the dataset. Further, our adversarial detector can transfer across different datasets. Using transferability, a detector trained on one dataset can be reused to detect adversaries created on another dataset.

2 Background

2.1 Adversarial Attacks

In this work, we consider three different kinds of adversarial attacks. Here, we discuss those attacks in detail.

Gradient-based attacks: These attacks require gradients with respect to the input to craft adversaries. The Fast Gradient Sign Method (FGSM) is a simple one-step adversarial attack [18]. Several works have shown that the FGSM attack can be made stronger with momentum (MIFGSM) [9], random initialization (FFGSM) [27], and input diversification (DIFGSM) [29]. In contrast, the Basic Iterative Method (BIM) is an iterative attack proposed in [18]. The BIM attack with random restarts is called the Projected Gradient Descent (PGD) attack [21]. A targeted version of the PGD attack (TPGD) [33] can fool the model into misclassifying an input as a desired class. Other multi-step attacks like Carlini-Wagner (C&W) [4] and PGD-L2 [21] are crafted using the L2-norm distance between the adversarial and natural images.

Score-based attacks: Score-based attacks do not require input gradients to craft adversaries. The Square attack (SQR) [3] uses multiple queries to perturb randomly selected square regions of the input. Other score-based attacks like AutoAttack (AUTO) and Auto-PGD (APGD) craft adversaries by automatically choosing the optimal attack parameters [7].

Decision-based attacks: These attacks craft adversaries based on the decision outputs of a model. The Fast Adaptive Boundary (FAB) attack [6] finds the minimum perturbation required to cause a misclassification. The Gaussian Noise (GN) attack is created by adding Gaussian noise to the input.

2.2 Performance Metrics for Adversarial Detection

To evaluate the adversarial detection performance, we use three metrics: ROC-AUC, Accuracy and Error.

Area Under the ROC Curve (AUC Score): The area under the ROC curve compares the True Positive Rate (TPR) and the False Positive Rate (FPR) of a classifier. A high ROC-AUC score signifies a good classifier [32].

Accuracy and Error: In this work, Accuracy is defined as the fraction of natural inputs that are correctly classified by the classifier model and not rejected by the adversarial detector. Error is defined as the fraction of adversarial inputs that are classified incorrectly by the classifier model and not rejected by the detector. Thus, high accuracy and low error are desirable.
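As a reference, these three metrics can be computed from detector and classifier outputs as in the sketch below (a minimal illustration; the variable names and the use of scikit-learn's roc_auc_score are our assumptions, not the paper's code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(scores_nat, scores_adv, rejected_nat, rejected_adv,
                      clf_correct_nat, clf_correct_adv):
    """scores_*: detector scores per input (higher = more adversarial-looking).
    rejected_*: boolean arrays, True if the detector rejected the input.
    clf_correct_*: boolean arrays, True if the classifier predicted correctly."""
    # ROC-AUC over the detector scores (adversarial inputs = positive class).
    labels = np.concatenate([np.zeros(len(scores_nat)), np.ones(len(scores_adv))])
    auc = roc_auc_score(labels, np.concatenate([scores_nat, scores_adv]))
    # Accuracy: natural inputs that are correctly classified and not rejected.
    accuracy = np.mean(clf_correct_nat & ~rejected_nat)
    # Error: adversarial inputs that are misclassified and not rejected.
    error = np.mean(~clf_correct_adv & ~rejected_adv)
    return auc, accuracy, error
```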

3 Related Works

3.1 Adversarial Classification

Here, the objective is to improve the adversarial classification accuracy of a vulnerable model. Guo et al. [13] proposed input feature transformation using JPEG compression, followed by training on the compressed feature space, to improve the classification performance of the classifier model. Madry et al. [21] proposed adversarial training, in which a classifier model is trained on adversarial and clean data to improve both adversarial and clean classification performance. Following this, several works have used noise injection into parameters [14] and ensemble adversarial training to harden the classifier model against a wide range of attacks. Lin et al. [20] showed that adversarial classification can be improved by reducing error amplification in a network. Hence, they used adversarial training with regularization to constrain the Lipschitz constant of the network to less than unity. In our work, we do not improve the adversarial classification accuracy of the classifier model. Rather, we focus on adversarial detection to distinguish adversarial data from natural data.

3.2 Adversarial Detection

3.2.1 Works requiring classifier model training

The following works require training of the classifier model to perform adversarial detection. Xu et al. [30] propose a method that uses the outputs of multiple classifier models to estimate the difference between natural and adversarial data. Here, the classifier models are trained on natural inputs with different feature squeezing techniques applied at the inputs. Moitra et al. [23] use the features from the first layer of the underlying model to perform adversarial detection. In particular, they perform adversarial detection using hardware signatures in DNN accelerators. Further, several recent works like [11] have shown that adversarial and natural data have different distributions. While Grosse et al. [12] train the classifier model with an additional class label indicating adversarial data, Gong et al. [11] train a separate binary classifier on the natural and adversarial data generated from the classifier model to perform adversarial detection. Lee et al. [19] train the classifier model with a Mahalanobis distance-based classifier and use the Mahalanobis distance to distinguish natural from adversarial data.

3.2.2 Works requiring classifier model outputs

These works show that adversarial and natural inputs can be distinguished based on the intermediate features of the classifier model. Metzen et al. [22] and Sterneck et al. [25] use the intermediate features to train a simple binary classifier for adversarial detection. While Metzen et al. [22] use a heuristic-based method to determine the point of attachment of the detector to the classifier model, Sterneck et al. [25] use a structured metric called adversarial noise sensitivity to do the same. Similarly, Yin et al. [32] use asymmetric adversarial training to train detectors on the intermediate features of the classifier model. Ahuja et al. [1] use the data distributions from the intermediate layers of the classifier model and Gaussian mixture models to perform adversarial detection. Huang et al. [15] use the confidence scores from the classifier model to estimate the relative score difference between clean and adversarial inputs, and additionally recommend training the classifier model on noisy data to improve adversarial detection performance.

Evidently, the prior works discussed in Section 3.1 and Section 3.2.2 require training or the outputs of the classifier model for adversarial detection. In contrast, our work focuses on performing adversarial detection without accessing the classifier model.

4 Layer-wise Energy Separation (LES) Training

4.1 Training Methodology

In this work, we define a simple energy function $\mathcal{E}^l$ for layer $l$, shown in Eq. 1. The energy is defined as the average magnitude of the feature outputs $Z_l^{c,h,w}$ of a particular layer $l$. Here $c$, $h$, and $w$ are the number of output channels, height, and width, respectively, of the feature outputs of layer $l$. In LES training, we train the detector network to maximize the energy separation between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$, the energies corresponding to natural and adversarial inputs, respectively. Note, the energy separation is the difference between the mean values of the $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ distributions.

$$\mathcal{E}^{l}=\frac{1}{chw}\sum_{i=1}^{c}\sum_{j=1}^{h}\sum_{k=1}^{w}\left|Z^{i,j,k}_{l}\right|. \qquad (1)$$
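For reference, Eq. 1 amounts to averaging the absolute activations of a layer; a minimal PyTorch sketch follows (batching over the first dimension is an implementation assumption):

```python
import torch

def layer_energy(z: torch.Tensor) -> torch.Tensor:
    """Energy of Eq. 1: mean absolute value of a layer's feature map.

    z: feature tensor of shape (batch, c, h, w); returns one energy per sample.
    """
    return z.abs().mean(dim=(1, 2, 3))
```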

Training objective: We train the detector using the following objective function:

$$\max_{\theta}\ \big\lVert\mathcal{E}_{nat}^{l}-\mathcal{E}_{adv}^{l}\big\rVert, \qquad (2)$$

corresponding to any layer $l$. Here, $\theta$ denotes the parameters of the detector network. To achieve this, we design an energy separation-based loss function that minimizes the distance between $\mathcal{E}_{nat}^{l}$ ($\mathcal{E}_{adv}^{l}$) and $\lambda_n$ ($\lambda_a$). Here, $\lambda_n$ and $\lambda_a$ are hyper-parameters denoting the desired natural and adversarial energies, respectively. An indicator variable $y$ takes the value 1 for natural inputs and 0 for adversarial inputs.

$$\mathcal{L}=y\,\mathcal{L}_{MSE}(\mathcal{E}_{nat}^{l},\lambda_{n})+(1-y)\,\mathcal{L}_{MSE}(\mathcal{E}_{adv}^{l},\lambda_{a}). \qquad (3)$$
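A corresponding sketch of the loss in Eq. 3, where each sample is pulled toward its target energy ($\lambda_n$ for natural, $\lambda_a$ for adversarial); the default values shown are illustrative:

```python
import torch
import torch.nn.functional as F

def les_loss(energy: torch.Tensor, y: torch.Tensor,
             lambda_n: float = 0.1, lambda_a: float = 0.9) -> torch.Tensor:
    """Eq. 3 for a mini-batch: y (float tensor) is 1 for natural, 0 for adversarial."""
    target = y * lambda_n + (1.0 - y) * lambda_a   # desired energy per sample
    return F.mse_loss(energy, target)
```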

Note, our training objective is loosely based on the current-signature separation in recent work [23]. We would like to highlight that [23] focuses on modifying the classifier model's layers to perform adversarial detection. In contrast, we train the detector independently of the classifier model, which enables high detection scores along with transferability across multiple datasets and attacks.

Dataset generation: The training dataset for the detector contains an equal number of natural ($x_{nat}$) and adversarial ($x_{adv,t}$) samples. Here, $x_{adv,t}$ is generated using the model $S_T$, which has a different network architecture than $M_C$ and is trained on $x_{nat}$ using the stochastic gradient descent (SGD) algorithm.

1: Input: $n$-layered detector $\mathcal{D}$, $x_{nat}$, $x_{adv,t}$, $s_{nat}$
2: Output: $n$-layered trained detector $\mathcal{D}_T$, $\mathcal{E}_{Th}$
3: for all i = 1 to n do
4:     for all j = 1 to $N_{epoch}$ do
5:         Freeze layers $[0, i-1]$
6:         Fetch mini-batches $X_n$ and $X_a$ from $x_{nat}$ and $x_{adv,t}$, respectively
7:         Compute $\mathcal{E}_{nat}^{i}$ and $\mathcal{E}_{adv}^{i}$ on the mini-batch
8:         Compute the loss function using Eq. 3
9:         Optimize layer $i$ using the loss function
10:    end for
11: end for
12: Generate the distribution $\mathcal{E}_{s_{nat}}$ using $s_{nat}$ and $\mathcal{D}_T$
13: $\mathcal{E}_{Th}$ ← $K^{th}$ percentile of $\mathcal{E}_{s_{nat}}$
Algorithm 1: Layer-wise Energy Separation (LES) training algorithm
Figure 2: (a) The separation between the clean and adversarial energy distributions increases with each layer as the LES training proceeds. (b) The value of $\mathcal{E}_{Th}$ is set as the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution.

LES training methodology: Algorithm 1 shows the LES training approach. LES training begins with a randomly initialized $n$-layered detector, where $n$ is a hyper-parameter. Other inputs include the natural samples $x_{nat}$, the adversarial samples $x_{adv,t}$, and $s_{nat}$. $s_{nat}$ is created by randomly selecting 200 data samples from $x_{nat}$.

The training is carried out in multiple stages. In each stage $i$, layers $[0, i-1]$ are frozen and layer $i$ is optimized for energy-distance maximization. In each stage, mini-batches $X_n$ and $X_a$ are fetched from $x_{nat}$ and $x_{adv,t}$, respectively. Following this, the natural and adversarial energies for layer $i$ are computed. These energies are used to compute the loss in Eq. 3, which is then used to optimize the parameters of the $i^{th}$ layer of the detector. As layers $[0, i-1]$ are frozen, the energy separation obtained up to layer $i-1$ is preserved. Thus, training the $i^{th}$ layer further increases the energy separation. A minimal PyTorch sketch of this stage-wise loop is shown below.
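The sketch assumes the layer_energy and les_loss helpers from the earlier snippets and a plain nn.Sequential detector; the optimizer choice and data loading are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

def les_train(detector: nn.Sequential, loader_nat, loader_adv,
              lambdas_a, lambda_n=0.1, epochs=500, lr=5e-3):
    """Stage-wise LES training: one stage per convolutional layer of the detector."""
    conv_ids = [i for i, m in enumerate(detector) if isinstance(m, nn.Conv2d)]
    for stage, layer_id in enumerate(conv_ids):
        # Freeze every layer except the one being trained in this stage.
        for i, m in enumerate(detector):
            for p in m.parameters():
                p.requires_grad = (i == layer_id)
        opt = torch.optim.SGD(detector[layer_id].parameters(), lr=lr)
        for _ in range(epochs):
            for (x_nat, _), (x_adv, _) in zip(loader_nat, loader_adv):
                x = torch.cat([x_nat, x_adv])
                y = torch.cat([torch.ones(len(x_nat)), torch.zeros(len(x_adv))])
                z = detector[: layer_id + 1](x)        # forward up to the current layer
                e = layer_energy(z)                    # Eq. 1
                loss = les_loss(e, y, lambda_n, lambdas_a[stage])  # Eq. 3
                opt.zero_grad()
                loss.backward()
                opt.step()
    return detector
```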

After LES training, we obtain the trained detector $\mathcal{D}_T$. At this step, we use $s_{nat}$ and $\mathcal{D}_T$ to obtain a sample natural energy distribution $\mathcal{E}_{s_{nat}}$, computed at the final layer of $\mathcal{D}_T$. Next, the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution is chosen as the threshold energy $\mathcal{E}_{Th}$. All energy values greater than $\mathcal{E}_{Th}$ are classified as adversarial inputs, and all values below it as natural inputs. A sketch of this thresholding step follows.
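This is again a sketch: layer_energy is the hypothetical helper defined earlier, and s_nat is the 200-sample natural batch from Algorithm 1.

```python
import numpy as np
import torch

@torch.no_grad()
def fit_threshold(detector, s_nat, k=95):
    """E_Th: the K-th percentile of the sample natural energy distribution."""
    energies = layer_energy(detector(s_nat)).cpu().numpy()
    return float(np.percentile(energies, k))

@torch.no_grad()
def is_adversarial(detector, x, e_th):
    """Flag inputs whose final-layer energy exceeds the threshold E_Th."""
    return layer_energy(detector(x)) > e_th
```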

Demonstrating the Efficacy of LES training and Choosing $\mathcal{E}_{Th}$: Fig. 2(a) shows the energy separation between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ obtained after each layer of a three-layered detector trained on the CIFAR100 dataset in the layer-wise manner. The adversarial inputs correspond to a PGD(4,2,10) attack (PGD with parameters $\epsilon$: 4/255, $\alpha$: 2/255, steps: 10) on the CIFAR100 dataset. It can be seen that the energy separation increases with each layer of the detector. The increasing separation significantly improves the detection of weak attacks (i.e., attacks with smaller $\epsilon$ values than stronger attacks like PGD(4,2,10)), which may go undetected at early layers due to low energy separation.

Such attacks can be detected successfully at the final layer. Fig. 2(b) shows the $\mathcal{E}_{s_{nat}}$ distribution obtained at the final layer of the trained detector $\mathcal{D}_T$. The value of $\mathcal{E}_{Th}$ is chosen as the $K^{th}$ percentile of this distribution.

4.2 Lipschitz Constant and Energy Separation

Figure 3: Comparison of the Lipschitz constant of the feed-forward function ($Lip(f_l)$) for three different networks. Note, only the first three layers of the VGG16 network are shown. It is observed that $Lip(f_l)$ increases exponentially as the energy separation (in red) between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ increases. Compared to the detector, the rise in $Lip(f_l)$ is negligible for the other two networks.

The Lipschitz constant has been associated with the error amplification effect in a neural network [20]. The Lipschitz constant of the feed-forward function $f_l$ up to any layer $l$ can be bounded as follows:

$$Lip(f_{l})\leq\prod_{i=1}^{l}Lip(\phi_{i}), \qquad (4)$$

where $\phi_i$ can be a linear, convolutional, max-pooling, or activation layer.
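For intuition, the product bound in Eq. 4 can be estimated layer by layer. The sketch below treats ReLU and pooling as 1-Lipschitz and bounds each convolution by the spectral norm of its reshaped weight matrix, a common but loose surrogate for the true operator norm (our assumption, not the paper's exact procedure):

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(net: nn.Sequential, up_to_layer: int) -> float:
    """Upper bound on Lip(f_l) via Eq. 4, up to and including `up_to_layer`."""
    bound = 1.0
    for module in list(net)[: up_to_layer + 1]:
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.reshape(module.weight.shape[0], -1)
            bound *= torch.linalg.matrix_norm(w, ord=2).item()  # largest singular value
        # ReLU / max-pooling layers contribute a factor of at most 1.
    return bound
```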

Fig. 3 shows the $Lip(f_l)$ values for three different networks: i) a randomly initialized 3-layered detector network, ii) the 3-layered detector after LES training on the CIFAR100 dataset, and iii) a VGG16 network trained on the CIFAR100 dataset using SGD. Interestingly, we observe that the value of $Lip(f_l)$ rises as the energy separation between $\mathcal{E}_{nat}^{l}$ and $\mathcal{E}_{adv}^{l}$ increases. This suggests that the layer-wise training increases the error amplification factor of the detector network.

Prior works that minimize adversarial perturbations in a network [5, 20] have shown that a lower Lipschitz constant helps reduce adversarial perturbations and thus increases adversarial robustness. In contrast, in this work we show that a higher Lipschitz constant amplifies the energy separation and thus yields better adversarial detection.

4.3 Effect of Detector Width

Figure 4: The clean and adversarial energy distributions corresponding to wide and narrow detector network architectures. Wider detector networks (detector A) achieve smaller energy separation compared to narrow detectors (detector C).

Fig. 4 shows the separation between natural and adversarial energies (corresponding to PGD(8,4,10) attacks) computed at the last layer of three detector models. Detector A: Conv(3,64)-ReLU-Conv(64,128)-ReLU-Conv(128,256); Detector B: Conv(3,32)-ReLU-Conv(32,64)-ReLU-Conv(64,128); Detector C: Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32). Clearly, detectors with narrower layers achieve higher energy separation in the final layer compared to wider detector networks. Consequently, light-weight detectors improve the adversarial detection performance. Further, we observe that detectors without batch normalization achieve higher energy separation than detectors with batch normalization layers. In all our experiments throughout the paper, we use detectors without batch-normalization layers.
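For concreteness, detector C (the narrow architecture also used in the experiments of Section 5.1) can be written in a few lines of PyTorch; the kernel size, stride and padding below are illustrative assumptions, as only the channel widths and the absence of batch normalization are fixed here:

```python
import torch.nn as nn

def make_detector_c():
    # Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32), no batch normalization.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
    )
```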

Figure 5: (a) AUC scores corresponding to different strengths of PGD attacks (PGD2: PGD(2,1,10), PGD4: PGD(4,2,10), PGD8: PGD(8,4,10), and PGD16: PGD(16,4,10)) for five different $K$ values. (b) Accuracy values corresponding to the natural data for five different $K$ values. (c) AUC scores for the PGD(4,2,10) attack across the 7 detectors trained using different $\lambda_a$ values.

4.4 Effect of Choosing the Threshold Percentile

In this section, we show how adversarial detection is affected by the choice of $\mathcal{E}_{Th}$. The value of $\mathcal{E}_{Th}$ is chosen as the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution. Fig. 5(a) shows the AUC scores for four different strengths of PGD attacks across five different $K$ values on the CIFAR100 dataset. It is observed that lower values of $K$ yield higher AUC scores for weak attacks (such as PGD2), while the AUC scores for stronger attacks are lower. Additionally, at lower $K$ values, the accuracy (see Section 2.2) is lower, as seen in Fig. 5(b). This is because at low $K$ values, more clean data samples are misclassified as adversarial. At extremely high values of $K$ ($K$ = 98), the AUC scores for weak attacks fall significantly; however, a higher accuracy is achieved. In both Fig. 5(a) and Fig. 5(b), $S_T$ = ResNet18, $S_V$ = MobileNet-v2 and $M_C$ = VGG19. Evidently, choosing the $K$ value is a trade-off between the detection of weaker attacks and accuracy. Therefore, $K$ is a design choice that can be set according to the designer's priority. In this work, we choose $K$ = 95 for all our experiments.

4.5 Selecting λn\lambda_{n} and λa\lambda_{a}

$\lambda_n$ and $\lambda_a$ are hyperparameters that denote the desired natural and adversarial energies for a particular layer $l$ of the detector. We show how the choice of $\lambda_n$ and $\lambda_a$ affects the detector performance. For this, we train 7 different detectors with different $\lambda_a$ values: D1(0.3, 0.7, 1.7) (with $\lambda_a$ = 0.3, 0.7, 1.7 for layers 1, 2, 3, respectively), D2(0.5, 0.9, 1.9), D3(0.7, 1.1, 2.1), D4(0.9, 1.3, 2.3), D5(1.1, 1.5, 2.5), D6(1.3, 1.7, 2.7), and D7(1.5, 1.9, 2.9). For all the detectors, $\lambda_n$ = 0.1. As seen in Fig. 5(c), the detection capability increases from D1 to D4 as $\lambda_a$ increases and decreases beyond D4. However, the change in AUC scores across the different sets of $\lambda_a$ values is marginal. We find that all detector models (D1-D7) are good detectors for the CIFAR10, CIFAR100 and TinyImagenet datasets (see Supplementary Material).

5 Results

5.1 Experimental Setup

We use three image datasets for our experiments: CIFAR10 [17], CIFAR100 [17], and TinyImagenet [8]. Both the training and validation data are scaled to (0,1) in all our experiments. For generating adversarial attacks, we use the torchattacks library [16].
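For example, the PGD(8,4,10) adversarial data used for LES training can be generated roughly as in the sketch below (model_st stands for the surrogate model $S_T$ and is an assumed variable name; the call pattern reflects our understanding of the torchattacks API):

```python
import torchattacks

def make_pgd_adversaries(model_st, images, labels):
    """Generate PGD(8,4,10) adversaries from a natural mini-batch using S_T."""
    attack = torchattacks.PGD(model_st, eps=8/255, alpha=4/255, steps=10)
    return attack(images, labels)
```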

For all our experiments, we use a 3-layered detector with the following architecture: Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32). For this detector, LES training consists of 3 stages and is performed with adversarial data created by the PGD(8,4,10) attack and model $S_T$. For the CIFAR10 and CIFAR100 datasets, LES training is carried out with learning rates of 0.005, 0.005 and 0.001 in the three stages, respectively, while for the TinyImagenet dataset the learning rates are 0.003, 0.001 and 0.001. During LES training, we use $\lambda_n$ = 0.1 and $\lambda_a$ = (0.9, 1.3, 2.3). LES training is carried out for 500 epochs in each stage. Due to the small detector network size, the training time is small (1 to 2 GPU hours for the 3-layered detector). Additionally, the models $S_T$, $S_V$ and $M_C$ are trained on the respective datasets using the SGD algorithm. All experiments are conducted using the PyTorch 1.5 platform with an Nvidia RTX 2080Ti GPU backend.

5.2 Performance on Different Adversarial Attacks

Attacks →                     FGSM   PGD4   PGDL2  C&W   SQR   APGD  AUTO  GN    FAB

Black-Box ($S_T$: ResNet18, $S_V$: MobileNet-v2, $M_C$: VGG19)
  No Detector    Error        66     54     91     63    88    53    69    77    78
                 Accuracy     59
  With Detector  Error        0      0      0      0     23    28    28    0     72
                 Accuracy     56
                 AUC Score    0.98   0.97   0.98   0.79  0.79  0.78  0.79  0.98  0.5

White-Box ($S_T$: ResNet18, $S_V$: VGG19, $M_C$: VGG19)
  No Detector    Error        82     95     95     100   99    92    100   77    99
                 Accuracy     59
  With Detector  Error        0      0      19     39    39    39    39    0     92
                 Accuracy     56
                 AUC Score    0.98   0.97   0.98   0.79  0.8   0.78  0.8   0.98  0.5

Table 2: Table showing the AUC score, error and accuracy values for a single detector model subject to white-box and black-box attacks. Note, the accuracy value is detector specific and does not depend on the type of attack.
Table 3: Table showing the AUC scores for different detectors trained on CIFAR100. The detectors have high detection capability for different adversarial attacks irrespective of the architecture of the $S_V$ and $S_T$ models.

                 Detector 1 ($S_T$: MobileNet-V2)      Detector 2 ($S_T$: ResNet18)           Detector 3 ($S_T$: VGG16)
Attacks ↓        $S_V$: ResNet18   $S_V$: VGG16        $S_V$: VGG16   $S_V$: MobileNet-V2     $S_V$: ResNet18   $S_V$: MobileNet-V2
FGSM [18]        0.98              0.98                0.98           0.98                    0.98              0.98
BIM [18]         0.98              0.97                0.98           0.98                    0.98              0.98
PGD8 [21]        0.98              0.98                0.98           0.98                    0.98              0.98
PGD4 [21]        0.97              0.96                0.97           0.97                    0.97              0.97
PGDL2 [21]       0.87              0.83                0.85           0.98                    0.92              0.98
FFGSM [27]       0.98              0.98                0.98           0.98                    0.98              0.98
TPGD [33]        0.98              0.97                0.98           0.98                    0.98              0.98
MIFGSM [9]       0.98              0.97                0.98           0.98                    0.98              0.98
DIFGSM [29]      0.97              0.97                0.98           0.98                    0.98              0.98
C&W [4]          0.75              0.78                0.77           0.79                    0.75              0.79
SQR [3]          0.75              0.77                0.77           0.79                    0.75              0.79
APGD [7]         0.74              0.74                0.75           0.78                    0.74              0.78
AUTO [7]         0.75              0.77                0.77           0.79                    0.75              0.79
GN               0.98              0.98                0.98           0.98                    0.98              0.98

In Table 2, we consider an LES-trained detector on the CIFAR100 dataset with $S_T$ = ResNet18 and classifier model $M_C$ = VGG19. We consider white-box and black-box scenarios for validating the performance of the given detector. In a white-box attack, the attacker has complete knowledge of the classifier model, while in a black-box attack, the attacker has no information about the classifier model. To this effect, we use $S_V$ = MobileNet-v2 and $S_V$ = VGG19 for black-box and white-box attacks, respectively. Note, the premise of this paper is to train an adversarial detector and perform adversarial detection without any classifier model information. The white-box scenario considered here pertains only to the generation of attacks (i.e., $S_V$ = VGG19) and not to detection or LES training.

Evidently, the detector has a high AUC score across a wide range of gradient, score and Gaussian noise attacks (attack parameters are shown in the Supplementary Material). Consequently, adding the detector significantly lowers the error (see Section 2.2) of the classifier model compared to the “No Detector” case. Further, the reduction is substantially higher for white-box attacks than for black-box attacks. However, we find that adding the detector leads to a 3% drop in accuracy (see Section 2.2 and Section 4.4) compared to the “No Detector” case.

Note that the detector fails to detect decision-based attacks like the FAB attack. This is because FAB adds minimal perturbations to the natural input, which does not change the $\mathcal{E}_{adv}$ value significantly. Consequently, a high error value is observed.

Interestingly, the detector shown in Table 2 demonstrates model-agnostic behavior. For example, it can identify adversarial inputs generated using both $S_V$ = VGG19 and $S_V$ = MobileNet-v2. To further illustrate this model-agnostic property, we create three detectors: Detector 1 with $S_T$ = MobileNet-v2, Detector 2 with $S_T$ = ResNet18, and Detector 3 with $S_T$ = VGG16. All the detectors are trained using LES training. Table 3 shows the AUC scores of the detectors under adversarial attacks generated using $S_V$ models that have entirely different network architectures from the model $S_T$ used in training. Evidently, the detectors achieve high performance over different adversarial attacks and are agnostic to the $S_T$ and $S_V$ models used during training and attack generation, respectively. Note, AUC scores for the FAB attack are not shown, as our detection approach cannot identify FAB attacks. Similar observations hold for the CIFAR10 and TinyImagenet datasets and are shown in the Supplementary Material.

5.3 Comparison with Prior Works

Table 4: Table comparing the AUC scores, memory and computation overhead of our method with prior state-of-the-art adversarial detection works. AUC scores that are not reported by prior works are not shown. Weak attacks: FGSM and PGD with $\epsilon$: 4/255; strong attacks: FGSM and PGD with $\epsilon$: 16/255.

Work                  Dataset       FGSM (4/255)  PGD (4/255)   FGSM (16/255)  PGD (16/255)  Parameters  Operations
Metzen et al. [22]    CIFAR10       1             0.96          Not Reported   Not Reported  500k        1.4M
Yin et al. [32]       CIFAR10       Not Reported  Not Reported  Not Reported   0.953         213k        6.04M
Moitra et al. [23]    CIFAR10       0.85          0.88          0.98           0.895         500k        1.7M
Sterneck et al. [25]  CIFAR10       0.99          0.998         1              1             525k        4.7M
Xu et al. [30]        CIFAR10       0.208         0.505         Not Reported   Not Reported  10G         10G
Gong et al. [11]      CIFAR10       0.003         Not Reported  1              Not Reported  64k         477k
Ours                  CIFAR10       0.97          0.98          0.98           0.98          5.9k        500k
Sterneck et al. [25]  CIFAR100      0.95          0.99          0.99           1             525k        4.7M
Moitra et al. [23]    CIFAR100      0.6           0.64          0.98           0.99          500k        1.7M
Ours                  CIFAR100      0.96          0.97          0.98           0.98          5.9k        500k
Moitra et al. [23]    TinyImagenet  0.52          0.56          0.84           0.65          500k        1.7M
Ours                  TinyImagenet  0.97          0.98          0.98           0.98          5.9k        147k

Table 4 compares the performance and cost-effectiveness of our proposed method with prior state-of-the-art adversarial detection works. For comparison, detectors for the different datasets are created using LES training with $S_T$ = ResNet18 and $S_V$ = MobileNet-v2. Clearly, we achieve detection performance comparable to prior adversarial detection approaches. It must be noted that our work does not aim to outperform prior adversarial detection works. Instead, the striking feature of our approach is that it overcomes the dependence of prior adversarial detection works on the classifier model for training and adversarial detection.

Computational overhead: Besides high adversarial detection performance, our method requires 10-100x fewer operations and parameters for adversarial detection compared to all the prior works. Here, the number of operations is estimated as the total number of dot-product computations across all layers of the detector. Note, prior works use a larger number of detectors and larger detector sizes. Further, as their approaches are classifier model dependent, their detectors are always implemented along with the classifier model, leading to high memory and computation overhead. The low memory and computation overhead makes our approach suitable for deployment in resource-constrained systems.

5.4 LES-Training with Limited Training Data

Figure 6: AUC scores corresponding to different adversarial attacks for detectors created using 10%, 40% and 60% of the CIFAR100 dataset.

In this section, we evaluate the performance of detectors created with limited access to the actual dataset. For this, we train 3 detectors on subsets containing 1) 10%, 2) 40% and 3) 60% of the CIFAR100 dataset. The subsets are created by random sampling from the actual dataset. For reference, we also show the performance of a detector trained on the full dataset. In all cases, the detectors are created with $S_T$ = ResNet18 and tested on attacks generated using $S_V$ = MobileNet-v2. It can be observed that detectors trained with just 40% of the training data achieve AUC scores comparable to the detector trained on the full data. Interestingly, for some attacks (such as GN), merely 10% of the training data is sufficient for achieving high performance. Please refer to the Supplementary Material for similar results on the TinyImagenet and CIFAR10 datasets.

5.5 Transferability Across Datasets

In this section, we explore the following question: can a detector trained on dataset A (source dataset) be used to detect adversaries from another dataset B (target dataset)? Methodology: LES training is performed to create a detector network on the source dataset. Following this, a sample dataset $s_{nat,target}$ is drawn from the target dataset. Through our experiments, we find that 200 samples of $s_{nat,target}$ data are sufficient for transferability to the target dataset. The $s_{nat,target}$ data and the LES-trained detector are used to generate the $\mathcal{E}_{s_{nat,target}}$ distribution, and $\mathcal{E}_{Th}$ is chosen as its 95th percentile. A small sketch of this re-calibration step is given below.
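This sketch reuses the hypothetical fit_threshold helper from Section 4.1; the dataset object and sample-size handling are illustrative:

```python
import torch

@torch.no_grad()
def transfer_threshold(detector, target_dataset, n_samples=200, k=95):
    """Re-calibrate E_Th on a target dataset from ~200 natural samples."""
    loader = torch.utils.data.DataLoader(target_dataset, batch_size=n_samples, shuffle=True)
    x_sample, _ = next(iter(loader))
    return fit_threshold(detector, x_sample, k=k)   # helper from the earlier sketch
```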

Figure 7: AUC scores for a detector trained with source dataset TinyImagenet and transferred to target datasets CIFAR10 and CIFAR100.

Transferability of detectors trained on TinyImagenet: We train a detector, D-Tiny, with source dataset TinyImagenet (shown in Fig. 7). The detector is trained using LES training with an $S_T$ = ResNet18 network. D-Tiny is transferred to the target datasets CIFAR100 and CIFAR10 using the methodology explained above, with $S_V$ = MobileNet-v2. Evidently, D-Tiny, despite being trained on the TinyImagenet dataset, transfers well to both CIFAR10 and CIFAR100. For example, D-Tiny achieves performance against attacks on the CIFAR100 dataset similar to that of Detector 1, Detector 2 and Detector 3 from Table 3.

Figure 8: A detector trained on 40% of TinyImagenet successfully transfers to CIFAR100 and CIFAR10.

Interestingly, even detectors trained on smaller dataset like CIFAR100 can transfer reasonably to a larger dataset such as TinyImagenet (see Supplementary Material).

In Fig. 8, we evaluate the transferability of a detector, D-Tiny40, trained on 40% of the source dataset TinyImagenet and transferred to the target datasets CIFAR100 and CIFAR10. We find that even detectors trained with limited access to the TinyImagenet dataset transfer successfully. Similar experiments with other datasets are shown in the Supplementary Material.

6 Conclusion

In this work, we propose a classifier model-independent, energy separation-based adversarial detection method that does not require any access to the classifier model. We achieve comparable detection performance on a wide range of adversarial attacks at an extremely small memory and computational overhead compared to prior works. Moreover, our method requires less data than prior works while achieving strong performance, and the detector is transferable across different datasets. However, our method has a few limitations. Although our detection is classifier model agnostic, it still requires access to the data. While the data dependency of our approach is lower than that of prior works, an interesting future direction is to use synthetic data during LES training and transfer to real-world data without accessing any of it, which would make our detection approach more robust. Further, while our approach successfully detects a suite of attacks, it fails to detect decision-based adversarial attacks. Future work can focus on finding more sophisticated energy functions to detect such attacks.

References

  • [1] Ahuja, N.A., Ndiour, I., Kalyanpur, T., Tickoo, O.: Probabilistic modeling of deep features for out-of-distribution and adversarial detection. arXiv preprint arXiv:1909.11786 (2019)
  • [2] Akhtar, N., Mian, A.: Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access 6, 14410–14430 (2018)
  • [3] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: European Conference on Computer Vision. pp. 484–501. Springer (2020)
  • [4] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). pp. 39–57. IEEE (2017)
  • [5] Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., Usunier, N.: Parseval networks: Improving robustness to adversarial examples. In: International Conference on Machine Learning. pp. 854–863. PMLR (2017)
  • [6] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: International Conference on Machine Learning. pp. 2196–2205. PMLR (2020)
  • [7] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: International conference on machine learning. pp. 2206–2216. PMLR (2020)
  • [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  • [9] Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J.: Boosting adversarial attacks with momentum. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 9185–9193 (2018)
  • [10] George, G.C., Moitra, A., Caculo, S., Prince, A.A.: Efficient architecture for implementation of hermite interpolation on fpga. In: 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP). pp. 7–12. IEEE (2018)
  • [11] Gong, Z., Wang, W., Ku, W.S.: Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960 (2017)
  • [12] Grosse, K., Manoharan, P., Papernot, N., Backes, M., McDaniel, P.: On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017)
  • [13] Guo, C., Rana, M., Cisse, M., Van Der Maaten, L.: Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017)
  • [14] He, Z., Rakin, A.S., Fan, D.: Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 588–597 (2019)
  • [15] Huang, B., Wang, Y., Wang, W.: Model-agnostic adversarial detection by random perturbations. In: IJCAI. pp. 4689–4696 (2019)
  • [16] Kim, H.: Torchattacks: A pytorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950 (2020)
  • [17] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [18] Kurakin, A., Goodfellow, I., Bengio, S., et al.: Adversarial examples in the physical world (2016)
  • [19] Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31 (2018)
  • [20] Lin, J., Gan, C., Han, S.: Defensive quantization: When efficiency meets robustness. arXiv preprint arXiv:1904.08444 (2019)
  • [21] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  • [22] Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267 (2017)
  • [23] Moitra, A., Panda, P.: Detectx–adversarial input detection using current signatures in memristive xbar arrays. IEEE Transactions on Circuits and Systems I: Regular Papers (2021)
  • [24] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia conference on computer and communications security. pp. 506–519 (2017)
  • [25] Sterneck, R., Moitra, A., Panda, P.: Noise sensitivity-based energy efficient and robust adversary detection in neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021)
  • [26] Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K.: Finn: A framework for fast, scalable binarized neural network inference. In: Proceedings of the 2017 ACM/SIGDA international symposium on field-programmable gate arrays. pp. 65–74 (2017)
  • [27] Wong, E., Rice, L., Kolter, J.Z.: Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994 (2020)
  • [28] Xie, C., Wang, J., Zhang, Z., Ren, Z., Yuille, A.: Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991 (2017)
  • [29] Xie, C., Zhang, Z., Zhou, Y., Bai, S., Wang, J., Ren, Z., Yuille, A.L.: Improving transferability of adversarial examples with input diversity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2730–2739 (2019)
  • [30] Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017)
  • [31] Yin, S., Jiang, Z., Seo, J.S., Seok, M.: Xnor-sram: In-memory computing sram macro for binary/ternary deep neural networks. IEEE Journal of Solid-State Circuits 55(6), 1733–1743 (2020)
  • [32] Yin, X., Kolouri, S., Rohde, G.K.: Gat: Generative adversarial training for adversarial example detection and robust classification. In: International conference on learning representations (2019)
  • [33] Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., Jordan, M.: Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning. pp. 7472–7482. PMLR (2019)