

Paper ID 3772
Department of Electrical Engineering, Yale University, New Haven, CT, USA
{abhishek.moitra, youngeun.kim, priya.panda}@yale.edu

Adversarial Detection without Model Information

Abhishek Moitra, Youngeun Kim, and Priyadarshini Panda
(Anonymous ECCV submission)
Abstract

Prior state-of-the-art adversarial detection works are classifier model dependent, i.e., they require classifier model outputs and parameters for training the detector or during adversarial detection. This makes their detection approach classifier model specific. Furthermore, classifier model outputs and parameters might not always be accessible. To this end, we propose a classifier model independent adversarial detection method that uses a simple energy function to distinguish between adversarial and natural inputs. We train a standalone detector, independent of the classifier model, with layer-wise energy separation (LES) training to increase the separation between natural and adversarial energies. With this, we perform energy distribution-based adversarial detection. Our method achieves performance comparable to state-of-the-art detection works (ROC-AUC > 0.9) across a wide range of gradient-based, score-based and Gaussian noise attacks on the CIFAR10, CIFAR100 and TinyImagenet datasets. Furthermore, compared to prior works, our detection approach is light-weight, requires less training data (40% of the original dataset) and is transferable across different datasets. For reproducibility, we provide the layer-wise energy separation training code at https://github.com/Intelligent-Computing-Lab-Yale/Energy-Separation-Training

Keywords:
Adversarial Detection, Privacy-preserving, Energy-efficient Neural Networks

1 Introduction

Deep Neural Networks (DNNs) are vulnerable to adversarial attacks [2, 24], where small, crafted perturbations are added to natural images to fool a classifier model into high-confidence misclassifications. Prior adversarial defense works have focused on improving the prediction of classifier models on adversarial inputs. These works have used techniques such as adversarial training [21], input transformation [13], and randomization [28], among others. Recently, adversarial detection has emerged as a strong defense strategy against adversarial attacks. Here, a “detector” network is trained to identify adversarial and natural inputs in a system [22, 32, 30].

Figure 1: (a) In our proposed approach, we assume that the classifier model's ($M_C$) information (outputs and parameters) is inaccessible for creating the adversarial detector or during adversarial detection. However, we do require access to the dataset during detector training. (b) The detector $\mathcal{D}$ is trained on natural ($x_{nat}$) and adversarial ($x_{adv,t}$) data using layer-wise energy separation (LES) training. The $x_{adv,t}$ data is generated using $x_{nat}$ and model $S_T$. (c) The LES-trained detector can identify adversaries generated using a different model $S_V$. Note, $S_T$ and $S_V$ can have different network architectures from the classifier model $M_C$.

Prior adversarial detection works heavily rely on the classifier model's outputs and parameters for detection. However, this reliance on the classifier model has two major drawbacks: 1) it reduces the generalizability of adversarial detectors, meaning that detector networks are specific to each classifier model and need to be retrained if the classifier model changes; 2) classifier model parameters and outputs might be inaccessible [26, 10, 31]. To this end, we propose an adversarial detection approach with the following assumption: the outputs and parameters of an adversarially vulnerable classifier model $M_C$ are inaccessible for adversarial detection. In other words, no classifier model information is required for creating the detector or during adversarial detection. Fig. 1a compares the assumptions in this work with prior adversarial detection works. Note, although our work does not require classifier model information, we do require access to the data for detection.

In this work, we train a small, standalone detector network $\mathcal{D}$ without any information about $M_C$. First, we propose a simple energy function that maps a high-dimensional input feature space to a one-dimensional output (called the energy). As adversaries are created by adding noise to natural data, we find that the energies corresponding to natural and adversarial data are different. However, this difference alone is not sufficient for reliable adversarial detection. To this end, we employ layer-wise energy separation (LES) training with an energy distance-based loss function to maximize the separation between natural and adversarial energies.

Table 1: Table showing the key differences between prior adversarial detection works and our proposed method (✓: addressed / ✕: not addressed; G: gradient-based, S: score-based, GN: Gaussian noise attacks). Note, A → B signifies that a detector trained on A-type attacks can successfully detect B-type attacks.
Work                                                       Transferability Across Attacks   Transferability Across Datasets   Dataset Access   Classifier Model Access Required?
Metzen et al. [22], Yin et al. [32], Sterneck et al. [25]  G → G                            ✕                                 Full Access      Intermediate activations
Moitra et al. [23], Xu et al. [30], Grosse et al. [12]     G → G                            ✕                                 Full Access      Model training
Huang et al. [15]                                          G → G                            ✕                                 Full Access      Model outputs
Ours                                                       G → G, G → S, G → GN             ✓                                 Partial Access   No access

In Fig. 1b, it can be seen that the detector $\mathcal{D}$ is trained on natural ($x_{nat}$) and adversarial ($x_{adv,t}$) data. Here, $x_{adv,t}$ is created using a model $S_T$ that is different from $M_C$. Interestingly, we find that the LES-trained detector can identify attacks generated from a model $S_V$ that is different from both $S_T$ and $M_C$. We perform extensive analysis on a wide range of gradient-based [21, 4, 9], score-based [3, 7] and decision-based [6] attacks on the CIFAR10, CIFAR100 and TinyImagenet datasets. Through our experiments, we discover certain interesting features of our detection approach that have not been shown in prior works [11, 22, 32, 23]. ➊ First, our approach is agnostic to the model used for generating attacks. For example, a detector trained using $S_T$ = VGG16 can identify attacks generated using $S_V$ = ResNet18. ➋ Compared to prior works that have shown transferability only from gradient-based to gradient-based attacks, our detection approach can transfer across different types of adversarial attacks. For example, a detector trained on gradient-based attacks can detect certain score-based attacks such as Square [3] and Auto-PGD [7]. ➌ Additionally, through our ablation studies, we find that the error amplification effect discussed in [20] aids in improving the energy separation of our detector. Consequently, we find that a higher Lipschitz constant [5] enables better adversarial detection.

As mentioned earlier, our detection approach requires access to the data. To this end, we evaluate the performance of our detection approach with reduced access to the data. ➊ We find that LES training can achieve high performance even with limited data accessibility (e.g., with access to only 40% of the dataset). ➋ We also find that the detector is transferable across different datasets. For instance, a detector constructed using CIFAR100 data can detect adversarial attacks on the CIFAR10 and TinyImagenet datasets. However, a small amount (about 200 data samples) of the CIFAR10 and TinyImagenet datasets is still required for the transfer. Table 1 compares the salient features of our detection work with prior state-of-the-art adversarial detectors.

In summary, the contributions of our work are:

  • We propose a simple and light-weight energy metric to distinguish between clean and adversarial inputs. To perform adversarial detection, we maximize the energy distance between them using an energy distance-based objective function and a layer-wise energy separation (LES) training approach. Our adversarial detection approach does not require any access to the classifier model parameters or outputs.

  • We perform extensive experiments on benchmark datasets like CIFAR10, CIFAR100, and TinyImagenet with state-of-the-art gradient-based [21, 18], score-based [7] and decision-based [6] adversarial attacks. We find our approach yields state-of-the-art detection across different adversarial attacks and datasets.

  • Through our experiments, we find that our detection approach is agnostic to the model used for attack generation. Additionally, we show that a detector trained on gradient-based attacks can detect certain score-based attacks and Gaussian noise attacks.

  • Through extensive experiments, we show that the LES training can achieve high performance with limited access to the dataset. Further, our adversarial detector can transfer across different datasets. Using transferability, a detector trained on one dataset can be reused to detect adversaries created on another dataset.

2 Background

2.1 Adversarial Attacks

In this work, we consider three different kinds of adversarial attacks. Here, we discuss those attacks in detail.

Gradient-based attacks: These attacks require gradients with respect to the input to craft adversaries. The Fast Gradient Sign Method (FGSM) is a simple one-step adversarial attack [18]. Several works have shown that the FGSM attack can be made stronger with momentum (MIFGSM) [9], random initialization (FFGSM) [27], and input diversification (DIFGSM) [29]. In contrast, the Basic Iterative Method (BIM) is an iterative attack proposed in [18]. The BIM attack with random restarts is called the Projected Gradient Descent (PGD) attack [21]. A targeted version of the PGD attack (TPGD) [33] can fool the model into misclassifying an input as a desired class. Other multi-step attacks like Carlini-Wagner (C&W) [4] and PGD-L2 [21] are crafted using the L2-norm distance between the adversarial and natural images.

Score-based attacks: Score-based attacks do not require input gradients to craft adversaries. The Square attack (SQR) [3] uses multiple queries to perturb randomly selected square regions of the input. Other score-based attacks like AutoAttack (AUTO) and Auto-PGD (APGD) craft adversaries by automatically choosing the optimal attack parameters [7].

Decision-based attacks: These attacks craft adversaries based on the decision outputs of a model. The Fast Adaptive Boundary (FAB) attack [6] finds the minimum perturbation required to cause a misclassification. The Gaussian Noise (GN) attack is created by adding Gaussian noise to the input.

2.2 Performance Metrics for Adversarial Detection

To evaluate the adversarial detection performance, we use three metrics: ROC-AUC, Accuracy and Error.

Area Under the ROC Curve (AUC Score): The area under the ROC curve compares the True Positive Rate (TPR) and the False Positive Rate (FPR) of a classifier. A high ROC-AUC score signifies a good classifier [32].

Accuracy and Error: In this work, Accuracy is defined as the fraction of natural inputs that are correctly classified by the classifier model and not rejected by the adversarial detector. Error is defined as the fraction of adversarial inputs that are classified incorrectly by the classifier model and not rejected by the detector. Thus, high accuracy and low error are desirable.
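As a reference, these three metrics can be computed from detector and classifier outputs as in the sketch below (a minimal illustration; the variable names and the use of scikit-learn's roc_auc_score are our assumptions, not the paper's code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(scores_nat, scores_adv, rejected_nat, rejected_adv,
                      clf_correct_nat, clf_correct_adv):
    """scores_*: detector scores per input (higher = more adversarial-looking).
    rejected_*: boolean arrays, True if the detector rejected the input.
    clf_correct_*: boolean arrays, True if the classifier predicted correctly."""
    # ROC-AUC over the detector scores (adversarial inputs = positive class).
    labels = np.concatenate([np.zeros(len(scores_nat)), np.ones(len(scores_adv))])
    auc = roc_auc_score(labels, np.concatenate([scores_nat, scores_adv]))
    # Accuracy: natural inputs that are correctly classified and not rejected.
    accuracy = np.mean(clf_correct_nat & ~rejected_nat)
    # Error: adversarial inputs that are misclassified and not rejected.
    error = np.mean(~clf_correct_adv & ~rejected_adv)
    return auc, accuracy, error
```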

3 Related Works

3.1 Adversarial Classification

Here, the objective is to improve the adversarial classification accuracy of a vulnerable model. Guo et al. [13] proposed input feature transformation using JPEG compression, followed by training on the compressed feature space, to improve the classification performance of the classifier model. Madry et al. [21] proposed adversarial training, in which a classifier model is trained on adversarial and clean data to improve both adversarial and clean classification performance. Following this, several works have used noise injection into parameters [14] and ensemble adversarial training to harden the classifier model against a wide range of attacks. Lin et al. [20] showed that adversarial classification can be improved by reducing error amplification in a network. Hence, they used adversarial training with regularization to constrain the Lipschitz constant of the network to less than unity. In our work, we do not improve the adversarial classification accuracy of the classifier model. Rather, we focus on adversarial detection to distinguish adversarial data from natural data.

3.2 Adversarial Detection

3.2.1 Works requiring classifier model training

The following works require training of the classifier model to perform adversarial detection. Xu et al. [30] propose a method that uses the outputs of multiple classifier models to estimate the difference between natural and adversarial data. Here, the classifier models are trained on natural inputs with different feature squeezing techniques applied at the inputs. Moitra et al. [23] use the features from the first layer of the underlying model to perform adversarial detection. In particular, they perform adversarial detection using hardware signatures in DNN accelerators. Further, several recent works like [11] have shown that adversarial and natural data have different distributions. While Grosse et al. [12] train the classifier model with an additional class label indicating adversarial data, Gong et al. [11] train a separate binary classifier on the natural and adversarial data generated from the classifier model to perform adversarial detection. Lee et al. [19] train the classifier model with a Mahalanobis distance-based classifier and use the Mahalanobis distance to distinguish natural from adversarial data.

3.2.2 Works requiring classifier model outputs

These works show that adversarial and natural inputs can be distinguished based on the intermediate features of the classifier model. Metzen et al. [22] and Sterneck et al. [25] use the intermediate features to train a simple binary classifier for adversarial detection. While Metzen et al. [22] use a heuristic-based method to determine the point of attachment of the detector to the classifier model, Sterneck et al. [25] use a structured metric called adversarial noise sensitivity to do the same. Similarly, Yin et al. [32] use asymmetric adversarial training to train detectors on the intermediate features of the classifier model. Ahuja et al. [1] use the data distributions from the intermediate layers of the classifier model and Gaussian mixture models to perform adversarial detection. Huang et al. [15] use the confidence scores from the classifier model to estimate the relative score difference between clean and adversarial inputs, and additionally recommend training the classifier model on noisy data to improve adversarial detection performance.

Evidently, the prior works discussed in Section 3.1 and Section 3.2.2 require training or the outputs of the classifier model for adversarial detection. In contrast, our work focuses on performing adversarial detection without accessing the classifier model.

4 Layer-wise Energy Separation (LES) Training

4.1 Training Methodology

In this work, we define a simple energy function $\mathcal{E}^l$ for layer $l$, shown in Eq. 1. The energy is defined as the average magnitude of the feature outputs $Z_l^{c,h,w}$ of a particular layer $l$. Here $c$, $h$, and $w$ are the number of output channels, height, and width, respectively, of the feature outputs of layer $l$. In LES training, we train the detector network to maximize the energy separation between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$, the energies corresponding to natural and adversarial inputs, respectively. Note, the energy separation is the difference between the mean values of the $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ distributions.

$$\mathcal{E}^{l}=\frac{1}{chw}\sum_{i=1}^{c}\sum_{j=1}^{h}\sum_{k=1}^{w}\left|Z^{i,j,k}_{l}\right|. \qquad (1)$$
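For reference, Eq. 1 amounts to averaging the absolute activations of a layer; a minimal PyTorch sketch follows (batching over the first dimension is an implementation assumption):

```python
import torch

def layer_energy(z: torch.Tensor) -> torch.Tensor:
    """Energy of Eq. 1: mean absolute value of a layer's feature map.

    z: feature tensor of shape (batch, c, h, w); returns one energy per sample.
    """
    return z.abs().mean(dim=(1, 2, 3))
```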

Training objective: We train the detector using the following objective function:

$$\max_{\theta}\ \big\lVert\mathcal{E}_{nat}^{l}-\mathcal{E}_{adv}^{l}\big\rVert, \qquad (2)$$

corresponding to any layer $l$. Here, $\theta$ denotes the parameters of the detector network. To achieve this, we design an energy separation-based loss function that minimizes the distance between $\mathcal{E}_{nat}^{l}$ ($\mathcal{E}_{adv}^{l}$) and $\lambda_n$ ($\lambda_a$). Here, $\lambda_n$ and $\lambda_a$ are hyper-parameters denoting the desired natural and adversarial energies, respectively. An indicator variable $y$ takes the value 1 for natural inputs and 0 for adversarial inputs.

$$\mathcal{L}=y\,\mathcal{L}_{MSE}(\mathcal{E}_{nat}^{l},\lambda_{n})+(1-y)\,\mathcal{L}_{MSE}(\mathcal{E}_{adv}^{l},\lambda_{a}). \qquad (3)$$
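A corresponding sketch of the loss in Eq. 3, where each sample is pulled toward its target energy ($\lambda_n$ for natural, $\lambda_a$ for adversarial); the default values shown are illustrative:

```python
import torch
import torch.nn.functional as F

def les_loss(energy: torch.Tensor, y: torch.Tensor,
             lambda_n: float = 0.1, lambda_a: float = 0.9) -> torch.Tensor:
    """Eq. 3 for a mini-batch: y (float tensor) is 1 for natural, 0 for adversarial."""
    target = y * lambda_n + (1.0 - y) * lambda_a   # desired energy per sample
    return F.mse_loss(energy, target)
```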

Note, our training objective is loosely based on the current-signature separation in recent work [23]. We would like to highlight that [23] focuses on modifying the classifier model's layers to perform adversarial detection. In contrast, we train the detector independently of the classifier model, which enables high detection scores along with transferability across multiple datasets and attacks.

Dataset generation: The training dataset for the detector contains an equal number of natural ($x_{nat}$) and adversarial ($x_{adv,t}$) samples. Here, $x_{adv,t}$ is generated using the model $S_T$, which has a different network architecture than $M_C$ and is trained on $x_{nat}$ using the stochastic gradient descent (SGD) algorithm.

1: Input: $n$-layered detector $\mathcal{D}$, $x_{nat}$, $x_{adv,t}$, $s_{nat}$
2: Output: $n$-layered trained detector $\mathcal{D}_T$, $\mathcal{E}_{Th}$
3: for all i = 1 to n do
4:     for all j = 1 to $N_{epoch}$ do
5:         Freeze layers $[0, i-1]$
6:         Fetch mini-batches $X_n$ and $X_a$ from $x_{nat}$ and $x_{adv,t}$, respectively
7:         Compute $\mathcal{E}_{nat}^{i}$ and $\mathcal{E}_{adv}^{i}$ on the mini-batch
8:         Compute the loss function using Eq. 3
9:         Optimize layer $i$ using the loss function
10:    end for
11: end for
12: Generate the distribution $\mathcal{E}_{s_{nat}}$ using $s_{nat}$ and $\mathcal{D}_T$
13: $\mathcal{E}_{Th}$ ← $K^{th}$ percentile of $\mathcal{E}_{s_{nat}}$
Algorithm 1: Layer-wise Energy Separation (LES) training algorithm
Figure 2: (a) The separation between the clean and adversarial energy distributions increases with each layer as the LES training proceeds. (b) The value of $\mathcal{E}_{Th}$ is set as the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution.

LES training methodology: Algorithm 1 shows the LES training approach. LES training begins with a randomly initialized $n$-layered detector, where $n$ is a hyper-parameter. Other inputs include the natural samples $x_{nat}$, the adversarial samples $x_{adv,t}$, and $s_{nat}$. $s_{nat}$ is created by randomly selecting 200 data samples from $x_{nat}$.

The training is carried out in multiple stages. In each stage $i$, layers $[0, i-1]$ are frozen and layer $i$ is optimized for energy-distance maximization. In each stage, mini-batches $X_n$ and $X_a$ are fetched from $x_{nat}$ and $x_{adv,t}$, respectively. Following this, the natural and adversarial energies for layer $i$ are computed. These energies are used to compute the loss in Eq. 3, which is then used to optimize the parameters of the $i^{th}$ layer of the detector. As layers $[0, i-1]$ are frozen, the energy separation obtained up to layer $i-1$ is preserved. Thus, training the $i^{th}$ layer further increases the energy separation. A minimal PyTorch sketch of this stage-wise loop is shown below.
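The sketch assumes the layer_energy and les_loss helpers from the earlier snippets and a plain nn.Sequential detector; the optimizer choice and data loading are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

def les_train(detector: nn.Sequential, loader_nat, loader_adv,
              lambdas_a, lambda_n=0.1, epochs=500, lr=5e-3):
    """Stage-wise LES training: one stage per convolutional layer of the detector."""
    conv_ids = [i for i, m in enumerate(detector) if isinstance(m, nn.Conv2d)]
    for stage, layer_id in enumerate(conv_ids):
        # Freeze every layer except the one being trained in this stage.
        for i, m in enumerate(detector):
            for p in m.parameters():
                p.requires_grad = (i == layer_id)
        opt = torch.optim.SGD(detector[layer_id].parameters(), lr=lr)
        for _ in range(epochs):
            for (x_nat, _), (x_adv, _) in zip(loader_nat, loader_adv):
                x = torch.cat([x_nat, x_adv])
                y = torch.cat([torch.ones(len(x_nat)), torch.zeros(len(x_adv))])
                z = detector[: layer_id + 1](x)        # forward up to the current layer
                e = layer_energy(z)                    # Eq. 1
                loss = les_loss(e, y, lambda_n, lambdas_a[stage])  # Eq. 3
                opt.zero_grad()
                loss.backward()
                opt.step()
    return detector
```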

After LES training, we obtain the trained detector $\mathcal{D}_T$. At this step, we use $s_{nat}$ and $\mathcal{D}_T$ to obtain a sample natural energy distribution $\mathcal{E}_{s_{nat}}$, computed at the final layer of $\mathcal{D}_T$. Next, the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution is chosen as the threshold energy $\mathcal{E}_{Th}$. All energy values greater than $\mathcal{E}_{Th}$ are classified as adversarial inputs, and all values below it as natural inputs. A sketch of this thresholding step follows.
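This is again a sketch: layer_energy is the hypothetical helper defined earlier, and s_nat is the 200-sample natural batch from Algorithm 1.

```python
import numpy as np
import torch

@torch.no_grad()
def fit_threshold(detector, s_nat, k=95):
    """E_Th: the K-th percentile of the sample natural energy distribution."""
    energies = layer_energy(detector(s_nat)).cpu().numpy()
    return float(np.percentile(energies, k))

@torch.no_grad()
def is_adversarial(detector, x, e_th):
    """Flag inputs whose final-layer energy exceeds the threshold E_Th."""
    return layer_energy(detector(x)) > e_th
```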

Demonstrating the Efficacy of LES training and Choosing $\mathcal{E}_{Th}$: Fig. 2(a) shows the energy separation between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ obtained after each layer of a three-layered detector trained on the CIFAR100 dataset in the layer-wise manner. The adversarial inputs correspond to a PGD(4,2,10) attack (PGD with parameters $\epsilon$: 4/255, $\alpha$: 2/255, steps: 10) on the CIFAR100 dataset. It can be seen that the energy separation increases with each layer of the detector. The increasing separation significantly improves the detection of weak attacks (i.e., attacks with smaller $\epsilon$ values than stronger attacks like PGD(4,2,10)), which may go undetected at early layers due to low energy separation.

Such attacks can be detected successfully at the final layer. Fig. 2(b) shows the $\mathcal{E}_{s_{nat}}$ distribution obtained at the final layer of the trained detector $\mathcal{D}_T$. The value of $\mathcal{E}_{Th}$ is chosen as the $K^{th}$ percentile of this distribution.

4.2 Lipschitz Constant and Energy Separation

Figure 3: Comparison of the Lipschitz constant of the feed-forward function ($Lip(f_l)$) for three different networks. Note, only the first three layers of the VGG16 network are shown. It is observed that $Lip(f_l)$ increases exponentially as the energy separation (in red) between $\mathcal{E}_{nat}$ and $\mathcal{E}_{adv}$ increases. Compared to the detector, the rise in $Lip(f_l)$ is negligible for the other two networks.

The Lipschitz constant has been associated with the error amplification effect in a neural network [20]. The Lipschitz constant of the feed-forward function $f_l$ up to any layer $l$ can be bounded as follows:

$$Lip(f_{l})\leq\prod_{i=1}^{l}Lip(\phi_{i}), \qquad (4)$$

where $\phi_i$ can be a linear, convolutional, max-pooling, or activation layer.
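For intuition, the product bound in Eq. 4 can be estimated layer by layer. The sketch below treats ReLU and pooling as 1-Lipschitz and bounds each convolution by the spectral norm of its reshaped weight matrix, a common but loose surrogate for the true operator norm (our assumption, not the paper's exact procedure):

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(net: nn.Sequential, up_to_layer: int) -> float:
    """Upper bound on Lip(f_l) via Eq. 4, up to and including `up_to_layer`."""
    bound = 1.0
    for module in list(net)[: up_to_layer + 1]:
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.reshape(module.weight.shape[0], -1)
            bound *= torch.linalg.matrix_norm(w, ord=2).item()  # largest singular value
        # ReLU / max-pooling layers contribute a factor of at most 1.
    return bound
```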

Fig. 3 shows the $Lip(f_l)$ values for three different networks: i) a randomly initialized 3-layered detector network, ii) the 3-layered detector after LES training on the CIFAR100 dataset, and iii) a VGG16 network trained on the CIFAR100 dataset using SGD. Interestingly, we observe that the value of $Lip(f_l)$ rises as the energy separation between $\mathcal{E}_{nat}^{l}$ and $\mathcal{E}_{adv}^{l}$ increases. This suggests that the layer-wise training increases the error amplification factor of the detector network.

Prior works that minimize adversarial perturbations in a network [5, 20] have shown that a lower Lipschitz constant helps reduce adversarial perturbations and thus increases adversarial robustness. In contrast, in this work we show that a higher Lipschitz constant amplifies the energy separation and thus yields better adversarial detection.

4.3 Effect of Detector Width

Figure 4: The clean and adversarial energy distributions corresponding to wide and narrow detector network architectures. Wider detector networks (detector A) achieve smaller energy separation compared to narrow detectors (detector C).

Fig. 4 shows the separation between natural and adversarial energies (corresponding to PGD(8,4,10) attacks) computed at the last layer of three detector models. Detector A: Conv(3,64)-ReLU-Conv(64,128)-ReLU-Conv(128,256); Detector B: Conv(3,32)-ReLU-Conv(32,64)-ReLU-Conv(64,128); Detector C: Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32). Clearly, detectors with narrower layers achieve higher energy separation in the final layer compared to wider detector networks. Consequently, light-weight detectors improve the adversarial detection performance. Further, we observe that detectors without batch normalization achieve higher energy separation than detectors with batch normalization layers. In all our experiments throughout the paper, we use detectors without batch-normalization layers.
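For concreteness, detector C (the narrow architecture also used in the experiments of Section 5.1) can be written in a few lines of PyTorch; the kernel size, stride and padding below are illustrative assumptions, as only the channel widths and the absence of batch normalization are fixed here:

```python
import torch.nn as nn

def make_detector_c():
    # Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32), no batch normalization.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
    )
```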

Figure 5: (a) AUC scores corresponding to different strengths of PGD attacks (PGD2: PGD(2,1,10), PGD4: PGD(4,2,10), PGD8: PGD(8,4,10), and PGD16: PGD(16,4,10)) for five different $K$ values. (b) Accuracy values corresponding to the natural data for five different $K$ values. (c) AUC scores for the PGD(4,2,10) attack across the 7 detectors trained using different $\lambda_a$ values.

4.4 Effect of Choosing the Threshold Percentile

In this section, we show how adversarial detection is affected by the choice of $\mathcal{E}_{Th}$. The value of $\mathcal{E}_{Th}$ is chosen as the $K^{th}$ percentile of the $\mathcal{E}_{s_{nat}}$ distribution. Fig. 5(a) shows the AUC scores for four different strengths of PGD attacks across five different $K$ values on the CIFAR100 dataset. It is observed that lower values of $K$ yield higher AUC scores for weak attacks (such as PGD2), while the AUC scores for stronger attacks are lower. Additionally, at lower $K$ values, the accuracy (see Section 2.2) is lower, as seen in Fig. 5(b). This is because at low $K$ values, more clean data samples are misclassified as adversarial. At extremely high values of $K$ ($K$ = 98), the AUC scores for weak attacks fall significantly; however, a higher accuracy is achieved. In both Fig. 5(a) and Fig. 5(b), $S_T$ = ResNet18, $S_V$ = MobileNet-v2 and $M_C$ = VGG19. Evidently, choosing the $K$ value is a trade-off between the detection of weaker attacks and accuracy. Therefore, $K$ is a design choice that can be set according to the designer's priority. In this work, we choose $K$ = 95 for all our experiments.

4.5 Selecting λn\lambda_{n} and λa\lambda_{a}

$\lambda_n$ and $\lambda_a$ are hyperparameters that denote the desired natural and adversarial energies for a particular layer $l$ of the detector. We show how the choice of $\lambda_n$ and $\lambda_a$ affects the detector performance. For this, we train 7 different detectors with different $\lambda_a$ values: D1(0.3, 0.7, 1.7) (with $\lambda_a$ = 0.3, 0.7, 1.7 for layers 1, 2, 3, respectively), D2(0.5, 0.9, 1.9), D3(0.7, 1.1, 2.1), D4(0.9, 1.3, 2.3), D5(1.1, 1.5, 2.5), D6(1.3, 1.7, 2.7), and D7(1.5, 1.9, 2.9). For all the detectors, $\lambda_n$ = 0.1. As seen in Fig. 5(c), the detection capability increases from D1 to D4 as $\lambda_a$ increases and decreases beyond D4. However, the change in AUC scores across the different sets of $\lambda_a$ values is marginal. We find that all detector models (D1-D7) are good detectors for the CIFAR10, CIFAR100 and TinyImagenet datasets (see Supplementary Material).

5 Results

5.1 Experimental Setup

We use three image datasets for our experiments: CIFAR10 [17], CIFAR100 [17], and TinyImagenet [8]. Both the training and validation data are scaled to (0,1) in all our experiments. For generating adversarial attacks, we use the torchattacks library [16].
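For example, the PGD(8,4,10) adversarial data used for LES training can be generated roughly as in the sketch below (model_st stands for the surrogate model $S_T$ and is an assumed variable name; the call pattern reflects our understanding of the torchattacks API):

```python
import torchattacks

def make_pgd_adversaries(model_st, images, labels):
    """Generate PGD(8,4,10) adversaries from a natural mini-batch using S_T."""
    attack = torchattacks.PGD(model_st, eps=8/255, alpha=4/255, steps=10)
    return attack(images, labels)
```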

For all our experiments, we use a 3-layered detector with the following architecture: Conv(3,8)-ReLU-Conv(8,16)-ReLU-Conv(16,32). For this detector, LES training consists of 3 stages and is performed with adversarial data created by the PGD(8,4,10) attack and model $S_T$. For the CIFAR10 and CIFAR100 datasets, LES training is carried out with learning rates of 0.005, 0.005 and 0.001 in the three stages, respectively, while for the TinyImagenet dataset the learning rates are 0.003, 0.001 and 0.001. During LES training, we use $\lambda_n$ = 0.1 and $\lambda_a$ = (0.9, 1.3, 2.3). LES training is carried out for 500 epochs in each stage. Due to the small detector network size, the training time is small (1 to 2 GPU hours for the 3-layered detector). Additionally, the models $S_T$, $S_V$ and $M_C$ are trained on the respective datasets using the SGD algorithm. All experiments are conducted using the PyTorch 1.5 platform with an Nvidia RTX 2080Ti GPU backend.

5.2 Performance on Different Adversarial Attacks

Attacks →                     FGSM   PGD4   PGDL2  C&W   SQR   APGD  AUTO  GN    FAB

Black-Box ($S_T$: ResNet18, $S_V$: MobileNet-v2, $M_C$: VGG19)
  No Detector    Error        66     54     91     63    88    53    69    77    78
                 Accuracy     59
  With Detector  Error        0      0      0      0     23    28    28    0     72
                 Accuracy     56
                 AUC Score    0.98   0.97   0.98   0.79  0.79  0.78  0.79  0.98  0.5

White-Box ($S_T$: ResNet18, $S_V$: VGG19, $M_C$: VGG19)
  No Detector    Error        82     95     95     100   99    92    100   77    99
                 Accuracy     59
  With Detector  Error        0      0      19     39    39    39    39    0     92
                 Accuracy     56
                 AUC Score    0.98   0.97   0.98   0.79  0.8   0.78  0.8   0.98  0.5

Table 2: Table showing the AUC score, error and accuracy values for a single detector model subject to white-box and black-box attacks. Note, the accuracy value is detector specific and does not depend on the type of attack.
Table 3: Table showing the AUC scores for different detectors trained on CIFAR100. The detectors have high detection capability for different adversarial attacks irrespective of the architecture of the $S_V$ and $S_T$ models.

                 Detector 1 ($S_T$: MobileNet-V2)      Detector 2 ($S_T$: ResNet18)           Detector 3 ($S_T$: VGG16)
Attacks ↓        $S_V$: ResNet18   $S_V$: VGG16        $S_V$: VGG16   $S_V$: MobileNet-V2     $S_V$: ResNet18   $S_V$: MobileNet-V2
FGSM [18]        0.98              0.98                0.98           0.98                    0.98              0.98
BIM [18]         0.98              0.97                0.98           0.98                    0.98              0.98
PGD8 [21]        0.98              0.98                0.98           0.98                    0.98              0.98
PGD4 [21]        0.97              0.96                0.97           0.97                    0.97              0.97
PGDL2 [21]       0.87              0.83                0.85           0.98                    0.92              0.98
FFGSM [27]       0.98              0.98                0.98           0.98                    0.98              0.98
TPGD [33]        0.98              0.97                0.98           0.98                    0.98              0.98
MIFGSM [9]       0.98              0.97                0.98           0.98                    0.98              0.98
DIFGSM [29]      0.97              0.97                0.98           0.98                    0.98              0.98
C&W [4]          0.75              0.78                0.77           0.79                    0.75              0.79
SQR [3]          0.75              0.77                0.77           0.79                    0.75              0.79
APGD [7]         0.74              0.74                0.75           0.78                    0.74              0.78
AUTO [7]         0.75              0.77                0.77           0.79                    0.75              0.79
GN               0.98              0.98                0.98           0.98                    0.98              0.98

In Table 2, we consider an LES-trained detector on the CIFAR100 dataset with $S_T$ = ResNet18 and classifier model $M_C$ = VGG19. We consider white-box and black-box scenarios for validating the performance of the given detector. In a white-box attack, the attacker has complete knowledge of the classifier model, while in a black-box attack, the attacker has no information about the classifier model. To this effect, we use $S_V$ = MobileNet-v2 and $S_V$ = VGG19 for black-box and white-box attacks, respectively. Note, the premise of this paper is to train an adversarial detector and perform adversarial detection without any classifier model information. The white-box scenario considered here pertains only to the generation of attacks (i.e., $S_V$ = VGG19) and not to detection or LES training.

Evidently, the detector has a high AUC score across a wide range of gradient, score and Gaussian noise attacks (attack parameters are shown in the Supplementary Material). Consequently, adding the detector significantly lowers the error (see Section 2.2) of the classifier model compared to the “No Detector” case. Further, the reduction is substantially higher for white-box attacks than for black-box attacks. However, we find that adding the detector leads to a 3% drop in accuracy (see Section 2.2 and Section 4.4) compared to the “No Detector” case.

Note that the detector fails to detect decision-based attacks like the FAB attack. This is because FAB adds minimal perturbations to the natural input, which does not change the $\mathcal{E}_{adv}$ value significantly. Consequently, a high error value is observed.

Interestingly, the detector shown in Table 2 demonstrates model-agnostic behavior. For example, it can identify adversarial inputs generated using both $S_V$ = VGG19 and $S_V$ = MobileNet-v2. To further illustrate this model-agnostic property, we create three detectors: Detector 1 with $S_T$ = MobileNet-v2, Detector 2 with $S_T$ = ResNet18, and Detector 3 with $S_T$ = VGG16. All the detectors are trained using LES training. Table 3 shows the AUC scores of the detectors under adversarial attacks generated using $S_V$ models that have entirely different network architectures from the model $S_T$ used in training. Evidently, the detectors achieve high performance over different adversarial attacks and are agnostic to the $S_T$ and $S_V$ models used during training and attack generation, respectively. Note, AUC scores for the FAB attack are not shown, as our detection approach cannot identify FAB attacks. Similar observations hold for the CIFAR10 and TinyImagenet datasets and are shown in the Supplementary Material.

5.3 Comparison with Prior Works

Table 4: Table comparing the AUC scores, memory and computation overhead of our method with prior state-of-the-art adversarial detection works. AUC scores that are not reported by prior works are not shown. Weak attacks: FGSM and PGD with $\epsilon$: 4/255; strong attacks: FGSM and PGD with $\epsilon$: 16/255.

Work                  Dataset       FGSM (4/255)  PGD (4/255)   FGSM (16/255)  PGD (16/255)  Parameters  Operations
Metzen et al. [22]    CIFAR10       1             0.96          Not Reported   Not Reported  500k        1.4M
Yin et al. [32]       CIFAR10       Not Reported  Not Reported  Not Reported   0.953         213k        6.04M
Moitra et al. [23]    CIFAR10       0.85          0.88          0.98           0.895         500k        1.7M
Sterneck et al. [25]  CIFAR10       0.99          0.998         1              1             525k        4.7M
Xu et al. [30]        CIFAR10       0.208         0.505         Not Reported   Not Reported  10G         10G
Gong et al. [11]      CIFAR10       0.003         Not Reported  1              Not Reported  64k         477k
Ours                  CIFAR10       0.97          0.98          0.98           0.98          5.9k        500k
Sterneck et al. [25]  CIFAR100      0.95          0.99          0.99           1             525k        4.7M
Moitra et al. [23]    CIFAR100      0.6           0.64          0.98           0.99          500k        1.7M
Ours                  CIFAR100      0.96          0.97          0.98           0.98          5.9k        500k
Moitra et al. [23]    TinyImagenet  0.52          0.56          0.84           0.65          500k        1.7M
Ours                  TinyImagenet  0.97          0.98          0.98           0.98          5.9k        147k

Table 4 compares the performance and cost-effectiveness of our proposed method with prior state-of-the-art adversarial detection works. For comparison, detectors for the different datasets are created using LES training with $S_T$ = ResNet18 and $S_V$ = MobileNet-v2. Clearly, we achieve detection performance comparable to prior adversarial detection approaches. It must be noted that our work does not aim to outperform prior adversarial detection works. Instead, the striking feature of our approach is that it overcomes the dependence of prior adversarial detection works on the classifier model for training and adversarial detection.

Computational overhead: Besides high adversarial detection performance, our method requires 10-100x fewer operations and parameters for adversarial detection compared to all the prior works. Here, the number of operations is estimated as the total number of dot-product computations across all layers of the detector. Note, prior works use a larger number of detectors and larger detector sizes. Further, as their approaches are classifier model dependent, their detectors are always implemented along with the classifier model, leading to high memory and computation overhead. The low memory and computation overhead makes our approach suitable for deployment in resource-constrained systems.

5.4 LES-Training with Limited Training Data

Figure 6: AUC scores corresponding to different adversarial attacks for detectors created using 10%, 40% and 60% of the CIFAR100 dataset.

In this section, we evaluate the performance of detectors created with limited access to the actual dataset. For this, we train 3 detectors on subsets containing 1) 10%, 2) 40% and 3) 60% of the CIFAR100 dataset. The subsets are created by random sampling from the actual dataset. For reference, we also show the performance of a detector trained on the full dataset. In all cases, the detectors are created with $S_T$ = ResNet18 and tested on attacks generated using $S_V$ = MobileNet-v2. It can be observed that detectors trained with just 40% of the training data achieve AUC scores comparable to the detector trained on the full data. Interestingly, for some attacks (such as GN), merely 10% of the training data is sufficient for achieving high performance. Please refer to the Supplementary Material for similar results on the TinyImagenet and CIFAR10 datasets.

5.5 Transferability Across Datasets

In this section, we explore the following question: can a detector trained on dataset A (source dataset) be used to detect adversaries from another dataset B (target dataset)? Methodology: LES training is performed to create a detector network on the source dataset. Following this, a sample dataset $s_{nat,target}$ is drawn from the target dataset. Through our experiments, we find that 200 samples of $s_{nat,target}$ data are sufficient for transferability to the target dataset. The $s_{nat,target}$ data and the LES-trained detector are used to generate the $\mathcal{E}_{s_{nat,target}}$ distribution, and $\mathcal{E}_{Th}$ is chosen as its 95th percentile. A small sketch of this re-calibration step is given below.
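This sketch reuses the hypothetical fit_threshold helper from Section 4.1; the dataset object and sample-size handling are illustrative:

```python
import torch

@torch.no_grad()
def transfer_threshold(detector, target_dataset, n_samples=200, k=95):
    """Re-calibrate E_Th on a target dataset from ~200 natural samples."""
    loader = torch.utils.data.DataLoader(target_dataset, batch_size=n_samples, shuffle=True)
    x_sample, _ = next(iter(loader))
    return fit_threshold(detector, x_sample, k=k)   # helper from the earlier sketch
```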

Figure 7: AUC scores for a detector trained with source dataset TinyImagenet and transferred to target datasets CIFAR10 and CIFAR100.

Transferability of detectors trained on TinyImagenet: We train a detector, D-Tiny, with source dataset TinyImagenet (shown in Fig. 7). The detector is trained using LES training with an $S_T$ = ResNet18 network. D-Tiny is transferred to the target datasets CIFAR100 and CIFAR10 using the methodology explained above, with $S_V$ = MobileNet-v2. Evidently, D-Tiny, despite being trained on the TinyImagenet dataset, transfers well to both CIFAR10 and CIFAR100. For example, D-Tiny achieves performance against attacks on the CIFAR100 dataset similar to that of Detector 1, Detector 2 and Detector 3 from Table 3.

Figure 8: A detector trained on 40% of TinyImagenet successfully transfers to CIFAR100 and CIFAR10.

Interestingly, even detectors trained on smaller dataset like CIFAR100 can transfer reasonably to a larger dataset such as TinyImagenet (see Supplementary Material).

In Fig. 8, we evaluate the transferability of a detector, D-Tiny40, trained on 40% of the source dataset TinyImagenet and transferred to the target datasets CIFAR100 and CIFAR10. We find that even detectors trained with limited access to the TinyImagenet dataset transfer successfully. Similar experiments with other datasets are shown in the Supplementary Material.

6 Conclusion

In this work, we propose a classifier model-independent, energy separation-based adversarial detection method that does not require any access to the classifier model. We achieve comparable detection performance on a wide range of adversarial attacks at an extremely small memory and computational overhead compared to prior works. Moreover, our method requires less data than prior works while achieving strong performance, and the detector is transferable across different datasets. However, our method has a few limitations. Although our detection is classifier model agnostic, it still requires access to the data. While the data dependency of our approach is lower than that of prior works, an interesting future direction is to use synthetic data during LES training and transfer to real-world data without accessing any of it, which would make our detection approach more robust. Further, while our approach successfully detects a suite of attacks, it fails to detect decision-based adversarial attacks. Future work can focus on finding more sophisticated energy functions to detect such attacks.

References

  • [1] Ahuja, N.A., Ndiour, I., Kalyanpur, T., Tickoo, O.: Probabilistic modeling of deep features for out-of-distribution and adversarial detection. arXiv preprint arXiv:1909.11786 (2019)
  • [2] Akhtar, N., Mian, A.: Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access 6, 14410–14430 (2018)
  • [3] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: European Conference on Computer Vision. pp. 484–501. Springer (2020)
  • [4] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). pp. 39–57. IEEE (2017)
  • [5] Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., Usunier, N.: Parseval networks: Improving robustness to adversarial examples. In: International Conference on Machine Learning. pp. 854–863. PMLR (2017)
  • [6] Croce, F., Hein, M.: Minimally distorted adversarial examples with a fast adaptive boundary attack. In: International Conference on Machine Learning. pp. 2196–2205. PMLR (2020)
  • [7] Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: International conference on machine learning. pp. 2206–2216. PMLR (2020)
  • [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  • [9] Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J.: Boosting adversarial attacks with momentum. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 9185–9193 (2018)
  • [10] George, G.C., Moitra, A., Caculo, S., Prince, A.A.: Efficient architecture for implementation of hermite interpolation on fpga. In: 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP). pp. 7–12. IEEE (2018)
  • [11] Gong, Z., Wang, W., Ku, W.S.: Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960 (2017)
  • [12] Grosse, K., Manoharan, P., Papernot, N., Backes, M., McDaniel, P.: On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017)
  • [13] Guo, C., Rana, M., Cisse, M., Van Der Maaten, L.: Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017)
  • [14] He, Z., Rakin, A.S., Fan, D.: Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 588–597 (2019)
  • [15] Huang, B., Wang, Y., Wang, W.: Model-agnostic adversarial detection by random perturbations. In: IJCAI. pp. 4689–4696 (2019)
  • [16] Kim, H.: Torchattacks: A pytorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950 (2020)
  • [17] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [18] Kurakin, A., Goodfellow, I., Bengio, S., et al.: Adversarial examples in the physical world (2016)
  • [19] Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31 (2018)
  • [20] Lin, J., Gan, C., Han, S.: Defensive quantization: When efficiency meets robustness. arXiv preprint arXiv:1904.08444 (2019)
  • [21] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  • [22] Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267 (2017)
  • [23] Moitra, A., Panda, P.: Detectx–adversarial input detection using current signatures in memristive xbar arrays. IEEE Transactions on Circuits and Systems I: Regular Papers (2021)
  • [24] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia conference on computer and communications security. pp. 506–519 (2017)
  • [25] Sterneck, R., Moitra, A., Panda, P.: Noise sensitivity-based energy efficient and robust adversary detection in neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021)
  • [26] Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K.: Finn: A framework for fast, scalable binarized neural network inference. In: Proceedings of the 2017 ACM/SIGDA international symposium on field-programmable gate arrays. pp. 65–74 (2017)
  • [27] Wong, E., Rice, L., Kolter, J.Z.: Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994 (2020)
  • [28] Xie, C., Wang, J., Zhang, Z., Ren, Z., Yuille, A.: Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991 (2017)
  • [29] Xie, C., Zhang, Z., Zhou, Y., Bai, S., Wang, J., Ren, Z., Yuille, A.L.: Improving transferability of adversarial examples with input diversity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2730–2739 (2019)
  • [30] Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017)
  • [31] Yin, S., Jiang, Z., Seo, J.S., Seok, M.: Xnor-sram: In-memory computing sram macro for binary/ternary deep neural networks. IEEE Journal of Solid-State Circuits 55(6), 1733–1743 (2020)
  • [32] Yin, X., Kolouri, S., Rohde, G.K.: Gat: Generative adversarial training for adversarial example detection and robust classification. In: International conference on learning representations (2019)
  • [33] Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., Jordan, M.: Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning. pp. 7472–7482. PMLR (2019)