
AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis

Junfeng Guo, Ang Li, Cong Liu
Department of Computer Science
The University of Texas at Dallas
{jxg170016,angli,cong}@utdallas.edu
Abstract

Deep neural networks (DNNs) have been proven vulnerable to backdoor attacks. A backdoor is often embedded in the target DNN by injecting a backdoor trigger into training examples, which can cause the target DNN to misclassify any input attached with the trigger. Existing backdoor detection methods often require access to the original poisoned training data, the parameters of the target DNN, or the predictive confidence for each given input, all of which are impractical in many real-world applications, e.g., on-device deployed DNNs. We address the black-box hard-label backdoor detection problem where the DNN is fully black-box and only its final output label is accessible. We approach this problem from the optimization perspective and show that the objective of backdoor detection is bounded by an adversarial objective. Further theoretical and empirical studies reveal that this adversarial objective leads to a solution with a highly skewed distribution; a singularity is often observed in the adversarial map of a backdoor-infected example, which we call the adversarial singularity phenomenon. Based on this observation, we propose the adversarial extreme value analysis (AEVA) to detect backdoors in black-box neural networks. AEVA is based on an extreme value analysis of the adversarial map, computed from Monte Carlo gradient estimation. Evidenced by extensive experiments across multiple popular tasks and backdoor attacks, our approach is shown to be effective in detecting backdoor attacks under the black-box hard-label scenario.

1 Introduction

Deep Neural Networks (DNNs) have been pervasively used in a wide range of applications such as facial recognition (Masi et al., 2018), object detection (Szegedy et al., 2013; Li et al., 2022), autonomous driving (Okuyama et al., 2018), and home assistants (Singh et al., 2020; Zhai et al., 2021). Meanwhile, DNNs have become increasingly complex, and training state-of-the-art models requires enormous data and expensive computation. To address this problem, vendors and developers have started to provide pre-trained DNN models. Similar to software shared on GitHub, pre-trained DNN models are being published and shared on online venues such as the BigML model market, the ONNX zoo and the Caffe model zoo.

Since the dawn of software distribution, there has been an ongoing war between publishers sneaking malicious code and backdoors into their software and security personnel detecting them. Recent studies show that DNN models can contain similar backdoors, which are induced by contaminated training data. Models containing backdoors can sometimes even perform better than regular models on untampered test inputs. However, on inputs tampered with a specific pattern (called a trojan trigger), models containing backdoors suffer significant accuracy loss.

There has been a significant amount of recent work on detecting backdoor triggers. However, those solutions require access to the original poisoned training data (Chen et al., 2019a; Tran et al., 2018; Huang et al., 2022), the parameters of the trained model (Chen et al., 2019b; Guo et al., 2020; Liu et al., 2019; Wang et al., 2019; 2020; Dong et al., 2021; Kolouri et al., 2020), or the predicted confidence score of each class (Dong et al., 2021). Unfortunately, it is costly and often impractical for the defender to access the original poisoned training dataset. When DNNs are deployed on safety-critical platforms (Fowers et al., 2018; Okuyama et al., 2018; Li et al., 2021b) or in cloud services (Fowers et al., 2018; Chen et al., 2019c), it is also impractical to access either their parameters or the predicted confidence score of each class (Chen et al., 2019c; 2020b).

We present the black-box hard-label backdoor detection problem, where the DNN is fully black-box and only its final output label is accessible (Fig. 1). Detecting backdoor-infected DNNs in such a black-box setting becomes critical due to the emerging deployment of models on embedded devices and remote cloud servers.

Figure 1: An illustration of the black-box hard-label backdoors.

In this setting, the typical optimization objective of backdoor detection (Chen et al., 2019b; Liu et al., 2019) becomes impossible to solve due to the limited information. However, through a theoretical analysis, we show that the backdoor objective is bounded by an adversarial objective, which can be optimized using Monte Carlo gradient estimation in our black-box hard-label setup. Further theoretical and empirical studies reveal that this adversarial objective leads to a solution with a highly skewed distribution; a singularity is likely to be observed in the adversarial map of a backdoor-infected example, which we call the adversarial singularity phenomenon.

Based on these findings, we propose the adversarial extreme value analysis (AEVA) algorithm to detect backdoors in black-box neural networks. The AEVA algorithm is based on an extreme value analysis (Leadbetter, 1991) of the adversarial map. We detect the adversarial singularity phenomenon by examining the adversarial peak, i.e., the maximum value of the computed adversarial perturbation map. We perform a statistical study which reveals that there is around a 60\% chance that the adversarial peak of a random sample for a backdoor-infected DNN is larger than the adversarial peak of any example for an uninfected DNN. Inspired by univariate extreme value theory, we propose a global adversarial peak (GAP) value, obtained by sampling multiple examples and taking the maximum over their adversarial peaks, to ensure a high success rate. Following previous works (Dong et al., 2021; Wang et al., 2019), the Median Absolute Deviation (MAD) algorithm is applied on top of the GAP values to test whether a DNN is backdoor-infected.

Through extensive experiments across three popular tasks and state-of-the-art backdoor techniques, AEVA proves effective in detecting backdoor attacks under the black-box hard-label scenario. The results show that AEVA can efficiently detect backdoor-infected DNNs, yielding an overall detection accuracy \geq 86.7\% across various tasks, DNN models and triggers. Interestingly, when compared with two state-of-the-art white-box backdoor detection methods, AEVA yields comparable performance, even though AEVA is a black-box method with limited access to information.

We summarize our contributions as below (code: https://github.com/JunfengGo/AEVA-Blackbox-Backdoor-Detection-main):

  1. To the best of our knowledge, we are the first to present the black-box hard-label backdoor detection problem and provide an effective solution to this problem.

  2. We provide a theoretical analysis which shows that the backdoor detection optimization is bounded by an adversarial objective. We further reveal the adversarial singularity phenomenon, where the adversarial perturbation computed from a backdoor-infected neural network is likely to follow a highly skewed distribution.

  3. We propose a generic backdoor detection framework, AEVA, which optimizes the adversarial objective and performs extreme value analysis on the optimized adversarial map. AEVA is applicable to the black-box hard-label setting via Monte Carlo gradient estimation.

  4. We evaluate AEVA on three widely-adopted tasks with different backdoor trigger implementations and complex black-box attack variants. All results suggest that AEVA is effective for black-box hard-label backdoor detection.

1.1 Related Work

Backdoor attacks. BadNets (Gu et al., 2019) is probably the first work on backdoor attacks against DNNs; it causes targeted misclassification by injecting a small trigger into some training samples and mislabeling these samples with a specific target class. The Trojaning attack (Liu et al., 2018) generates a trigger that maximizes the activation of certain selected neurons using limited training data. Chen et al. (2017) further propose to perform backdoor attacks under a rather weak threat model where the attacker cannot access the target model or the training dataset. Most recently, a set of backdoor attacks (Ji et al., 2018; Liu et al., 2020; Saha et al., 2020; Turner et al., 2019; Yao et al., 2019) have been proposed, which build upon existing work but focus on various specific scenarios, e.g., the physical world, transfer learning, etc.

Backdoor detection (white-box). A few works (Chen et al., 2019a; Tran et al., 2018) detect backdoor samples within the training dataset. Chen et al. (2019a) propose a neuron-activation clustering approach that identifies backdoor samples by clustering the training data based on the neuron activations of the target DNN. Tran et al. (2018) distinguish backdoor samples from clean samples based on the spectrum of the feature covariance of the target DNN. A few other works focus on detecting backdoor-infected DNNs (Liu et al., 2019; Wang et al., 2019). ABS (Liu et al., 2019) identifies infected DNNs by seeking compromised neurons representing the features of backdoor triggers. Neural Cleanse (Wang et al., 2019) reverse-engineers the trigger by solving an optimization problem. Recent works (Chen et al., 2019b; Guo et al., 2020; Wang et al., 2020) improve Neural Cleanse with better objective functions. However, these methods are all white-box, requiring access to model parameters or internal neuron values.

Backdoor detection (black-box). A recent work (Chen et al., 2019b) claims to detect backdoor attacks in a “black-box” setting. However, their method still needs the DNN’s parameters to train a separate generator (Goodfellow et al., 2014), so strictly speaking it is not “black-box”, as also noted by Dong et al. (2021). To the best of our knowledge, Dong et al. (2021) is the only existing work on detecting backdoor-infected DNNs in the black-box setting. However, their method requires the predictive confidence score for each input to run the NES algorithm (Wierstra et al., 2014), which weakens its practicality. Our work differs from all previous methods in that we address a purely black-box setup where only the hard output label is accessible; both model parameters and training examples are inaccessible in the black-box hard-label setting.

2 Preliminaries: black-box backdoor attack and defense

2.1 Threat model

Our threat model involves two parties: the adversary and the defender. The threat model of the adversary follows previous works (Chen et al., 2017; Gu et al., 2019; Liu et al., 2018): the attacker can inject an arbitrary number of backdoor samples into the training dataset and cause targeted misclassification to a specific label without affecting the model's accuracy on normal examples. For the defender, we consider the threat model with the weakest assumption, in which the poisoned training dataset and the parameters of the target DNN f(\cdot) are inaccessible. The defender can only obtain the final predicted label for each input from the target DNN and has access to a small validation set (40 images for each class). Therefore, the defender can only query the target DNN to obtain its final decisions. The defender's goal is to identify whether the target DNN is infected and which label is infected.

2.2 Problem definition

Consistent with prior studies (Dong et al., 2021; Kolouri et al., 2020; Wang et al., 2019; Huang et al., 2022), we deem a DNN backdoor-infected if one can make an arbitrary input misclassified as the target label with a minor modification to the input. Without loss of generality, given an original input x\in\mathbb{R}^{n}, the modified input containing the backdoor trigger can be formulated as:

\hat{x}=b(x)=(1-m)\odot x+m\odot\Delta, \quad (1)

where \Delta\in\mathbb{R}^{n} represents the backdoor trigger and \odot represents the element-wise product. m\in\{0,1\}^{n} is a binary mask that determines the position and magnitude of the backdoor trigger. Typically, \|m\|_{1} ought to be very small to ensure visual indistinguishability between \hat{x} and x.
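As a concrete illustration, the sketch below applies Eq. 1 to an image tensor; the 32x32x3 shape, the pixel range and the 4x4 bottom-right trigger placement are illustrative assumptions rather than a fixed setting of this paper.

```python
import numpy as np

def apply_trigger(x, delta, mask):
    """Apply a backdoor trigger following Eq. 1: x_hat = (1-m)*x + m*delta."""
    return (1.0 - mask) * x + mask * delta

# Illustrative 32x32x3 image in [0, 1] and a 4x4 white square trigger
# placed at the bottom-right corner; the binary mask m is sparse.
x = np.random.rand(32, 32, 3)
delta = np.ones((32, 32, 3))      # trigger pattern Delta
mask = np.zeros((32, 32, 3))
mask[-4:, -4:, :] = 1.0           # ||m||_1 is small relative to the image

x_hat = apply_trigger(x, delta, mask)
```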

We define f(\cdot;\theta) as the target DNN with parameters \theta, which is a function that maps an input x to a label y. \phi(x,y;\theta)\in[0,1] is defined as the probability that x is classified as y by f(\cdot;\theta); the argument \theta is sometimes omitted for simplicity. For f to mislabel \hat{x} as the target label y_{t}, there are two possible reasons: the backdoored training data either contains the same trigger but is mislabeled as y_{t}, or contains features similar to normal inputs (not belonging to y_{t}, but visually similar to y_{t}) and is labeled as y_{t}. The backdoor-infected model is obtained by solving:

\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^{N_{o}}\ell(\phi(x_{i},y_{i};\theta),y_{i})+\sum_{j=1}^{N_{b}}\ell(\phi(\hat{x}_{j},y_{t};\theta),y_{t}), \quad (2)

where there are N_{o} original training examples and N_{b} backdoor examples. \ell(\cdot) represents the cross-entropy loss function; x_{i} and y_{i} represent the training sample and its corresponding ground-truth label, respectively. In the inference phase, the target DNN will predict each \hat{x} as the target label y_{t}, which can be formulated as f(\hat{x})=y_{t}. Prior works (Dong et al., 2021; Wang et al., 2019; Chen et al., 2019b) seek to reveal the existence of a backdoor through investigating whether there exists a backdoor trigger with minimal \|m\|_{1} that can cause misclassification to a certain label.

We mainly study the sparse mask since it is widely used in previous works (Wang et al., 2020; Dong et al., 2021; Chen et al., 2019b; Guo et al., 2020; Wang et al., 2019; Gu et al., 2019). When the mask becomes dense, the mask m has to contain low-magnitude continuous values to keep the backdoor examples visually indistinguishable. However, when the values of m are small, every normal example x_{i} lies near the decision boundary of the function \phi. By the Taylor theorem, \phi(\hat{x}_{i})=\phi((1-m)x_{i}+m\Delta)=\phi(x_{i}+m(\Delta-x_{i}))\approx\phi(x_{i})+m(\Delta-x_{i})^{\intercal}\nabla_{x}\phi. Since the output label changes dramatically from \phi(x_{i}) to \phi(\hat{x}_{i}), the gradient \nabla_{x}\phi must have a large magnitude, which results in a non-robust model sensitive to subtle input changes. Such models become impractical because a small random perturbation of the input can lead to significantly different model outputs. More details and an empirical study can be found in Appendix A.

2.3 Black-box backdoor detection and its challenges

Most previous works (Wang et al., 2019; Dong et al., 2021; Chen et al., 2019b) focus on reverse-engineering the backdoor trigger for each target label y_{t} using the following optimization:

\operatorname*{arg\,min}_{m,\Delta}\sum_{i}\ell(\phi((1-m)\odot x_{i}+m\odot\Delta,y_{t}),y_{t})+\beta\|m\|_{1}, \quad (3)

where \beta is the balancing parameter. Unfortunately, solving Eq. 3 is notoriously hard in the black-box hard-label setting since \theta is unknown. An additional difficulty comes from the fact that effective zeroth-order gradient estimation requires each example x_{i} superimposed with the trigger to be close to the decision boundary of y_{t} (Chen et al., 2020a; Cheng et al., 2020; Li et al., 2020). However, such a condition is hard to achieve in the hard-label black-box setting.

3 Adversarial extreme value analysis for backdoor detection

Our AEVA framework is introduced in this section. We first derive an upper bound on the backdoor detection objective in Eq. 3 and establish its connection to adversarial attacks. The adversarial singularity phenomenon is then revealed through both theoretical results and empirical studies. Based on our findings, we propose the Global Adversarial Peak (GAP) measure, computed by extreme value analysis on the adversarial perturbation. The GAP score is finally used with the Median Absolute Deviation (MAD) to detect backdoors in neural networks. Monte Carlo gradient estimation is further introduced in order to solve the adversarial objective in the black-box setting.

3.1 Adversarial objective as an upper bound of the backdoor objective

We present in this section the connection between the backdoor objective in Eq. 3 and the objective used in adversarial attacks. To begin with, we rewrite the backdoor masking equation as (1-m)\odot x_{i}+m\odot\Delta=x_{i}+\mu-m\odot x_{i}, where \mu=m\odot\Delta. Following this, the optimization in Eq. 3 converts to minimizing the following term:

F=\frac{1}{N}\sum_{i=1}^{N}\ell(\phi(x_{i}+\mu-m\odot x_{i},y_{t};\theta),y_{t}). \quad (4)

We further introduce an important result from the Multivariate Taylor Theorem below.

Lemma 1.

Given a continuously differentiable function f(\cdot) defined on \mathbb{R}^{n} and vectors x,h\in\mathbb{R}^{n}, for any M\geq|\partial f/\partial h_{i}|, \forall 1\leq i\leq n, we have |f(x+h)-f(x)|\leq M\|h\|_{1}.

The lemma can be proved using the Lagrange Remainder. Please refer to the Appendix B.1 for a detailed proof. According to the lemma, we have

\left|F-\frac{1}{N}\sum_{i=1}^{N}\ell(\phi(x_{i}+\mu,y_{t};\theta),y_{t})\right|\leq\frac{1}{N}\sum_{i=1}^{N}C\|m\odot x_{i}\|_{1}\leq C\|m\|_{1}=C\|\mu\|_{0}, \quad (5)

where the latter inequality holds because each example is bounded (x_{i}\leq 1), and the last equality \|m\|_{1}=\|m\odot\Delta\|_{0}=\|\mu\|_{0} holds because m is a binary mask. We can then see that the adversarial objective is an upper bound of the backdoor detection objective, i.e., F\leq\frac{1}{N}\sum_{i}\ell(\phi(x_{i}+\mu),y_{t})+C\|\mu\|_{0}. Instead of optimizing the original Eq. 3, we propose to minimize this \ell^{0}-regularized objective. While the \ell^{0} norm enforces sparsity of the solution, optimizing it is NP-hard. In practice, it is replaced with the \ell^{1} norm as an envelope of the objective (Ramirez et al., 2013; Donoho, 2006):

\hat{\mu}=\arg\min_{\mu}\sum_{i}\ell(\phi(x_{i}+\mu,y_{t};\theta),y_{t})+\lambda\|\mu\|_{1}. \quad (6)

This adversarial optimization can be solved using Monte Carlo gradient estimation in our black-box setting. More details are described in Section 3.4. In the following, we will first discuss how the solution to this adversarial objective can be used to detect backdoors in neural networks.
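A minimal white-box sketch of how \hat{\mu} could be obtained from Eq. 6 is shown below; it assumes access to a function returning the gradient of the classification loss w.r.t. the input, which in the black-box case is replaced by the Monte Carlo estimate of Section 3.4. The step size, iteration count and the toy gradient used in the usage example are illustrative assumptions.

```python
import numpy as np

def optimize_perturbation(grad_fn, x_batch, lam=1e-3, lr=0.1, steps=200):
    """(Sub)gradient descent on Eq. 6:
    minimize sum_i loss(phi(x_i + mu), y_t) + lam * ||mu||_1 over mu.
    grad_fn(x) returns the gradient of the classification loss w.r.t. x."""
    mu = np.zeros_like(x_batch[0])
    for _ in range(steps):
        # Average loss gradient over the batch, evaluated at x_i + mu.
        g = np.mean([grad_fn(x + mu) for x in x_batch], axis=0)
        g += lam * np.sign(mu)        # subgradient of the l1 penalty
        mu -= lr * g
    return mu

# Toy usage with a quadratic surrogate loss whose gradient is (x - target).
x_batch = np.random.rand(8, 32, 32, 3)
target = np.zeros((32, 32, 3))
mu_hat = optimize_perturbation(lambda x: x - target, x_batch)
```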

3.2 The adversarial singularity phenomenon

A natural and important question here is what the adversarial perturbation \hat{\mu} would look like when the model is infected with a backdoor. A direct analysis of deep neural networks is hard, so we start by analyzing a linear model to shed light on the intuition behind our approach.

Lemma 2.

Given a linear model parameterized by \theta optimized on N_{o} original training examples and N_{b} backdoored training examples with the objective in Eq. 2 and a mean-squared-error loss, the adversarial solution \hat{\mu} optimized using Eq. 6 will be dominated by the input dimensions corresponding to the backdoor mask m when N_{b} is large, i.e., \lim_{N_{b}\to\infty}\|(1-m)\odot\hat{\mu}\|_{1}/\|\hat{\mu}\|_{1}=0.

The lemma can be proved using least-squares solutions and exact gradient computation; a detailed proof can be found in Appendix B.2. The lemma reveals that the majority of the mass in the adversarial perturbation is concentrated in the mask area. Since the mask m is usually sparse, it is reasonable to expect a highly skewed distribution in the adversarial map \hat{\mu}.
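The toy experiment below sketches Lemma 2 numerically under assumed dimensions, sample counts and trigger values: a least-squares linear model is fit on synthetic clean and backdoored data, the adversarial perturbation is obtained by gradient descent on the squared loss, and the mass of \hat{\mu} outside the mask shrinks as N_b grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n_o, n_b, y_t = 20, 16, 500, 50000, 1.0   # assumed toy sizes

m = np.zeros(d); m[p:] = 1.0            # backdoor mask on the last d-p dims
delta = np.zeros(d); delta[p:] = 0.8    # trigger values

X = rng.normal(size=(n_o, d)); Y = rng.normal(size=n_o)       # clean data
Xb = (1 - m) * rng.normal(size=(n_b, d)) + m * delta          # backdoored data
Yb = np.full(n_b, y_t)

# Least-squares solution on clean + backdoored data (cf. Eq. 15).
A = np.vstack([X, Xb]); b = np.concatenate([Y, Yb])
theta = np.linalg.lstsq(A, b, rcond=None)[0]

# Gradient descent on the adversarial objective G = (theta^T (x + mu) - y_t)^2;
# mu stays parallel to theta, so its mass concentrates on the mask dims.
x = rng.normal(size=d); mu = np.zeros(d)
for _ in range(500):
    mu -= 0.01 * 2.0 * (theta @ (x + mu) - y_t) * theta

ratio = np.abs((1 - m) * mu).sum() / np.abs(mu).sum()
print(f"mass of mu outside the mask: {ratio:.3f}")   # small for large n_b
```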

We suspect that such skewed distributions might also occur in the adversarial map of a deep neural network. While a thorough analysis of DNNs is hard, recent studies on the Neural Tangent Kernel (NTK) (Jacot et al., 2018) have shown that a deep neural network with infinite width can be treated as a kernel least-squares model. We therefore further extend our analysis to a k-way kernel least-squares classifier as the model, with a cross-entropy loss used in the adversarial analysis.

Lemma 3.

Suppose the training dataset consists of N_{o} original examples and N_{b} backdoor examples, i.i.d. sampled from a uniform distribution and belonging to k classes. Let \phi be a multivariate kernel regression with the objective in Eq. 2 and an RBF kernel. Then the adversarial solution \hat{\mu} to Eq. 6 under the cross-entropy loss satisfies \lim_{N_{b}\to\infty}\mathbb{E}[(1-m)\odot\hat{\mu}]=0.

The proof can be found in Appendix B.3. The lemma reveals a result similar to the linear case: the majority of the mass of the adversarial perturbation (in expectation) stays in the mask area when sufficiently many backdoor examples are used. Although the lemma does not directly address deep neural networks, we hope this result helps understand the problem in the NTK regime.

We perform an empirical study on DNNs to validate our intuition. Following previous works (Dong et al., 2021; Wang et al., 2019; 2020), we implement BadNets (Gu et al., 2019) as the backdoor attack method. We randomly select a label as the infected label and train a ResNet-44 on CIFAR-10 with the backdoor attack embedded, resulting in an attack success rate of 99.87\%. The backdoor trigger is a 4x4 square shown in Fig. 2(a), which occupies at most 1.6\% of the whole image. The adversarial perturbations \hat{\mu} (obtained by optimizing Eq. 6) are generated for both infected and uninfected labels. The perturbations are normalized to a heat map as |\hat{\mu}|/\|\hat{\mu}\|_{1} and shown in Fig. 2. A high concentration of mass is clearly observed in the infected case in Fig. 2(b). We call this the adversarial singularity phenomenon, and it motivates our adversarial extreme value analysis (AEVA) method.

Figure 2: The distribution of values in the normalized adversarial perturbation \mu for the infected (b) and uninfected (c) labels, using the backdoor map in (a). The x,y axes represent the image space; the z axis represents the absolute value of the normalized \mu at each pixel location. (a) Backdoor map \Delta; (b) \hat{\mu} for the infected label; (c) \hat{\mu} for the uninfected label.

3.3 Adversarial extreme value analysis

Figure 3: The distributions of the adversarial peak \mu_{max} for infected and uninfected labels.

A natural way to identify a singularity is to look at the peak value of the distribution. We define the adversarial peak \mu_{max}=\max\mu_{ij} for an adversarial perturbation \mu. To gain a statistical understanding of the relationship between the adversarial peak and backdoor attacks, we randomly sample 1000 normalized \mu for uninfected and infected labels and plot the distribution of \mu_{max} in Fig. 3. The experimental setup is consistent with the empirical study in Sec. 3.2. We notice from the figure that \mu_{max} of the infected labels reaches a much larger range compared to that of the uninfected labels. However, the two distributions still overlap; around 47\% of the infected \mu_{max} values stay within the uninfected range (the area with both blue and orange bars). This result suggests that

the adversarial singularity phenomenon does not always occur in backdoor-infected DNNs. This is reasonable because some samples are inherently vulnerable to adversarial attacks and their adversarial perturbations have properties similar to backdoors, which reduces the chance of observing a singularity.

Now we know that if we define a threshold T as the maximum value over the uninfected labels, the probability of a backdoor-infected \mu_{max} being smaller than T is 0.47, i.e., P(\mu_{max}<T)=0.47. With a single example it is impossible to know whether a label is infected. However, according to univariate extreme value theory (Gomes & Guillou, 2015), if we sample k examples of \mu_{max} for the same label,

P(\max\{\mu_{max}^{1},\mu_{max}^{2},\ldots,\mu_{max}^{k}\}<T)=(P(\mu_{max}<T))^{k}. \quad (7)
Figure 4: The proportion of infected labels whose \max\{\mu_{max}^{i}\} is larger than that of the uninfected ones.

We name the maximum value over all k adversarial peaks the Global Adversarial Peak (GAP). We vary the choice of k in Fig. 4 and observe that the chance of the GAP value being lower than T diminishes. For example, if we take k=6, then P(GAP<T)=0.47^{6}\approx 0.01 and the success rate of identifying a backdoor-infected label is 99\%. Note that, given the long-tailed distribution of the infected \mu_{max}, this threshold T is not sensitive and can be made even larger to safely include all uninfected labels. However, for larger T values, we need a larger k to ensure the desired success rate.
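A small sketch of the GAP computation and of the success-rate estimate in Eq. 7 is shown below; the per-sample perturbations are assumed to come from the optimizer of Eq. 6 (sketched earlier), and the 0.47 value is the empirical probability quoted above.

```python
import numpy as np

def adversarial_peak(mu):
    """Peak of the normalized adversarial map |mu| / ||mu||_1."""
    a = np.abs(mu)
    return a.max() / a.sum()

def global_adversarial_peak(perturbations):
    """GAP: maximum adversarial peak over k sampled examples."""
    return max(adversarial_peak(mu) for mu in perturbations)

# Probability that the GAP of an infected label stays below the benign
# threshold T, given P(mu_max < T) = 0.47 for a single sample (Eq. 7).
for k in (1, 3, 6):
    print(k, round(0.47 ** k, 4))
```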

3.4 Black-box optimization via gradient estimation

Having investigated the unique properties of backdoor-infected DNNs in the white-box setting, we now illustrate how to exploit such properties under the rather restricted scenario we consider. In the white-box setting, free access to \theta enables us to compute \nabla_{x}\phi(x,y_{t}) accurately, resulting in a \mu with minimal \|\mu\|_{1} through optimizing Eq. 6. For the black-box setting, we propose to leverage the zeroth-order gradient estimation technique (Fu & Hu, 2012) to address the challenge of calculating \nabla_{x}\phi(x,y_{t}). We choose a Monte Carlo based method (Fu & Hu, 2012) to obtain the estimated gradient \widetilde{\nabla\phi}_{x}(x,y_{t}), which queries the model with several inputs and utilizes the corresponding outputs to estimate the gradient. More details about gradient estimation can be found in Appendix C.
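The snippet below is a generic sketch of a two-point Monte Carlo (zeroth-order) gradient estimator for a loss that can only be queried; it is not the exact estimator of Appendix C, and the smoothing radius, sample count and the toy loss in the usage example are illustrative assumptions.

```python
import numpy as np

def estimate_gradient(loss_fn, x, num_samples=100, sigma=1e-2):
    """Two-point Monte Carlo estimate of the gradient of a black-box scalar
    loss: average of (loss(x + s*u) - loss(x - s*u)) / (2s) * u over random
    Gaussian directions u."""
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = np.random.randn(*x.shape)
        diff = loss_fn(x + sigma * u) - loss_fn(x - sigma * u)
        grad += diff / (2.0 * sigma) * u
    return grad / num_samples

# Toy usage: for loss(v) = sum(v^2) at x = 1, the true gradient is 2.
g = estimate_gradient(lambda v: float(np.sum(v ** 2)), np.ones(8))
print(np.round(g, 2))
```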

3.5 Final algorithm

Figure 5: Overview of AEVA – Adversarial Extreme Value Analysis.

Putting all the above ideas together, we present the AEVA framework (illustrated in Fig. 5). The algorithm for computing the GAP value of a given label is described in Algorithm 1. Our approach requires a batch of legitimate samples X_{i} for each class i. For each label y_{t} to be analyzed, the algorithm decides whether y_{t} is backdoor-infected. We treat y_{t} as the targeted label and collect a batch of legitimate samples belonging to y_{t}, denoted X_{t}, with the same batch size as that of X_{i}. We then leverage X_{t} to push each sample within every \{X_{i}\}_{i=0}^{n} toward the decision boundary of y_{t}. This is done by optimizing \hat{\mu} from Eq. 6 using Monte Carlo gradient estimation. The adversarial peak \mu_{max} is then computed for each example, and the global adversarial peak (GAP) is aggregated over all examples. After calculating the GAP value \mathcal{R}(y_{i}) for each class i, following previous work (Wang et al., 2019; Dong et al., 2021), we apply Median Absolute Deviation (MAD) to detect outliers among \{\mathcal{R}(y_{i})\}_{i=1}^{n}. Specifically, we use MAD to calculate an anomaly index for each \mathcal{R}(y_{i}), assuming \{\mathcal{R}(y_{i})\}_{i=1}^{n} follows a Gaussian distribution. An outlier is then detected by thresholding the anomaly index.
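The sketch below summarizes the per-label loop of Algorithm 1; it reuses the hypothetical helpers optimize_perturbation, estimate_gradient and adversarial_peak from the earlier sketches, and the way the per-label loss oracle is constructed is an assumption rather than the exact implementation.

```python
def aeva_gap_scores(labels, sample_batches, loss_fn_for_label):
    """Compute the GAP score R(y_t) for every candidate target label y_t
    (a sketch of Algorithm 1).  sample_batches[i] holds legitimate samples
    of class i; loss_fn_for_label(y_t) returns a hard-label loss oracle
    towards label y_t that can only be queried."""
    scores = {}
    for y_t in labels:
        loss_fn = loss_fn_for_label(y_t)
        grad_fn = lambda v: estimate_gradient(loss_fn, v)   # black-box gradient
        peaks = []
        for i, batch in sample_batches.items():
            if i == y_t:
                continue
            for x in batch:
                # Perturb x towards the decision boundary of y_t (Eq. 6).
                mu = optimize_perturbation(grad_fn, [x])
                peaks.append(adversarial_peak(mu))
        scores[y_t] = max(peaks)     # global adversarial peak (GAP) for y_t
    return scores
```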

4 Experiments

4.1 Settings

Datasets. We evaluate our approach on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. For each task, we build 240 infected and 240 benign models with different architectures: ResNets (He et al., 2016) and DenseNets (Huang et al., 2017). We randomly select one label for each backdoor-infected model and inject sufficient poisoned samples to ensure an attack success rate \geq 98\%. More details can be found in the Appendix.

Attack methods. Four different attack methods are implemented: BadNets (Gu et al., 2019), the Watermark attack (Chen et al., 2017), the Label-Consistent attack (Turner et al., 2019) and the Invisible attack (Li et al., 2021a). For BadNets and Watermark, the triggers are 4\times 4 squares. The transparency ratio for the Watermark attack is 0.1. For the Label-Consistent and Invisible attacks, we use the default less-visible triggers. Around 10\% poisoned data is injected during training. Each model is trained for 200 epochs with data augmentation. More details can be found in Appendix G.

Baseline methods. Since there is no existing black-box hard-label method, we compare our approach with two state-of-the-art white-box backdoor detection methods: Neural Cleanse (NC) (Wang et al., 2019) and DL-TND (Wang et al., 2020). We implement NC and DL-TND following https://github.com/bolunwang/backdoor and https://github.com/wangren09/TrojanNetDetector, respectively. For each task, we use 200 samples for gradient estimation, and the batch size for each \{X_{i}\}_{i=1}^{n} is set to 40 in Algorithm 1.

Outlier detection. To accurately identify anomalous values among \{\mathcal{R}(y_{i})\}_{i=1}^{n}, we assume the scores follow a normal distribution and apply a consistency constant (1.4826) to normalize the computed anomaly index, similar to (Wang et al., 2019). We set the threshold value \tau=4 for our MAD detector, which means we identify any class whose anomaly index is larger than 4 as infected. This value is chosen using a hold-out validation set of infected and benign models.
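The MAD-based outlier test described above can be sketched as follows, using the 1.4826 consistency constant and the threshold \tau=4; the example scores are made up for illustration.

```python
import numpy as np

def mad_anomaly_indices(gap_scores, tau=4.0):
    """Return the MAD anomaly index of each GAP score and the labels
    whose index exceeds tau (flagged as infected)."""
    r = np.asarray(gap_scores, dtype=float)
    med = np.median(r)
    mad = 1.4826 * np.median(np.abs(r - med))   # consistency constant for Gaussian data
    index = np.abs(r - med) / mad
    return index, np.where(index > tau)[0]

# Illustrative GAP scores for a 5-class model; label 3 stands out.
indices, infected = mad_anomaly_indices([0.02, 0.03, 0.02, 0.25, 0.03])
print(indices.round(1), infected)
```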

Evaluation metrics. We follow previous work (Kolouri et al., 2020; Wang et al., 2020; Dong et al., 2021; Guo et al., 2020) in using two common metrics: (a) the Area Under the Receiver Operating Curve (AUROC) – the Receiver Operating Curve (ROC) shows the trade-off between the detection success rate for infected models and the detection error rate for benign models across different decision thresholds \tau for the anomaly index; (b) Detection Accuracy (ACC) – the proportion of models that are correctly identified. Infected models are correctly identified if and only if the infected labels are identified without mistagging other uninfected labels.

Figure 6: The visualization of the adversarial perturbation \hat{\mu} for infected labels with different backdoor attacks on TinyImageNet. For each attack, we show the trigger (left), the \hat{\mu} map (middle) and the \hat{\mu} distribution in 3D (right).
Detection Accuracy (ACC)

Task         | BadNets | Watermark | Label Consistent | Invisible Attack | Total (Infected) | Benign
CIFAR-10     | 86.7%   | 93.3%     | 93.3%            | 93.3%            | 91.7%            | 95.4%
CIFAR-100    | 93.3%   | 100%      | 96.7%            | 96.7%            | 96.7%            | 97.5%
TinyImageNet | 96.7%   | 100%      | 93.3%            | 96.7%            | 96.7%            | 99.2%

Table 1: Overall performance of our approach on three tasks.
Figure 7: The Receiver Operating Curve (ROC) for NC, DL-TND and AEVA on the (a) CIFAR-10, (b) CIFAR-100 and (c) TinyImageNet tasks. AEVA is black-box while the other two methods are white-box.

4.2 Results

We first investigate whether AEVA can reveal the singularity phenomenon for labels infected by different attack approaches. We randomly select one model infected by each attack approach and one uninfected model for the TinyImageNet task. Notably, all these selected models are correctly identified by our approach. We plot the corresponding normalized adversarial perturbations \mu in Fig. 6, which demonstrates that AEVA can accurately distinguish between uninfected and infected labels and reveal the singularity phenomenon caused by backdoor triggers.

Table 1 presents the overall results. AEVA accurately detects the infected label with ACC \geq 86.7\% across all three tasks and various trigger settings. We compare our approach with existing white-box detection approaches, including Neural Cleanse (NC) (Wang et al., 2019) and DL-TND (Wang et al., 2020). The comparison results over all infected and benign models are shown in Fig. 7; more results can be found in Appendix J. The results suggest that AEVA achieves performance comparable to existing white-box detection approaches across different settings. Such close performance also indicates the efficacy of AEVA for black-box hard-label backdoor detection.

4.3 Ablation Study

Figure 8: The impact of trigger size on detection accuracy.

The impact of trigger size. We test AEVA on TinyImageNet with different trigger sizes for the BadNets and Watermark attacks. For each trigger size and attack method, we build 60 infected models with various architectures following the configurations in Sec. 4.2. Fig. 8 shows that AEVA remains effective when the trigger size is less than 14\times 14, with ACC \geq 71.7\%. For large triggers, AEVA cannot identify the infected label since the singularity property is alleviated, which is consistent with our theoretical analysis. Even though large-trigger attacks can bypass AEVA, they are either visually distinguishable or lead to a non-robust model sensitive to input changes, making them less stealthy and impractical.

Figure 9: The impact of the number of infected labels on detection accuracy.

The impact of the number of infected labels. We further investigate the impact of the number of infected labels. We build 60 infected models on CIFAR-10 to evaluate AEVA. We randomly select the infected labels and inject a 2\times 2 trigger for each infected label. Fig. 9 shows that AEVA performs effectively when the number of infected labels is less than 3 (i.e., 30\% of all labels), with ACC \geq 78.3\%. This is because too many infected labels break the MAD outlier detection. Interestingly, existing white-box detection methods (Wang et al., 2019; Chen et al., 2019b; Guo et al., 2020; Liu et al., 2019; Wang et al., 2020) cannot perform effectively in such scenarios either. However, too many infected labels would also reduce the stealthiness of the attack.

Besides these studies, we further evaluate our approach under more scenarios, including the impact of the number of available labels, multiple triggers for a single infected label, different backdoor trigger shapes and potential adaptive backdoor attacks, which are included in the Appendix.

5 Conclusion

This paper takes a first step toward addressing the black-box hard-label backdoor detection problem. We propose the adversarial extreme value analysis (AEVA) algorithm, which is based on an extreme value analysis of the adversarial map, computed via Monte Carlo gradient estimation due to the black-box hard-label constraint. Extensive experiments demonstrate the efficacy of AEVA across a set of popular tasks and state-of-the-art backdoor attacks.

Ethics Statement

Our work aims at providing neural network practitioners additional protection against backdoor attacks. We believe our work can contribute positively to society and avoid potential harm since it addresses a critical safety problem. We are unaware of any direct negative impact of this work. Our method has certain limitations, such as its sensitivity to large backdoor triggers and multiple infected labels. However, as discussed earlier in the paper, these scenarios make the attack less stealthy and impractical in real applications.

Reproducibility Statement

Our work is built upon comprehensive theoretical results and clear motivations. We believe the proposed method can be reproduced from the content of the paper, e.g., Algorithm 1. In addition, we have released the implementation of AEVA at https://github.com/JunfengGo/AEVA-Blackbox-Backdoor-Detection-main.

References

  • Chen et al. (2019a) Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In SafeAI@AAAI, 2019a. URL http://ceur-ws.org/Vol-2301/paper_18.pdf.
  • Chen et al. (2019b) Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepinspect: A black-box trojan detection and mitigation framework for deep neural networks. In IJCAI, pp.  4658–4664, 2019b.
  • Chen et al. (2020a) Jianbo Chen, Michael I Jordan, and Martin J Wainwright. Hopskipjumpattack: A query-efficient decision-based attack. In 2020 ieee symposium on security and privacy (sp), pp. 1277–1294. IEEE, 2020a.
  • Chen et al. (2020b) Steven Chen, Nicholas Carlini, and David Wagner. Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, pp.  30–39, 2020b.
  • Chen et al. (2017) Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. CoRR, abs/1712.05526, 2017. URL http://arxiv.org/abs/1712.05526.
  • Chen et al. (2019c) Yao Chen, Jiong He, Xiaofan Zhang, Cong Hao, and Deming Chen. Cloud-dnn: An open framework for mapping dnn models to cloud fpgas. In Proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays, pp.  73–82, 2019c.
  • Cheng et al. (2020) Minhao Cheng, Simranjit Singh, Patrick H. Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh. Sign-opt: A query-efficient hard-label adversarial attack. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SklTQCNtvS.
  • Dong et al. (2021) Yinpeng Dong, Xiao Yang, Zhijie Deng, Tianyu Pang, Zihao Xiao, Hang Su, and Jun Zhu. Black-box detection of backdoor attacks with limited information and data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  16482–16491, October 2021.
  • Donoho (2006) D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. doi: 10.1109/TIT.2006.871582.
  • Fowers et al. (2018) Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, et al. A configurable cloud-scale dnn processor for real-time ai. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp.  1–14. IEEE, 2018.
  • Fu & Hu (2012) Michael C Fu and Jian-Qiang Hu. Conditional Monte Carlo: Gradient estimation and optimization applications, volume 392. Springer Science & Business Media, 2012.
  • Gomes & Guillou (2015) M Ivette Gomes and Armelle Guillou. Extreme value theory and statistics of univariate extremes: a review. International statistical review, 83(2):263–292, 2015.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Gu et al. (2019) Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, pp.  47230–47244, 2019. doi: 10.1109/ACCESS.2019.2909068.
  • Guo et al. (2020) Wenbo Guo, Lun Wang, Yan Xu, Xinyu Xing, Min Du, and Dawn Song. Towards inspecting and eliminating trojan backdoors in deep neural networks. In 2020 IEEE International Conference on Data Mining (ICDM), pp.  162–171, 2020. doi: 10.1109/ICDM50108.2020.00025.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  • Huang et al. (2022) Kunzhe Huang, Yiming Li, Baoyuan Wu, Zhan Qin, and Kui Ren. Backdoor defense via decoupling the training process. In ICLR, 2022.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
  • Ji et al. (2018) Yujie Ji, Xinyang Zhang, Shouling Ji, Xiapu Luo, and Ting Wang. Model-reuse attacks on deep learning systems. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp.  349–363, 2018.
  • Kolouri et al. (2020) Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  301–310, 2020.
  • Leadbetter (1991) M.R. Leadbetter. On a basis for peaks over threshold modeling. Statistics and Probability Letters, 12(4):357–362, 1991. ISSN 0167-7152. doi: https://doi.org/10.1016/0167-7152(91)90107-3. URL https://www.sciencedirect.com/science/article/pii/0167715291901073.
  • Li et al. (2020) Huichen Li, Xiaojun Xu, Xiaolu Zhang, Shuang Yang, and Bo Li. Qeba: Query-efficient boundary-based blackbox attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Li et al. (2021a) Shaofeng Li, Minhui Xue, Benjamin Zi Hao Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Transactions on Dependable and Secure Computing, 18(5):2088–2105, 2021a. doi: 10.1109/TDSC.2020.3021407.
  • Li et al. (2022) Yiming Li, Haoxiang Zhong, Xingjun Ma, Yong Jiang, and Shu-Tao Xia. Few-shot backdoor attacks on visual object tracking. In ICLR, 2022.
  • Li et al. (2021b) Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In ICCV, 2021b.
  • Liu et al. (2018) Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018.
  • Liu et al. (2019) Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. Abs: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp.  1265–1282, 2019.
  • Liu et al. (2020) Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In European Conference on Computer Vision, pp.  182–199. Springer, 2020.
  • Masi et al. (2018) Iacopo Masi, Yue Wu, Tal Hassner, and Prem Natarajan. Deep face recognition: A survey. In 2018 31st SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), pp.  471–478. IEEE, 2018.
  • Okuyama et al. (2018) Takafumi Okuyama, Tad Gonsalves, and Jaychand Upadhay. Autonomous driving system based on deep q learnig. In 2018 International Conference on Intelligent Autonomous Systems (ICoIAS), pp.  201–205. IEEE, 2018.
  • Ramirez et al. (2013) Carlos Ramirez, Vladik Kreinovich, and Miguel Argaez. Why l1 is a good approximation to l0: A geometric explanation. 7:203–207, 01 2013.
  • Saha et al. (2020) Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger backdoor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  11957–11965, 2020.
  • Singh et al. (2020) Shrutika Singh, Harshita Arya, and P Arun Kumar. Voice assistant for ubuntu implementation using deep neural network. In Advanced Computing Technologies and Applications, pp. 11–20. Springer, 2020.
  • Szegedy et al. (2013) Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. 2013.
  • Tran et al. (2018) Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/280cf18baf4311c92aa5a042336587d3-Paper.pdf.
  • Turner et al. (2019) Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Label-consistent backdoor attacks. arXiv preprint arXiv:1912.02771, 2019.
  • Wang et al. (2019) Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723. IEEE, 2019.
  • Wang et al. (2020) Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical detection of trojan neural networks: Data-limited and data-free cases. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Wierstra et al. (2014) Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
  • Yao et al. (2019) Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, pp.  2041–2055, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367479. doi: 10.1145/3319535.3354209. URL https://doi.org/10.1145/3319535.3354209.
  • Zhai et al. (2021) Tongqing Zhai, Yiming Li, Ziqi Zhang, Baoyuan Wu, Yong Jiang, and Shu-Tao Xia. Backdoor attack against speaker verification. In ICASSP, 2021.
Appendix

Appendix A Other Backdoor Attacks

A.1 Dense Watermark Backdoor Attacks.

In short, we argue that a low-magnitude dense backdoor attack (watermark) is not a practical attack method and is of less concern compared to the threat model in our paper.

First, the low-magnitude attack is a special case of our formulation where the mask m takes continuous values (here we use a to represent transparency). The low-magnitude problem can be formulated as:

\min_{a,\Delta}\sum_{i}\ell(\phi((1-a)x_{i}+a\Delta,y_{t}),y_{t})+\lambda|a|

This formulation does not involve discrete variables, so its black-box optimization is an easier task than our problem (where m is discrete).

Second, we show theoretically that, in this scenario, every normal example is near the decision boundary of the network. By the Taylor expansion and the fact that a is a small value,

\phi((1-a)x_{i}+a\Delta)=\phi(x_{i}+a(\Delta-x_{i}))\approx\phi(x_{i})+a(\Delta-x_{i})^{\intercal}\nabla_{x}\phi.

We know that from x_{i} to (1-a)x_{i}+a\Delta, the function output label is changed. So every example x_{i} is at the decision boundary of the function \phi.

So the infected model is actually very sensitive to random noise and becomes ineffective for attack purposes.

To empirically support this claim, we conduct experiments on CIFAR-10 using ResNet-44. The results are averaged over ten different target labels. We randomly select watermark triggers that occupy the entire image with a 0.1 transparency rate (a). We then add uniform noise with a low magnitude of 0.025 to each normal sample. The proportion of normal samples misclassified as the infected label is shown below (a sketch of this noise-sensitivity check follows the list). We find that such dense low-magnitude watermark triggers cause normal samples to be misclassified as the target label y_{t} by adding small random noise. Such observations make it easy to identify models infected with a dense, low-magnitude watermark trigger.

  • Using a global watermark trigger, 67.8\% of the normal samples are misclassified to other labels (the infected label).

  • Using a 16x16 watermark trigger, 6.42\% of the normal samples are misclassified to other labels (the infected label).

  • Using a 4x4 watermark trigger leads to 1.41\% of the normal samples being misclassified.

  • Using a 2x2 watermark trigger leads to 0.24\% of the normal samples being misclassified.

  • Training with normal examples results in 0.2\% of the normal samples being misclassified.
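A sketch of this noise-sensitivity check is given below; the predict function is a placeholder for the model's hard-label output, images are assumed to lie in [0, 1], and the 0.025 uniform-noise magnitude follows the setting above.

```python
import numpy as np

def noise_sensitivity(predict, images, infected_label, magnitude=0.025):
    """Fraction of clean images whose prediction flips to the infected
    label after adding small uniform noise (placeholder predict oracle)."""
    flipped = 0
    for x in images:
        noise = np.random.uniform(-magnitude, magnitude, size=x.shape)
        x_noisy = np.clip(x + noise, 0.0, 1.0)
        if predict(x_noisy) == infected_label:
            flipped += 1
    return flipped / len(images)
```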

A.2 Feature Space Backdoor

A feature-space backdoor attack employs a generative model to transfer the style of a normal image into a trigger. However, the style transfer leads to significant visual changes, which makes the attack less stealthy. For physical attacks such as surveillance cameras, this approach is hard to implement because it requires changing the background as well. While our work mostly focuses on input-space triggers, we believe the latent-space problem is beyond our threat model and the problem we focus on, and could be an interesting direction for future work (for example, one could use an off-the-shelf style transfer model to evaluate the model's sensitivity to different style effects).

Appendix B Proofs

B.1 The proof of Lemma 1

Lemma 1.

Given a continuously differentiable function f defined on \mathbb{R}^{n} and vectors \mathbf{x},\mathbf{h}\in\mathbb{R}^{n}, for any real value M\geq|\partial f/\partial h_{i}|\,\forall 1\leq i\leq n, we have

|f(\mathbf{x}+\mathbf{h})-f(\mathbf{x})|\leq M\|\mathbf{h}\|_{1}. \quad (8)
Proof of Lemma 1:

The first order Taylor expansion gives

f(\mathbf{x}+\mathbf{h})=f(\mathbf{x})+R(\mathbf{h}) \quad (9)

where R(\mathbf{h}) is the Lagrange remainder, i.e.,

R(\mathbf{h})=\sum_{i=1}^{n}\frac{\partial f(\mathbf{x}+c\mathbf{h})}{\partial x_{i}}\cdot h_{i},\quad\text{for some }c\in(0,1). \quad (10)

Let M\geq|\partial f(\mathbf{x})/\partial x_{i}|\ \forall\mathbf{x}; then we have

|R(\mathbf{h})|\leq M\sum_{i=1}^{n}|h_{i}|=M\|\mathbf{h}\|_{1}. \quad (11)

So, we have

|f(\mathbf{x}+\mathbf{h})-f(\mathbf{x})|\leq M\|\mathbf{h}\|_{1}. \quad (12)

\square

B.2 The proof of Lemma 2 – a theory on the linear case

We perform a theoretical study of linear models to shed light on our empirical finding. Here, we restate Lemma 2 below for convenience.

Lemma 2.

Given a linear model parameterized by \theta optimized on N_{o} original training examples and N_{b} backdoored training examples with the objective in Eq. 2 and a mean-squared-error loss, the adversarial solution \hat{\mu} optimized using Eq. 6 will be dominated by the input dimensions corresponding to the backdoor mask m when N_{b} is large, i.e.,

\lim_{N_{b}\to\infty}\frac{\|(1-m)\odot\hat{\mu}\|_{1}}{\|\hat{\mu}\|_{1}}=0. \quad (13)
Proof of Lemma 2:

To ease the reading of the notation, we use (\cdot)_{a\times b} to represent a matrix with a rows and b columns. Assume \phi(x;\theta)=\theta^{\intercal}x is a linear model whose output is a scalar, and the loss function is the squared loss, i.e., \ell(x,y)=(x-y)^{2}. Then Eq. 2 becomes a linear regression problem and can be represented in the matrix form of least squares, i.e.,

F=\|X\theta-Y\|^{2}+\|\hat{X}\theta-\hat{Y}\|^{2} \quad (14)

where \hat{x}_{i}=(1-m)\odot x_{i}^{\prime}+m\odot\Delta is a backdoor example converted from a normal example x_{i}^{\prime} and \hat{y}_{i}=y_{t}. X is the matrix composed of the row vectors x_{i} and Y is the vector of the corresponding targets y_{i}. The solution of this least-squares problem can be found by setting the gradient to zero, such that

\hat{\theta}=(X^{\intercal}X+\hat{X}^{\intercal}\hat{X})^{-1}(X^{\intercal}Y+\hat{X}^{\intercal}\hat{Y}). \quad (15)

It is known that normal training without backdoors leads to the pseudo-inverse solution \theta=(X^{\intercal}X)^{-1}X^{\intercal}Y. Without loss of generality, we assume the backdoor trigger occupies the last q dimensions and the first p dimensions are normal values, i.e., q=d-p where d is the total number of dimensions. So, we can rewrite the input as X=[X_{1:p},X_{p+1:d}] and the mask as m=[\mathbf{0}_{1\times p},\mathbf{1}_{1\times(d-p)}]. Let \bm{\delta} be the sub-vector of \Delta such that \bm{\delta}=\Delta_{p+1:d}, a q-dimensional row vector. Then we have \hat{X}=[X_{1:p},\mathbf{1}_{N_{b}\times 1}\bm{\delta}] where N_{b} is the total number of backdoor examples. Then,

\hat{X}^{\intercal}\hat{X}=\begin{bmatrix}{X_{1:p}}^{\intercal}X_{1:p}&0\\ 0&N_{b}\bm{\delta}^{\intercal}\bm{\delta}\end{bmatrix}. \quad (16)

Similarly, we find \hat{Y}=\mathbf{1}y_{t} and assume zero mean of the input, i.e., \sum x_{i}=0; then we have

\hat{X}^{\intercal}\hat{Y}=\begin{bmatrix}\mathbf{0}_{p\times 1}\\ N_{b}y_{t}\bm{\delta}^{\intercal}\end{bmatrix}. \quad (17)

While N_{b} becomes larger, both \hat{X}^{\intercal}\hat{X} and \hat{X}^{\intercal}\hat{Y} increase significantly. When N_{b}\rightarrow\infty, Eq. 15 becomes

\hat{\theta}\rightarrow(\hat{X}^{\intercal}\hat{X})^{-1}(\hat{X}^{\intercal}\hat{Y})=\begin{bmatrix}\mathbf{0}_{p\times 1}\\ y_{t}(\bm{\delta}^{\intercal}\bm{\delta})^{-1}\bm{\delta}^{\intercal}\end{bmatrix}. \quad (18)

In other words, we know the normalized values in the non-masked area of \theta approach 0, such that

\lim_{N_{b}\to\infty}\frac{\|(1-m)\odot\hat{\theta}\|_{1}}{\|\hat{\theta}\|_{1}}=0. \quad (19)

Now, let us look at the adversarial objective in Eq. 6 under the squared loss (omitting the \ell^{1} term), which is equivalent to

G=\|\theta^{\intercal}(x+\mu)-y_{t}\|^{2}, \quad (20)

the gradient of which is

\nabla_{\mu}G=2(\theta^{\intercal}(x+\mu)-y_{t})\theta. \quad (21)

Since 2(\theta^{\intercal}(x+\mu)-y_{t}) is a scalar and each gradient update takes the form \mu=\mu+\lambda\nabla_{\mu}G, we know that no matter how many gradient update steps are applied, \mu, initialized at 0, always moves in the exact direction of \theta, i.e., \hat{\mu}\parallel\theta. In this case, \hat{\mu} will also be dominated by the last q dimensions, i.e., the dimensions corresponding to the mask m. In other words, we have

\lim_{N_{b}\to\infty}\frac{\|(1-m)\odot\hat{\mu}\|_{1}}{\|\hat{\mu}\|_{1}}=0. \quad (22)

\square

B.3 Proof of Lemma 3 – An analysis for Kernel Regression

Lemma 3.

Suppose the training dataset consists of N_{o} original examples and N_{b} backdoor examples, i.i.d. sampled from a uniform distribution and belonging to k classes. Each class contains equally N_{o}/k normal examples. Let \phi be a multivariate kernel regression with the objective in Eq. 2 and an RBF kernel K(\cdot,\cdot). Then the adversarial solution \hat{\mu} to Eq. 6 under the cross-entropy loss \ell(\hat{y},y)=-\sum_{i}y_{i}\log\hat{y}_{i} (where y\in\{0,1\}^{k} is the one-hot label vector) satisfies

\lim_{N_{b}\to\infty}\mathbb{E}[(1-m)\odot\hat{\mu}]=\mathbf{0}. \quad (23)
Proof of Lemma 3:

The output of \phi is a k-dimensional vector. Let \phi_{t}(\cdot)\in\mathbb{R} denote the output corresponding to the target class t. We know the kernel regression solution is

ϕt()=i=1NoK(,xi)yi,t+i=1NbK(,x^i)y^i,ti=1NoK(,xi)+i=1NbK(,x^i)\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{o}}K(\cdot,x_{i})y_{i,t}+\sum_{i=1}^{N_{b}}K(\cdot,\hat{x}_{i})\hat{y}_{i,t}}{\sum_{i=1}^{N_{o}}K(\cdot,x_{i})+\sum_{i=1}^{N_{b}}K(\cdot,\hat{x}_{i})} (24)

where x^i=(1m)xi+mΔ\hat{x}_{i}=(1-m)\odot x_{i}+m\odot\Delta is the backdoored example and y^i\hat{y}_{i} is the corresponding one-hot label (so we always have y^i,t=1\hat{y}_{i,t}=1). Because the examples are evenly distributed, there are N/kN/k examples labeled positive for class tt. Without loss of generality, we assume yj,t=1y_{j,t}=1 when j[1,No/k]j\in[1,N_{o}/k] and yj,t=0y_{j,t}=0 otherwise. Then the regression solution becomes

ϕt()=i=1No/kK(,xi)+i=1NbK(,x^i)i=1NoK(,xi)+i=1NbK(,x^i)\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{o}/k}K(\cdot,x_{i})+\sum_{i=1}^{N_{b}}K(\cdot,\hat{x}_{i})}{\sum_{i=1}^{N_{o}}K(\cdot,x_{i})+\sum_{i=1}^{N_{b}}K(\cdot,\hat{x}_{i})} (25)

Please note the regression above is derived with a mean squared loss. In the adversarial analysis Eq. 6, we can use any alternative loss function. In this Lemma, we assume the loss in the adversarial analysis is cross-entropy, which is a common choice. So, we have

(ϕ(x),yt)\displaystyle\ell(\phi(x),y_{t}) =logϕt(x)\displaystyle=-\log\phi_{t}(x) (26)
=log(St+S^)+log(S+S^)\displaystyle=-\log(S_{t}+\hat{S})+\log(S+\hat{S}) (27)

where Si()=yj=iK(,xj)S_{i}(\cdot)=\sum_{y_{j}=i}K(\cdot,x_{j}) and S=S1+S2++SkS=S_{1}+S_{2}+\ldots+S_{k} and S^=i=1nbK(,x^i)\hat{S}=\sum_{i=1}^{n_{b}}K(\cdot,\hat{x}_{i}).

The derivative of the loss w.r.t. x becomes

\frac{\partial\ell(\phi(x))}{\partial x}=-\frac{1}{S_{t}+\hat{S}}\frac{\partial(S_{t}+\hat{S})}{\partial x}+\frac{1}{S+\hat{S}}\frac{\partial(S+\hat{S})}{\partial x}. (28)

Using gradient descent, \hat{\mu} moves along the negative direction of the loss gradient, i.e.,

\Delta\mu=\frac{1}{S_{t}+\hat{S}}\frac{\partial(S_{t}+\hat{S})}{\partial x}-\frac{1}{S+\hat{S}}\frac{\partial(S+\hat{S})}{\partial x} (29)
=\frac{\partial\hat{S}}{\partial x}\left(\frac{1}{S_{t}+\hat{S}}-\frac{1}{S+\hat{S}}\right)+\frac{\partial S_{t}}{\partial x}\left(\frac{1}{S_{t}+\hat{S}}-\frac{1}{S+\hat{S}}\right)-\frac{1}{S+\hat{S}}\frac{\partial(S-S_{t})}{\partial x} (30)
=a\frac{\partial\hat{S}}{\partial x}+a\frac{\partial S_{t}}{\partial x}-b\frac{\partial(S-S_{t})}{\partial x}, (31)

where a,b are both positive scalars. When N_{b} becomes large, the first term dominates the gradient direction, i.e., \Delta\mu\propto\frac{\partial\hat{S}}{\partial x}.

Since the kernel K(x,x^{\prime})=\exp(-\gamma\|x-x^{\prime}\|^{2}) is RBF, its derivative is

\frac{\partial K(x,x^{\prime})}{\partial x}=-2\gamma K(x,x^{\prime})(x-x^{\prime}). (32)

Then we have

\frac{\partial\hat{S}}{\partial x}=\sum_{i=1}^{N_{b}}\frac{\partial K(x,\hat{x}_{i})}{\partial x} (33)
=\sum_{i=1}^{N_{b}}-2\gamma K(x,\hat{x}_{i})(x-\hat{x}_{i}). (34)

Without loss of generality, we assume the mask m=[\mathbf{0}_{1\times p},\mathbf{1}_{1\times(d-p)}] (the first p dimensions are 0s and the remaining are 1s). Then the first p dimensions of \hat{x}_{i} are the same as those of x_{i}, while the remaining dimensions equal those of \Delta.

Case 1: p+1\leq j\leq d.

The j-th dimension of \frac{\partial\hat{S}}{\partial x} can be written as

\left(\frac{\partial\hat{S}}{\partial\mu}\right)_{j}=\left(-2\gamma\sum_{i=1}^{N_{b}}K(z+\mu,\hat{x}_{i})\right)(z_{j}+\mu_{j}-\delta_{j}). (35)

So at convergence, \mu_{j}=\delta_{j}-z_{j} for the last q=d-p dimensions, and the expectation is \mathbb{E}[\mu_{j}]=\delta_{j}-\mathbb{E}[z_{j}].

Case 2: 1\leq j\leq p.

For the first p dimensions, we have

\left(\frac{\partial\hat{S}}{\partial\mu}\right)_{j}=-2\gamma\sum_{i=1}^{N_{b}}K(z+\mu,\hat{x}_{i})(z_{j}+\mu_{j}-x_{ij}) (36)
=-2\gamma\sum_{i=1}^{N_{b}}e^{-\gamma\|z+\mu-\hat{x}_{i}\|^{2}}(z_{j}+\mu_{j}-x_{ij}). (37)

Since \mu_{s}+z_{s}=\delta_{s}=\hat{x}_{is} for p+1\leq s\leq d (using the Case 1 result at convergence), we have

\left(\frac{\partial\hat{S}}{\partial\mu}\right)_{j}=-2\gamma\sum_{i=1}^{N_{b}}e^{-\gamma\sum_{1\leq v\leq p}(z_{v}+\mu_{v}-x_{iv})^{2}}(z_{j}+\mu_{j}-x_{ij}). (38)

Since N_{b} is large and the \hat{x}_{i} are i.i.d., the summation can be approximated by an integral, i.e.,

\left(\frac{\partial\hat{S}}{\partial\mu}\right)_{j}\approx-2\gamma\int_{x}e^{-\gamma\sum_{1\leq v\leq p}(z_{v}+\mu_{v}-x_{v})^{2}}(z_{j}+\mu_{j}-x_{j})dx. (39)

When \gamma is large, this integral is proportional to the expected offset from the center of a Gaussian distribution centered at z_{j}+\mu_{j}; by symmetry, the derivative is 0 for most locations of \mu. This is the common case, because a small \gamma would let distant examples significantly influence the function output, which is not desired in a classifier with high accuracy.

Nonetheless, in the rare case where \gamma is small, we can still locate the stationary point of the derivative. Since the weight e^{-\gamma\sum_{1\leq v\leq p}(z_{v}+\mu_{v}-x_{v})^{2}} is always positive, the integrand in Eq. 39 vanishes pointwise only if z_{j}+\mu_{j}-x_{j}=0, which cannot hold for all x; the derivative therefore vanishes only when the weighted offsets cancel. Using the symmetry of the Gaussian weight and the fact that x is i.i.d. uniform, we find that when z_{j}+\mu_{j}-\mathbb{E}[x_{j}]<0 the derivative is positive, and when z_{j}+\mu_{j}-\mathbb{E}[x_{j}]>0 the derivative is negative. Since we use gradient descent to find the optimal \hat{\mu}, we conclude that, at convergence, z_{j}+\mu_{j}=\mathbb{E}[x_{j}]. Then \mathbb{E}[\mu_{j}]=\mathbb{E}[x_{j}]-\mathbb{E}[z_{j}]=0 because z is sampled from the same distribution as x.

To summarize,

\mathbb{E}[\hat{\mu}_{j}]=\begin{cases}0,&1\leq j\leq p\\ \delta_{j}-\mathbb{E}[z_{j}],&p<j\leq d\end{cases} (40)

Then we know, as N_{b}\rightarrow\infty,

\mathbb{E}[(1-m)\odot\hat{\mu}]\rightarrow 0. (41)

\square
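As a numerical illustration of the stationary-point argument above (not part of the proof), the following sketch evaluates the dominant gradient term \partial\hat{S}/\partial\mu of Eqs. 33-39 on synthetic uniform data; the bandwidth gamma, the dimensions, and the sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
d, p, gamma, N_b = 10, 7, 0.5, 20000     # dims, clean dims, RBF bandwidth, backdoor examples
delta = rng.uniform(0, 1, size=d - p)    # trigger values on the last d - p dimensions

# Backdoored examples: i.i.d. uniform on the clean dims, trigger stamped on the last dims.
Xb = rng.uniform(0, 1, size=(N_b, d))
Xb[:, p:] = delta

def dS_hat(x):
    # Gradient of S_hat(x) = sum_i K(x, x_hat_i) for the RBF kernel (Eqs. 33-34), averaged over N_b.
    K = np.exp(-gamma * ((x - Xb) ** 2).sum(axis=1))
    return (-2 * gamma * K[:, None] * (x - Xb)).sum(axis=0) / N_b

z = rng.uniform(0, 1, size=d)            # a clean starting example
# Candidate stationary point: clean dims moved to E[x] = 0.5, trigger dims moved to delta (Eq. 40).
mu_star = np.concatenate([0.5 - z[:p], delta - z[p:]])
print("average gradient at mu*:", np.round(dS_hat(z + mu_star), 3))   # approximately zero everywhere

# Case 2 sign check on clean dimension 0: positive below E[x_0] = 0.5, negative above.
for offset in (-0.3, 0.3):
    x = z + mu_star
    x[0] = 0.5 + offset
    print("offset", offset, "-> sign of gradient:", np.sign(dS_hat(x)[0]))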

Appendix C Monte Carlo gradient estimation

Leveraging Monte Carlo based estimation to craft targeted adversarial perturbations for the targeted label y_{t} typically requires x and x_{t}, where x_{t} is a legitimate sample satisfying f(x_{t})=y_{t}. First, x is forced to approach the decision boundary of f(\cdot;\theta) for the target label y_{t} (\phi(x,y_{t})\to 0.5), which improves the efficiency and accuracy of the gradient estimation. We can make \phi(x,y_{t})\to 0.5 through the projection

x\leftarrow(1-\alpha)x+\alpha x_{t}, (42)

where \alpha\in(0,1) is the projection parameter. During each gradient estimation procedure, \alpha is set through a binary search (Chen et al., 2020a).

After that, we leverage Monte Carlo sampling to estimate \widetilde{\nabla_{x}\phi}(x,y_{t}):

\widetilde{\nabla_{x}\phi}(x,y_{t})=\frac{1}{N}\sum_{i=1}^{N}\mathcal{S}(x+\delta\mu_{i}^{\prime},y_{t})\mu_{i}^{\prime}, (43)

where \{\mu_{i}^{\prime}\}_{i=1}^{N} are perturbations i.i.d. sampled from a uniform distribution and \delta is a small positive value representing the magnitude of the perturbation. \mathcal{S} is an indicator function such that

\mathcal{S}(x,y)=\left\{\begin{array}{lr}1,&f(x)=y,\\ -1,&f(x)\neq y.\end{array}\right. (44)

To further reduce the variance induced by Monte Carlo sampling, we refine \widetilde{\nabla_{x}\phi}(x,y_{t}) via:

\widetilde{\nabla_{x}\phi}(x,y_{t})=\frac{1}{N}\sum_{i=1}^{N}\Big\{\mathcal{S}(x+\delta\mu_{i}^{\prime},y_{t})-\frac{1}{N}\sum_{j=1}^{N}\mathcal{S}(x+\delta\mu_{j}^{\prime},y_{t})\Big\}\mu_{i}^{\prime}. (45)

Measuring the similarity between \widetilde{\nabla_{x}\phi}(x,y_{t}) and \nabla_{x}\phi(x,y_{t}) by the cosine of their angle, previous work (Chen et al., 2020a) has proved that

\lim_{\delta\to 0}\cos\angle(\mathbb{E}[\widetilde{\nabla_{x}\phi}(x,y_{t})],\nabla_{x}\phi(x,y_{t}))=1, (46)

thus a small \delta (0.01) is selected to ensure the accuracy of the gradient estimation.

To respect the constraint on \|\mu\|_{1}, we further normalize the estimated gradient via:

\widetilde{\nabla_{x}\phi}(x,y_{t})\leftarrow\frac{\widetilde{\nabla_{x}\phi}(x,y_{t})}{\|\widetilde{\nabla_{x}\phi}(x,y_{t})\|_{1}}. (47)

We refer readers to (Chen et al., 2020a) for the complete algorithm for gradient estimation.
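To make the pipeline above concrete, the following is a minimal, self-contained sketch of the hard-label estimation on a toy linear classifier; the stand-in model f, its weight matrix W, the fixed sample count N, and the final cosine check against the known boundary normal are illustrative assumptions rather than the exact procedure of Chen et al. (2020a).

import numpy as np

rng = np.random.default_rng(0)
d, N, delta = 64, 500, 0.01      # input dims, Monte Carlo samples, perturbation magnitude

# Stand-in hard-label black box: only the argmax label is observable.
W = rng.normal(size=(d, 3))
def f(x):
    return int(np.argmax(x @ W))

def estimate_gradient(x, y_t):
    # Monte Carlo gradient estimate with variance reduction and L1 normalisation (Eqs. 43, 45, 47).
    u = rng.uniform(-1, 1, size=(N, d))                                     # i.i.d. perturbations
    s = np.array([1.0 if f(x + delta * ui) == y_t else -1.0 for ui in u])   # indicator S (Eq. 44)
    g = ((s - s.mean())[:, None] * u).mean(axis=0)                          # baseline-subtracted average
    return g / np.abs(g).sum()

# Pick a source input x (not of class y_t) and a legitimate target-class input x_t.
y_t = 0
x = rng.normal(size=d)
while f(x) == y_t:
    x = rng.normal(size=d)
x_t = rng.normal(size=d)
while f(x_t) != y_t:
    x_t = rng.normal(size=d)

# Projection of Eq. 42: binary search for the boundary crossing on the segment from x to x_t.
lo, hi = 0.0, 1.0
for _ in range(25):
    mid = (lo + hi) / 2
    if f((1 - mid) * x + mid * x_t) == y_t:
        hi = mid
    else:
        lo = mid
x_proj = (1 - hi) * x + hi * x_t

# Sanity check: the estimate should align with the true boundary normal of this linear model.
g = estimate_gradient(x_proj, y_t)
n = W[:, y_t] - W[:, f((1 - lo) * x + lo * x_t)]
print("cosine with true normal:", round(float(g @ n / (np.linalg.norm(g) * np.linalg.norm(n))), 3))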

Appendix D Algorithm for Computing Aggregated Global Adversarial Peak (GAP)

Algorithm 1 Aggregated Global Adversarial Peak (GAP)
1: Input: Targeted DNN f(\cdot;\theta); label to be analyzed y_{t}; legitimate input batches \{X_{i}\}_{i=1}^{n} evenly sampled across classes; the targeted input batch X_{t};
2: Output: Aggregated GAP value \mathcal{R}(y_{t});
3: Initialize \mathcal{R}(y_{t})=0
4: for i=1,\ldots,t-1,t+1,\ldots,n do
5:     for j=1,\ldots,\text{len}(X_{i}) do
6:         Solve \mu^{i}_{j}=\arg\min_{\mu}\ell(\phi(x_{ij}+\mu,y_{t};\theta),y_{t})+\lambda\|\mu\|_{1} using MC gradient estimation;
7:     end for
8:     Find the GAP value for the current label: \mu_{max}^{i}=\max_{j,u,v}\mu_{j,u,v}^{i};
9:     Aggregate GAP values over all labels: \mathcal{R}(y_{t})=\mu_{max}^{i}+\mathcal{R}(y_{t});
10: end for
11: Return: \mathcal{R}(y_{t})
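For reference, the aggregation performed by Algorithm 1 amounts to the short sketch below, in which the per-example perturbations \mu^{i}_{j} from line 6 are assumed to have been computed already (e.g., via the Monte Carlo estimation of Appendix C); the array shapes and stand-in values are purely illustrative.

import numpy as np

def aggregated_gap(perturbations_by_label):
    # perturbations_by_label: dict mapping each source label i != y_t to an array of shape
    # (len(X_i), H, W) holding the solved perturbations mu_j^i for that label's input batch.
    R = 0.0
    for mus in perturbations_by_label.values():
        R += mus.max()     # GAP for this source label: max over examples j and positions (u, v)
    return R               # aggregated GAP value R(y_t)

# Toy usage with random stand-in perturbation maps for three source labels of eight images each.
rng = np.random.default_rng(0)
fake_mus = {i: rng.gamma(2.0, 0.01, size=(8, 32, 32)) for i in range(3)}
print(aggregated_gap(fake_mus))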

Appendix E The detailed configurations for experimental datasets and triggers

Task # labels Input size # training images
CIFAR-10 10 32x32 50000
CIFAR-100 100 32x32 50000
TinyImageNet 200 64x64 1000000
Table 2: Detailed information about the dataset for each task.

The detailed information for each task is included in Table 2. The data augmentation used for building each model follows https://keras.io/zh/examples/cifar10_resnet/.

Appendix F The accuracy and attack success rate (ASR) for the evaluated models

The accuracy and ASR of the evaluated models for each task are included in Table 3.

Task           Infected Model Accuracy   Infected Model ASR   Normal Model Accuracy
CIFAR-10       ≥ 90.04%                  ≥ 97.7%              ≥ 92.31%
CIFAR-100      ≥ 69.17%                  ≥ 96.27%             ≥ 71.41%
TinyImageNet   ≥ 58.98%                  ≥ 97.22%             ≥ 60.11%
Table 3: Accuracy and ASR for the evaluated models for each task.

Appendix G The number of evaluated models for various tasks and attack approaches

For each attack and task, we build 60 infected models and 60 uninfected models. Notably, the uninfected models for different attack approaches are randomly selected and thus differ across attacks. Each set of uninfected and infected models is built evenly upon five architectures: ResNet-18, ResNet-44, ResNet-56, DenseNet-33, and DenseNet-58.

Appendix H Details of visualization process

Algorithm 2 Visualize \mu
1: Input: Targeted DNN f(\cdot;\theta); label to be analyzed y_{t}; legitimate input batches \{X_{i}\}_{i=1}^{n} evenly sampled across classes; the targeted input batch X_{t};
2: Initialize \mu
3: Output: Normalized \mu;
4: for i=1,\ldots,t-1,t+1,\ldots,n do
5:     for j=1,\ldots,\text{len}(X_{i}) do
6:         Solve \mu^{i}_{j}=\arg\min_{\mu}\ell(\phi(x_{ij}+\mu,y_{t};\theta),y_{t})+\lambda\|\mu\|_{1} using MC gradient estimation;
7:     end for
8:     Find the \mu^{i} that attains the GAP value for the current label: \mu^{i}=\operatorname*{arg\,max}_{\mu_{j}^{i}}\max_{u,v}\mu_{j,u,v}^{i};
9:     Aggregate \mu^{i} over all labels: \mu=\mu^{i}+\mu;
10: end for
11: Return: \mu/\|\mu\|_{1}

Appendix I Detailed ROC for various tasks and attack approaches

Figure 10: The Receiver Operating Characteristic (ROC) curves for AEVA on the CIFAR-10, CIFAR-100, and TinyImageNet tasks. (a) CIFAR-10; (b) CIFAR-100; (c) TinyImageNet.

Appendix J The comparison results between AEVA, NC, and DL-TND

Attack            Method                       AUROC   ACC
BadNets           NC (Wang et al., 2019)       0.976   95.0%
BadNets           DL-TND (Wang et al., 2020)   0.999   99.2%
BadNets           AEVA (Ours)                  0.930   91.5%
Watermark         NC (Wang et al., 2019)       0.983   95.8%
Watermark         DL-TND (Wang et al., 2020)   1       100%
Watermark         AEVA (Ours)                  0.968   94.4%
Label Consistent  NC (Wang et al., 2019)       0.961   94.2%
Label Consistent  DL-TND (Wang et al., 2020)   0.992   98.3%
Label Consistent  AEVA (Ours)                  0.968   94.4%
Invisible Attack  NC (Wang et al., 2019)       0.980   95.8%
Invisible Attack  DL-TND (Wang et al., 2020)   0.999   99.1%
Invisible Attack  AEVA (Ours)                  0.970   94.4%
Table 4: The two metrics for backdoor detection on the CIFAR-10 task using three backdoor detection methods: NC, DL-TND, and AEVA. Higher values in AUROC and ACC are better.
Attack            Method                       AUROC   ACC
BadNets           NC (Wang et al., 2019)       0.978   95.8%
BadNets           DL-TND (Wang et al., 2020)   0.999   99.1%
BadNets           AEVA (Ours)                  0.973   95.4%
Watermark         NC (Wang et al., 2019)       0.961   94.2%
Watermark         DL-TND (Wang et al., 2020)   1       100%
Watermark         AEVA (Ours)                  0.992   98.8%
Label Consistent  NC (Wang et al., 2019)       0.964   94.2%
Label Consistent  DL-TND (Wang et al., 2020)   0.999   99.1%
Label Consistent  AEVA (Ours)                  0.981   96.8%
Invisible Attack  NC (Wang et al., 2019)       0.989   97.5%
Invisible Attack  DL-TND (Wang et al., 2020)   1       100%
Invisible Attack  AEVA (Ours)                  0.984   97.1%
Table 5: The two metrics for backdoor detection on the CIFAR-100 task using three backdoor detection methods: NC, DL-TND, and AEVA. Higher values in AUROC and ACC are better.
Attack            Method                       AUROC   ACC
BadNets           NC (Wang et al., 2019)       0.956   91.6%
BadNets           DL-TND (Wang et al., 2020)   0.999   99.1%
BadNets           AEVA (Ours)                  0.988   97.9%
Watermark         NC (Wang et al., 2019)       0.972   94.2%
Watermark         DL-TND (Wang et al., 2020)   1       100%
Watermark         AEVA (Ours)                  0.999   99.6%
Label Consistent  NC (Wang et al., 2019)       0.983   96.7%
Label Consistent  DL-TND (Wang et al., 2020)   0.999   99.1%
Label Consistent  AEVA (Ours)                  0.983   96.7%
Invisible Attack  NC (Wang et al., 2019)       0.982   96.7%
Invisible Attack  DL-TND (Wang et al., 2020)   0.999   99.1%
Invisible Attack  AEVA (Ours)                  0.986   97.9%
Table 6: The two metrics for backdoor detection on the TinyImageNet task using three backdoor detection methods: NC, DL-TND, and AEVA. Higher values in AUROC and ACC are better. Each approach is evaluated using 60 infected and 60 benign models.

Appendix K Experiments for different triggers

K.1 Evaluation on dynamic and sparse but not compact triggers

We evaluate AEVA on the CIFAR-10 task under the dynamic, non-compact trigger scenario. We randomly generate three non-compact triggers (shown in Figure 11) and perform backdoor attacks against a randomly selected label using these dynamic triggers. We test AEVA using 10 different DNN models, evenly built upon ResNet-18, ResNet-44, ResNet-56, DenseNet-33, and DenseNet-58.

Figure 11: Dynamic triggers. (a) Dynamic trigger I; (b) Dynamic trigger II; (c) Dynamic trigger III.

AEVA successfully identifies all models embedded with dynamic backdoors and the corresponding infected labels. The detailed Aggregated Global Adversarial Peak (AGAP) values and the Anomaly Index given by the MAD outlier detection for infected and uninfected labels are shown in Fig. 12. We find that AEVA flags all backdoored models as infected with an Anomaly Index ≥ 4. Such results demonstrate that AEVA is resilient to dynamic and sparse-but-not-compact triggers.

Figure 12: AGAP values (a) and Anomaly Index (b) for infected and uninfected labels.
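For completeness, the following minimal sketch shows how a MAD-based Anomaly Index of the kind used above can be computed from the per-label AGAP values; it uses the standard median-absolute-deviation formulation with the 1.4826 consistency constant (as in Neural Cleanse), so the exact constants and threshold in AEVA's implementation may differ.

import numpy as np

def anomaly_index(agap_values):
    # MAD-based anomaly index over the per-label aggregated GAP values
    # (median absolute deviation with the usual 1.4826 consistency constant).
    agap_values = np.asarray(agap_values, dtype=float)
    med = np.median(agap_values)
    mad = 1.4826 * np.median(np.abs(agap_values - med))
    return np.abs(agap_values - med) / mad

# Toy usage: nine labels with similar AGAP values and one label with a large peak.
agaps = [0.11, 0.10, 0.12, 0.09, 0.13, 0.11, 0.10, 0.12, 0.11, 0.55]
idx = anomaly_index(agaps)
print(np.round(idx, 1))
print("flagged labels:", np.where(idx >= 4)[0])   # e.g., flag labels with index >= 4, as above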

We also evaluate AEVA on three triggers with different shapes, shown in Fig. 13. For each trigger, we build 30 models on TinyImageNet. Each set of models is built evenly upon the ResNet-18, ResNet-44, ResNet-56, DenseNet-33, and DenseNet-58 architectures. The results are shown in Table 7.

Attack Approach   ACC
Trigger I         93.3%
Trigger II        93.3%
Trigger III       90.0%
Table 7: Results for the triggers with different shapes.
Figure 13: Triggers with different shapes. (a) Trigger I; (b) Trigger II; (c) Trigger III.

Appendix L Results for the impact of the number of images per label

We further explore how many samples per label are required for AEVA to be effective. We test AEVA on TinyImageNet with experimental configurations consistent with Sec. 4.1 (i.e., 240 infected and 240 benign models). Our results are illustrated in Fig. 14, which plots the metrics against the number of samples per class. We find that our approach requires a relatively large number of samples per class (32) to achieve optimal performance.

Figure 14: The impact of the number of samples per class.

Appendix M The impact of the number of labels with available images

Since our approach aims to detect backdoors in the black-box setting, a realistic scenario is that, while determining whether a specific label is infected, clean data from all other labels may be unavailable to the defender. In this section, we therefore investigate the impact of the number of labels for which clean data is available. We test our approach on TinyImageNet using 60 models infected with BadNets (4×4 squares), following the configurations of Sec. 4.2. We vary the number of uninfected labels with available samples within [1, 200], randomly sampled from the 200 labels of the TinyImageNet dataset. Figure 15 illustrates our results by plotting the metrics against the number of available labels. We observe that even with eight available labels, our approach can still detect backdoors in most cases; i.e., with merely 4% of the labels available, our approach remains viable. This study demonstrates our approach's practicality: it works in the black-box setting with relatively little available data. Notably, with fewer labels AEVA still always correctly tags the uninfected models, because fewer labels yield lower anomaly indexes, which prevents uninfected models from being falsely flagged.

Figure 15: The impact of the number of labels for which clean samples are available.

Appendix N Multiple triggers within a single infected label

We also investigate the impact of multiple triggers targeting a single infected label. We choose different 4×4 squares located at different places as triggers, implemented following BadNets, and randomly select a label as the infected label. We build 60 infected models with different architectures (i.e., ResNet-18, ResNet-44, ResNet-56, DenseNet-33, DenseNet-58). The results for TinyImageNet are shown in Table 8. Notably, for TinyImageNet, injecting too many triggers (≥ 4) into a single label causes the target model's accuracy to drop (by ≥ 3.7%), which is inconsistent with the threat model for backdoor attacks (Gu et al., 2019; Chen et al., 2017; Liu et al., 2018).

As seen in Table 8, our approach remains effective when two 4×4 triggers are injected for a single infected label. However, when more than two triggers are injected, our approach becomes less effective. This is likely because multiple triggers weaken the singularity property of the adversarial perturbations. To address this issue, we instead select the sum of the largest five points as \mu_{max} (a minimal sketch of this aggregation is given after Table 9). The results are shown in Table 9. By using more points to compute \mu_{max}, AEVA remains effective under the multiple-trigger, single-label scenario, with no impact on the detection accuracy for uninfected models.

Attack Approach            ACC
Two injected triggers      81.7%
Three injected triggers    48.3%
Table 8: Results for multiple triggers within a single infected label, using the single-peak \mu_{max}.

Attack Approach            ACC
Two injected triggers      93.3%
Three injected triggers    88.3%
Table 9: Results for multiple triggers within a single infected label, using the sum of the five largest points as \mu_{max}.
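A minimal sketch of the top-five aggregation used for Table 9 is given below; the helper name and the random stand-in perturbation map are illustrative assumptions.

import numpy as np

def gap_top_k(mu, k=5):
    # GAP variant for multi-trigger labels: sum of the k largest values of the
    # adversarial perturbation map, instead of the single maximum.
    return np.sort(np.asarray(mu).ravel())[-k:].sum()

# Toy usage on a random stand-in 32x32 perturbation map.
rng = np.random.default_rng(0)
mu = rng.gamma(2.0, 0.01, size=(32, 32))
print(gap_top_k(mu))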

Appendix O Potential adaptive backdoor attacks

We here consider two potential adaptive backdoor attacks that could bypass AEVA.

O.1 Attacks with multiple target labels

Since AEVA is sensitive to the number of infected labels, the attacker can infect multiple labels with different backdoor triggers. Indeed, infecting many labels (i.e., ≥ 30% of the labels) makes the anomaly indexes produced by the MAD detector drop significantly. However, such an attack can only be implemented successfully on small datasets with few labels (e.g., CIFAR-10). As reported by Wang et al. (2019), the state-of-the-art model (DeepID) for the YouTube Face dataset cannot maintain the average attack success rate and the model accuracy at the same time when more than 15.6% of the labels are infected. In other words, when many labels are infected, the infected model's accuracy is likely to degrade if the attacker wants to keep the attack success rate high, so infecting too many labels reduces the stealthiness of the attack. Our experiments on TinyImageNet reveal that when over 14 labels are infected with different triggers, the model accuracy (ResNet-44 and DenseNet-58) decreases to ≤ 57.61% (around 3% lower than a normal model) while preserving the attack effectiveness (ASR ≥ 97.12%) for each infected label.

O.2 Attack with large triggers

Another potential adaptive backdoor attack is for the attacker to use a dense backdoor trigger, which would alleviate the singularity phenomenon. As we discuss in Section 4.3, a dense backdoor trigger appears visually distinguishable to humans and is therefore less stealthy, as also reported by Neural Cleanse (Wang et al., 2019). One interesting exception is the watermark attack. However, we empirically show that such a dense watermark attack (Chen et al., 2017) makes infected DNNs non-robust, rendering their inputs sensitive to random noise. The details are included in Appendix A.