
AED-PADA: Improving Generalizability of Adversarial Example Detection via Principal Adversarial Domain Adaptation

Heqi Peng, Yunhong Wang, Ruijie Yang, Beichen Li (Beihang University, Beijing, China); Rui Wang (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China); and Yuanfang Guo (Beihang University, Beijing, China)
Abstract.

Adversarial example detection, which can be conveniently applied in many scenarios, is important in the area of adversarial defense. Unfortunately, existing detection methods suffer from poor generalization performance, because their training process usually relies on the examples generated from a single known adversarial attack and there exists a large discrepancy between the training and unseen testing adversarial examples. To address this issue, we propose a novel method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA). Specifically, our approach identifies the Principal Adversarial Domains (PADs), i.e., a combination of features of the adversarial examples generated by different attacks, which covers a large portion of the entire adversarial feature space. Subsequently, we are the first to exploit Multi-source Unsupervised Domain Adaptation in adversarial example detection, with PADs as the source domains. Experimental results demonstrate the superior generalization ability of our proposed AED-PADA. Notably, this superiority is especially pronounced in the challenging scenario where a minimal magnitude constraint is imposed on the perturbations.

Adversarial defense, adversarial example detection, generalization ability.
copyright: acmlicensed; journal: TOMM; journal year: 2024; doi: 10.1145/3706061; isbn: 978-1-4503-XXXX-X/18/06; ccs: Security and privacy, Social aspects of security and privacy; ccs: Computing methodologies, Computer vision

1. Introduction

Recently, Deep Neural Networks (DNNs) have been playing prominent roles in many applications. Unfortunately, considerable studies have demonstrated that DNNs can be easily deceived if certain imperceptible perturbations are introduced to their inputs (SzegedyZSBEGF13, ; NguyenYC15, ; shi2021, ; SSA, ; ILA-DA, ; a2sc, ). These perturbed inputs, also known as adversarial examples, force DNNs to produce erroneous decisions and have become a significant security concern in safety-sensitive scenarios, such as autonomous driving (attack_driving, ) and medical diagnosis (attack_medical, ).

Nowadays, adversarial training has been proven to be an effective adversarial defense strategy (FGSM, ; AD_tripletloss, ; XieTGWYL20, ; adversarial_training_tomm, ). However, it requires sufficient knowledge about the classification models and incurs substantial computational costs to retrain them. In contrast, adversarial example detection, a.k.a. adversarial detection, defends against adversarial attacks by distinguishing whether the inputs are benign or manipulated. This type of method can be efficiently deployed to defend many applications without requiring extra knowledge about the core model.

Generalization ability is vital for adversarial detection methods in real-world scenarios, because these methods inevitably encounter unseen attacks and are expected to perform consistently against them. Current detection techniques (lid, ; md, ; steg, ; SID, ; txt_advdetection, ) typically achieve considerable generalization ability over a few conventional attacks, such as BIM (BIM, ), PGD (pgd, ), and C&W (CW, ). However, we empirically observe that these methods exhibit instability and inadequate performance against recent attacks, such as SSA (SSA, ), Jitter (jitter, ) and ILA-DA (ILA-DA, ). These methods tend to give unsatisfactory generalization performance, because their training process usually relies on a single known adversarial attack and they have not been developed from the perspective of boosting generalization ability.

In this paper, we provide a new perspective to further analyze the generalization ability of adversarial detection methods. Clearly, the threat model of all different adversarial attacks should be identical to ensure fair analysis and comparison. Thus, the features extracted by the model from different attacks are of the same dimension and lie within a shared feature space. Due to the variations in configurations such as attack objectives, parameters, and loss functions, the features of the examples generated by each attack form a distinct domain, named Adversarial Domain (AD), as depicted in Fig. 1(a). Formally, the Adversarial Domain (AD) of a particular adversarial attack is defined as the cumulative representation of all the adversarial examples generated by that attack. Existing detection methods typically select a random single attack to generate the training samples, i.e., they only select one AD as the source domain, as shown in Fig. 1(b). Apparently, there are few intersections between the source and unseen target domains, which usually leads to poor generalization performance. Additionally, the randomness inherent in the selection of the source domains induces considerable fluctuations in generalization performance.

Refer to caption
Figure 1. Schematic illustration of our proposed work. (a) represents the entire adversarial feature space as an instance, which contains 9 adversarial domains $\{A_{1},A_{2},\ldots,A_{9}\}$. (b) represents the mechanism of existing detection methods, which usually perform the training via a single source domain to detect the examples in an unseen adversarial domain, e.g., $T$ ($T=A_{9}$). (c) is a straightforward solution to improve generalization ability via randomly selecting multiple source domains. (d) presents the intuition behind our work. We construct PADs, which possess a larger coverage of the entire feature space, to create more potential overlaps with the target domain. This strategy is designed to significantly enhance the detection generalization ability.

A straightforward solution to the above issues, as shown in Fig. 1(c), is to randomly select multiple ADs as the source domains. This strategy tends to create a larger overlap with the target domain(s) and thereby improves the generalization performance of adversarial detection. Nonetheless, this strategy also introduces uncertainty, and selecting similar ADs incurs additional training costs without sufficient performance gains.

To further address the aforementioned problems, we propose a novel detection method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA). As shown in Fig. 1(d), by selecting Principal Adversarial Domains (PADs) as the source domains, which significantly enlarges the coverage of the entire adversarial feature space and creates a larger overlap with the target domain, our method offers superior generalization performance.

Specifically, AED-PADA contains two stages, i.e., Principal Adversarial Domains Identification (PADI) and Principal Adversarial Domain Adaptation (PADA). In the stage of PADI, since the discrepancies between the adversarial examples from various adversarial attacks differ substantially from those between the classes of ordinary classification tasks, we exploit adversarial supervised contrastive learning (Adv-SCL) to construct distinguishable ADs. Then, the selection of the most representative ADs must meet two key criteria. Firstly, there should be a clear distinction between candidate ADs to avoid the redundancy caused by selecting similar ADs. Secondly, the combination of the candidate ADs should cover as much of the entire feature space as possible. To select the most representative ADs, we propose a Coverage of Entire Feature Space (CEFS) metric. With our CEFS metric, the formed PADs possess broad coverage of the entire feature space, and thus effectively improve the likelihood of capturing the location of the unseen target AD(s).

In the stage of PADA, we are the first to exploit the mechanism of Multi-source Unsupervised Domain Adaptation (MUDA) to effectively utilize the rich knowledge acquired by PADs, so as to detect the unseen adversarial examples in the target domain. The framework of PADA is compatible with various existing MUDA methods. Since typical MUDAs only focus on extracting semantic features from the spatial domain, we propose an adversarial feature enhancement module to extract features from both the spatial and frequency domains, constructing a more comprehensive representation of adversarial examples.

Our major contributions can be summarized as follows.

  • We propose a novel adversarial example detection method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation, to significantly improve the generalization performance of adversarial detection.

  • We propose Principal Adversarial Domains Identification to identify the PADs, which possess a large coverage of the entire adversarial example feature space, with the help of the constructed AD clustering and proposed CEFS metric.

  • We propose Principal Adversarial Domain Adaptation for detecting adversarial examples, by exploiting adversarial feature enhancement based Multi-source Unsupervised Domain Adaptation (MUDA), which is compatible with various existing MUDA methods. To the best of our knowledge, this is the first work to exploit MUDA for adversarial example detection.

2. Related Work

2.1. Adversarial Example Detection

The majority of existing adversarial example detection methods rely on statistical features (HendrycksG17, ; magnet, ; FarAwayDis1, ; FarAwayDis2, ; Hidden4, ; Erase-and-Restore, ). They usually assume that the benign and adversarial examples originate from different distributions, and construct detectors based on the distinct statistical characteristics of these examples. Specifically, Grosse et al. (STest1, ) utilize Maximum Mean Discrepancy for adversarial example detection. Li et al. employ Principal Component Analysis (PCA) to extract statistical features, and construct a cascade classifier based on Support Vector Machines (SVMs) (PCAInconsistency, ). Feinman et al. (KDBU, ) carry out the detection based on Kernel Density (KD) and Bayesian-Uncertainty (BU) estimation. Ma et al. (lid, ) exploit the concept of Local Intrinsic Dimensionality (LID) to calculate the distance between the distribution of inputs and their neighbors. Lee et al. (md, ) utilize Gaussian Discriminant Analysis (GDA) to model the difference between the benign and adversarial samples, and differentiate them based on Mahalanobis Distance (MD). Liu et al. (steg, ) point out that steganalysis can be applied to adversarial example detection, and propose a steganalysis-based detection method (Steg). Tian et al. (SID, ) reveal the inconsistency in the boundary fluctuations between the adversarial and benign examples, and construct a Sensitivity Inconsistency Detector (SID) to identify the adversarial examples. Wang et al. (txt_advdetection, ) embed hidden-layer feature maps of DNNs into word vectors, and detect adversarial examples via Sentiment Analysis (SA).

Existing adversarial detection methods exhibit poor generalization because their training typically depends on a single known attack which vastly differs from the unseen test attacks. In this paper, we propose a novel adversarial detection method, which can substantially increase the coverage of the entire adversarial feature space and create a larger overlap with the test adversarial attacks.

2.2. Multi-source Unsupervised Domain Adaptation

Transfer learning (transfer_learning_survey1, ; transfer_learning_survey2, ; transfer_learning_survey3, ) is a deep learning technique which leverages knowledge acquired from the source task(s) to improve learning efficiency and performance on a related but different target task. Unsupervised domain adaptation (UDA) (unsuperviesed_domain_adaptation1, ) is a popular type of transfer learning method which aims to transfer knowledge learned from the labeled source domain(s) to the target domain, where only unlabeled target data are available for training. Single-source Unsupervised Domain Adaptation (SUDA) (single-uda1, ; single-uda2, ; single-uda3, ), which transfers knowledge from a single source to one target domain, has been widely explored in previous research. Compared to SUDA, Multi-source Unsupervised Domain Adaptation (MUDA) acquires richer information while introducing a new challenge, i.e., how to effectively bridge the domain gaps between all source domains and the target domain.

Various distribution alignment schemes have been proposed to achieve alignment between source and target domains. For example, Multiple Feature Spaces Adaptation Network (MFSAN) (aaai-mda, ) leverages Maximum Mean Discrepancy (MMD) to align the distributions of each pair of source and target domains in multiple specific feature spaces and aligns the outputs of classifiers by utilizing the domain-specific decision boundaries. Peng et al. (moment-mda, ) provide new theoretical insights specifically for moment matching to align the sources with each other and with the target. Owing to the development of generative adversarial networks, adversarial learning is widely used to find a domain-invariant feature space. It either focuses on approximating all combinations of pairwise domain discrepancies between each source and the target (cocktail, ; DARN, ) or uses a single domain discriminator (MIAN, ). Other explicit measures of discrepancy, such as Wasserstein distance (mdmn, ; Wasserstein, ), are also employed in MUDA to align the distribution of features. In addition to distribution alignment, the graph-matching metric (graph-muda1, ; graph-muda2, ) also considers the structural and geometric information, which achieves the alignment between the source and target domains by mapping both nodes and edges in a graph.

In this paper, we argue that the poor generalization performance of adversarial detection is due to the significant discrepancy between the source domains utilized for training and the target domain utilized for testing. Therefore, based on the MUDA approach, we propose a viable solution, Principal Adversarial Domain Adaptation, to reduce this gap.

3. Methodology

To improve the generalization ability of adversarial example detection, we propose a novel adversarial example detection method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA). The entire framework of our AED-PADA is shown in Fig. 2. AED-PADA contains two stages, Principal Adversarial Domains Identification (PADI) and Principal Adversarial Domain Adaptation (PADA). In the stage of PADI, we first incorporate adversarial supervised contrastive learning (Adv-SCL) to acquire distinguishable ADs. Then, we construct AD clustering to group the ADs into different clusters. By proposing the Coverage of Entire Feature Space (CEFS) metric, we select the most representative ADs from each cluster to form the PADs. In the stage of PADA, we propose an adversarial feature enhancement module on top of the original MUDA method to effectively leverage the PADs to detect unseen adversarial attack methods. Note that Secs. 3.2, 3.3, 3.4 and 3.5 introduce PADI while Sec. 3.6 presents PADA.

3.1. Notations

Suppose we have a labeled $C$-class classification dataset $\mathbb{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$ with $N$ samples, where the label $y\in\{1,2,\ldots,C\}$. A classifier $f$ is trained on $\mathbb{D}$ to classify an input sample into one of the $C$ classes, $f(x)\rightarrow\mathbb{Z}_{C}$. An adversarial attack $\psi$ aims to fool $f$ into assigning incorrect labels by generating adversarial examples. We have a set of adversarial attack methods, $\Psi=\{\psi^{m}\}_{m=1}^{M}$, which consists of $M$ distinct adversarial attack methods. $\mathbb{D}^{m}=\{(x_{i}^{m},y_{i}^{m})\}_{i=1}^{N}$ is the adversarial dataset generated by the $m$-th adversarial attack method $\psi^{m}$, where $x_{i}^{m}$ denotes the $i$-th adversarial example generated by $\psi^{m}$, and $y_{i}^{m}$ denotes its corresponding prediction label. We define $\varphi$ as a mapping from an adversarial example to its attack method, i.e., $\varphi(x_{i}^{m})=\psi^{m}$. Consequently, we construct a set of adversarial examples $\mathcal{D}$ based on $\mathbb{D}$, which comprises $M$ types of adversarial examples, $\mathcal{D}=\{\mathbb{D}^{m}\}_{m=1}^{M}$.

Refer to caption
Figure 2. Our AED-PADA framework contains two stages: (a) Principal Adversarial Domains Identification, which consists of Adversarial Domain Acquisition, Adversarial Domain Clustering and Principal Adversarial Domains Selection, and (b) Principal Adversarial Domain Adaptation.

3.2. Adversarial Domain Acquisition

Typically, when we extract the features of the adversarial examples, which are generated from different untargeted attacks via common CNNs, the features tend to spread in an indistinguishable manner in the feature space. To acquire distinguishable representations of adversarial examples from different attacks, we exploit adversarial supervised contrastive learning (Adv-SCL) to extract features. Then, we form the Adversarial Domains (ADs), each of which is defined as the representations of all the adversarial examples generated from a particular adversarial attack.

Based on the supervised contrastive learning (SCL) method (KhoslaTWSTIMLK20, ), Adv-SCL neglects the classification result of the adversarial examples and focuses solely on identifying their generation methods. Specifically, each adversarial example $x_{i}^{m}$ in $\mathcal{D}$ is characterized by the method used to generate it, i.e., the adversarial attack method $\psi^{m}$. The attack $\psi^{m}$ of an adversarial example $x_{i}^{m}$ serves as the key discriminant for determining whether two examples form a positive or negative pair. Here, a pair of examples from the same attack is considered positive, while a pair from different attacks is considered negative. This learning strategy amplifies the dissimilarities across examples from various attacks and generates a more appropriate representation.

As shown in Fig. 2(a), the input of AD acquisition is the adversarial example set $\mathcal{D}$. Adv-SCL consists of an Encoder Network ($Enc$) and a Projection Network ($Proj$). $Enc(\cdot)$ extracts a feature vector from the input adversarial example $x_{i}^{m}$, and $Proj(\cdot)$ further projects this representation vector to an auxiliary vector $z_{i}=Proj(Enc(x_{i}^{m}))$, which is discarded after training.

Previous studies (transforming1, ; transform2, ) show that transformation strategies, such as cropping and rescaling, bit-depth reduction, JPEG compression, and randomization, have been used to defend against adversarial examples. Therefore, to prevent transformations from potentially invalidating the adversarial examples, we only use normalization ($Norm$) as the data augmentation operation in the stage of AD Acquisition.

We randomly sample $n$ example-label pairs from $\mathcal{D}$, $\{(x_{k}^{m_{k}},y_{k}^{m_{k}})\}_{k=1}^{n}$, where $m_{k}$ denotes the adversarial attack of the $k$-th adversarial example. The corresponding batch employed for training consists of $2n$ pairs, $\{(\tilde{x}_{l},\tilde{y}_{l})\}_{l=1}^{2n}$, where $\tilde{x}_{2k}$ and $\tilde{x}_{2k-1}$ are two views of $x_{k}^{m_{k}}$, and they come from the same adversarial attack method $\psi^{m_{k}}$. The loss function of Adv-SCL is defined as,

(1) $\mathcal{L}^{sup}=\sum_{i\in I}\mathcal{L}_{i}^{sup},$
(2) $\mathcal{L}_{i}^{sup}=\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_{i}\cdot z_{p}/\tau)}{\sum_{a\in A(i)}\exp(z_{i}\cdot z_{a}/\tau)}.$

Here, $i\in I=\{1,\ldots,2n\}$ denotes the index of the augmented samples. $A(i)$ is the subset of $I$ which includes all indices except $i$. $P(i)=\{p\in A(i):\varphi(\tilde{x}_{p})=\varphi(\tilde{x}_{i})\}$ represents the indices of the positive samples in the batch, excluding $i$, and $|P(i)|$ stands for its cardinality. $\cdot$ denotes the dot product, and $\tau$ is the temperature parameter which scales the similarity values.
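To make the training objective concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(2). It assumes the projected vectors are L2-normalized before the dot products, as in standard SCL; the function name and argument layout are illustrative, not the exact implementation.

import torch
import torch.nn.functional as F

def adv_scl_loss(z, attack_ids, tau=0.07):
    # z: (2n, d) projected features of an augmented batch; attack_ids: (2n,) attack index per sample
    z = F.normalize(z, dim=1)                      # cosine similarity via dot products (assumption)
    sim = torch.matmul(z, z.T) / tau               # (2n, 2n) temperature-scaled similarities
    n2 = z.size(0)
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    # positives P(i): samples generated by the same attack, excluding the anchor itself
    pos_mask = (attack_ids.unsqueeze(0) == attack_ids.unsqueeze(1)) & ~self_mask
    # softmax denominator runs over A(i), i.e., all samples except the anchor
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average over positives P(i) (Eq. (2)), then sum over anchors i (Eq. (1))
    loss_i = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss_i.sum()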

3.3. Adversarial Domain Clustering

After the acquisition of ADs, we obtain $\mathbb{H}^{m}=Enc(\mathbb{D}^{m})$ as the AD of the corresponding adversarial attack $\psi^{m}$. For $\mathcal{D}$, we construct a set of ADs, $\mathcal{H}=\{\mathbb{H}^{m}\}_{m=1}^{M}$. Since different ADs tend to distribute differently in the feature space and many of them overlap with each other to various extents, it is quite difficult to directly select the most representative ADs from scratch. Therefore, it is vital to explore the similarities among different ADs.

To address this issue, we intend to perform the selection via two steps, i.e., clustering to assess the similarities of ADs and selecting the most representative ADs from each cluster. Then, we construct a viable solution named Adversarial Domain Clustering, as shown in Fig. 2(a). This strategy groups ADs into different clusters, by ensuring that the similarity among ADs within the same cluster is maximized, while the similarity among ADs across different clusters is minimized. AD clustering avoids the redundancy and additional costs by preventing the repeated selection of similar attack methods, and it facilitates the selection of the most representative ADs in the subsequent steps.

Since the samples to be clustered here are collections of features, rather than individual data points, traditional clustering methods such as K-Means (kmeans, ) cannot be directly applied. Since spectral clustering (NgJW01, ) only requires the similarity matrix among samples, it is utilized to construct the clustering step in our AD clustering.

For the estimation of the similarity matrix $W\in R^{M\times M}$ in spectral clustering, which represents the similarities between different ADs, we propose the Adversarial Domain Similarity Measurement (ADSM) based on the Jensen-Shannon divergence (JSD) (ErvenH14, ), which quantifies the similarity between two probability distributions. To compute the similarities, we transform each $\mathbb{H}^{i}$ in $\mathcal{H}$ to $\overline{\mathbb{H}^{i}}$ by converting $\mathbb{H}^{i}$ into probabilities and performing normalization. Since a smaller JSD value between two ADs implies a smaller discrepancy between their probability distributions, $\text{ADSM}(\mathbb{H}^{i},\mathbb{H}^{j})$ can be computed via

(3) $\text{ADSM}(\mathbb{H}^{i},\mathbb{H}^{j})=\frac{1}{\text{JSD}(\overline{\mathbb{H}^{i}},\overline{\mathbb{H}^{j}})}.$

By letting the element $W_{i,j}$ at the $i$-th row and $j$-th column refer to the similarity between $\mathbb{H}^{i}$ and $\mathbb{H}^{j}$, $W_{i,j}$ can be calculated by

(4) $W_{i,j}=\left\{\begin{array}{cc}\text{ADSM}(\mathbb{H}^{i},\mathbb{H}^{j})&i\neq j\\ 0&i=j\end{array}\right..$
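A sketch of the ADSM similarity matrix in Eqs. (3)-(4) is given below. The paper does not specify how the features are converted into probabilities; averaging a per-sample softmax into one distribution per AD is an assumption made here for illustration, and the helper names are hypothetical.

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def to_prob(H):
    # H: (n_samples, d) features of one AD; softmax per sample, then average (assumption)
    e = np.exp(H - H.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return p.mean(axis=0)

def jsd(p, q):
    # Jensen-Shannon divergence between two distributions
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

def adsm_matrix(H_list, eps=1e-12):
    # H_list: list of M feature arrays, one per adversarial domain
    probs = [to_prob(H) for H in H_list]
    M = len(H_list)
    W = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            if i != j:
                W[i, j] = 1.0 / (jsd(probs[i], probs[j]) + eps)  # Eq. (3); Eq. (4) keeps the diagonal at 0
    return W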

Since spectral clustering cannot automatically determine the optimal number of clusters, the Calinski-Harabasz score (CH score) (mann2023proposed, ), which requires no knowledge of the cluster shape, is utilized to evaluate the clustering performance and estimate the optimal number of clusters. Note that it measures both the within-cluster and between-cluster distances, thereby offering a more comprehensive view of the clustering performance. The CH score is calculated by

(5) $\text{CH}(K)=\frac{Tr(BC_{K})/(K-1)}{Tr(WC_{K})/(N-K)},$

where $K$ and $N$ are the numbers of clusters and data points, respectively, $BC_{K}$ denotes the between-cluster covariance matrix, $WC_{K}$ denotes the within-cluster covariance matrix, and $Tr(\cdot)$ denotes the trace of a matrix.

A higher value of $\text{CH}(K)$ indicates better clustering. We posit that the inherent structure of the entire feature space composed of different ADs is highly complex. Since the CH score often attains its highest value when the number of clusters is $K=2$, we start our search from $K=3$ in this paper. With the help of the CH score, our AD clustering can automatically group the ADs into the optimal number of clusters.
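A sketch of this clustering step with scikit-learn follows. It assumes the precomputed similarity matrix W from Eq. (4) and one summary vector per AD (e.g., its probability representation from the previous sketch) for evaluating the CH score; both representation choices are assumptions, and the search is capped at K = M - 1 because the CH score is undefined when every AD forms its own cluster.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import calinski_harabasz_score

def cluster_ads(W, ad_vectors, k_min=3):
    # W: (M, M) precomputed similarity matrix; ad_vectors: (M, d) one summary vector per AD
    M = W.shape[0]
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(k_min, M):                      # K = 3 .. M-1, following Algorithm 1
        labels = SpectralClustering(n_clusters=k, affinity='precomputed',
                                    random_state=0).fit_predict(W)
        score = calinski_harabasz_score(ad_vectors, labels)  # CH score, Eq. (5)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels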

3.4. Principal Adversarial Domains Selection

After similar ADs are clustered, the selection step can be performed. To form the most effective Principal Adversarial Domains (PADs), it is vital to select appropriate ADs from different clusters. Thus, we propose the Coverage of Entire Feature Space (CEFS) metric to guide the PADs selection process.

CEFS is a ratio-based metric which contains two aspects, Intra-Domain Dispersion (IDD) and Discrepancy between Adversarial Domains (DAD). IDD represents the dispersion among the features within each AD, where a higher value indicates a larger coverage of the feature space. For any AD, $\mathbb{H}^{i}=\{h_{1}^{i},h_{2}^{i},\ldots,h_{v}^{i}\}\in R^{v\times d}$, where $v$ and $d$ denote the number and dimension of the features in $\mathbb{H}^{i}$, respectively, IDD can be computed by

(6) $\text{IDD}(x)=\frac{1}{v}\sum_{i=1}^{v}\left\|\frac{x_{i}}{\|x_{i}\|_{2}}-\frac{1}{v}\sum_{j=1}^{v}\frac{x_{j}}{\|x_{j}\|_{2}}\right\|_{2}.$

DAD represents the discrepancy between two selected ADs, which can be quantified using a distance metric $Dist(\cdot,\cdot)$, such as the Kullback-Leibler divergence (kl-divergence, ) or Maximum Mean Discrepancy (MMD) (mmd, ). A lower DAD value indicates a greater similarity between the two selected ADs. DAD can be calculated as,

(7) $\text{DAD}(\mathbb{H}^{i},\mathbb{H}^{j})=Dist(\overline{\mathbb{H}^{i}},\overline{\mathbb{H}^{j}}).$

Then, CEFS can be obtained via

(8) $\text{CEFS}=\frac{\sum_{i=1}^{K}\text{IDD}(\mathbb{H}^{i})}{\text{DAD}((\mathbb{H}^{1}\parallel\cdots\parallel\mathbb{H}^{K}),(\mathbb{H}^{1}\parallel\cdots\parallel\mathbb{H}^{M}))},$

where $K$ and $M$ are the number of ADs in the PADs and the number of ADs in the AD set $\mathcal{H}$, respectively, and $\parallel$ denotes the concatenation operation.

CEFS is a ratio-based metric which quantifies the coverage of the selected ADs within the entire feature space (EFS). In Eq. (8), the numerator, the sum of the Intra-Domain Dispersions, indicates the feature dispersion within each selected AD. The denominator measures the discrepancy between the selected ADs and the EFS. A larger CEFS corresponds to a larger numerator, indicating a larger feature space covered by each AD, and a smaller denominator, indicating a greater similarity between the selected ADs and the EFS. Consequently, the selected ADs cover a larger portion of the EFS, increasing the likelihood of capturing unseen ADs. We utilize CEFS to select the PADs, which give a larger coverage of the feature space with the same number of ADs. The PADs thus enhance the probability of capturing the location of unseen ADs, thereby improving the generalization performance.
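A sketch of Eqs. (6)-(8) is given below. The distance function for DAD (e.g., MMD or KL divergence on the chosen probability representation, per Eq. (7)) is passed in rather than fixed, and the helper names are illustrative.

import numpy as np

def idd(H):
    # Eq. (6): mean L2 distance of L2-normalized features to their normalized centroid
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    centroid = Hn.mean(axis=0)
    return np.mean(np.linalg.norm(Hn - centroid, axis=1))

def cefs(selected, all_ads, dist_fn):
    # selected: list of the K chosen AD feature arrays; all_ads: list of all M ADs
    # dist_fn: a distance such as MMD or KL on the chosen representation (Eq. (7))
    numerator = sum(idd(H) for H in selected)                       # sum of IDDs
    denominator = dist_fn(np.concatenate(selected, axis=0),         # selected ADs concatenated
                          np.concatenate(all_ads, axis=0))          # entire feature space
    return numerator / denominator                                  # Eq. (8)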

3.5. Training process of Principal Adversarial Domains Identification

The process of Principal Adversarial Domains Identification (PADI) consists of Adversarial Domain Acquisition (AD Acquisition), Adversarial Domain Clustering (AD Clustering) and Principal Adversarial Domains Selection (PADs Selection). The training process of Principal Adversarial Domains Identification is described in Algorithm 1.

Algorithm 1 Training process of Principal Adversarial Domains Identification
1: Input: $\mathcal{D}_{acq}$, $\mathcal{D}_{clu}$.
2: Output: Principal Adversarial Domains (PADs).
3: ▷ Adversarial Domain Acquisition
4: for $x_{i}^{m_{i}}$ in $\mathcal{D}_{acq}$ do
5:     Augment $x_{i}^{m_{i}}$ to $(\tilde{x}_{2i-1},\tilde{x}_{2i})$ with the augmentation $Norm$
6:     $z_{i}=Proj(Enc(\tilde{x}_{i}))$
7:     Obtain $Enc$ and $Proj$ via minimizing $\mathcal{L}^{sup}$ in Eq. (1)
8: end for
9: ▷ Adversarial Domain Clustering
10: Generate the AD pool $\mathcal{H}_{clu}$ based on $\mathcal{D}_{clu}$
11: Generate the similarity matrix $W$ by Eq. (3)
12: for $k$ in $[3,M]$ do
13:     $r_{k}=SC(W,k)$ ▷ $r_{k}$ is the clustering result obtained via spectral clustering.
14:     Compute $\text{CH}(k)$ by Eq. (5)
15: end for
16: $K=\operatorname{arg\,max}_{k}(\text{CH}(k))$ ▷ $K$ is the best number of clusters.
17: $r_{K}=SC(W,K)=\{clu_{i}\}_{i=1}^{K}$ ▷ $r_{K}$ is the best clustering result, which contains $K$ clusters.
18: ▷ Principal Adversarial Domains Selection
19: $C=\Pi_{i=1}^{K}n_{i}$ ▷ $n_{i}$ is the number of adversarial attacks in $clu_{i}$, and $\Sigma_{i=1}^{K}n_{i}=M$.
20: Generate the combination list $L=\{l_{i}\}_{i=1}^{C}$ ▷ $L$ is the list of all cross-cluster combinations in $r_{K}$; each $l_{i}$ contains $K$ types of attacks.
21: for $l_{i}$ in $L$ do
22:     Generate $\mathcal{H}_{i}=\{\mathbb{H}_{i}^{j}\}_{j=1}^{K}$ based on $l_{i}$ ▷ $\mathcal{H}_{i}$ is the list of ADs based on $l_{i}$
23:     Compute CEFS($\mathcal{H}_{i}$) by Eq. (8)
24: end for
25: $\text{PADs}=\operatorname{arg\,max}_{l_{i}}(\text{CEFS}(\mathcal{H}_{i}))$

We divide the original dataset $\mathbb{D}$ into two non-overlapping datasets $\mathbb{D}_{acq}$ and $\mathbb{D}_{clu}$, which are used for AD Acquisition and AD Clustering, respectively. The adversarial example set for AD Acquisition is denoted as $\mathcal{D}_{acq}=\{\mathbb{D}_{acq}^{m}\}_{m=1}^{M}$, $\mathbb{D}_{acq}^{m}=\{(x_{i}^{m},y_{i}^{m})\}_{i=1}^{N_{m}}$, where $M$ is the number of adversarial attacks, $\mathbb{D}_{acq}^{m}$ is the adversarial example set generated by the $m$-th adversarial attack, and $N_{m}$ is the cardinality of $\mathbb{D}_{acq}^{m}$.

Likewise, the adversarial example set for AD Clustering is denoted as $\mathcal{D}_{clu}=\{\mathbb{D}_{clu}^{q}\}_{q=1}^{M}$, $\mathbb{D}_{clu}^{q}=\{(x_{i}^{q},y_{i}^{q})\}_{i=1}^{N_{q}}$, where $M$ is the number of adversarial attacks, $\mathbb{D}_{clu}^{q}$ is the adversarial example set generated by the $q$-th adversarial attack, and $N_{q}$ is the cardinality of $\mathbb{D}_{clu}^{q}$. The rationale behind this strategy is to prevent the model trained via contrastive learning from overfitting to the data used for AD Acquisition, which would lead to subpar performance in AD Clustering and PADs Selection.

3.6. Principal Adversarial Domain Adaptation

To transfer the learned knowledge from PADs to the target domain, i.e., the adversarial examples generated from unseen methods, we propose Principal Adversarial Domain Adaptation (PADA) to detect adversarial examples, as depicted in Fig. 2(b).

Ideally, the inputs of PADA comprise source data and unseen target data. The source data, denoted as $Src=\{Src_{i}\}_{i=1}^{K}$, consist of an equal number of benign and adversarial examples, where the adversarial examples contain the $K$ types of adversarial examples determined by the PADs. Since the unseen target data are unavailable, we can only use the training data as a proxy. The proxy data, denoted as $PD$, also contain an equal number of benign and adversarial examples, drawn from the training benign set $\mathbb{D}$ and the training adversarial set $\mathcal{D}$, respectively. $\mathcal{D}$ contains $M$ types of adversarial examples. During the PADI stage, we select the most representative ADs, i.e., $K$ types out of these $M$ types, to form the PADs. Consequently, employing $\mathcal{D}$ as a proxy for the unseen target data serves two key purposes. Firstly, it prevents overlap between the source data used for training and the unseen target data used for testing. Secondly, the PADA process compels the transfer of the specific knowledge acquired from the PADs to the more extensive set $\mathcal{D}$. This transfer aims to boost the generalization capability of the PADA networks, with the goal of enhancing their performance on the unseen target data.

PADA consists of three sequential components, feature extraction, feature alignment and classification. The framework of PADA is compatible with various widely used Multi-source Unsupervised Domain Adaptation (MUDA) methods (moment-mda, ; DARN, ; mdmn, ; aaai-mda, ). The experimental results indicate that our PADA possesses excellent generalization capabilities based on various existing MUDA methods. Due to the simplicity and effectiveness of MFSAN (aaai-mda, ), along with its superior detection performance compared to other MUDA methods, we select MFSAN as the basic MUDA method for our PADA.

Refer to caption
Figure 3. The three kernels of PEF employed to capture the perturbation signals hidden in the adversarial inputs.

In the feature extraction component, unfortunately, existing MUDA methods, including MFSAN, only extract spatial features. Previous studies (GeirhosRMBWB19, ; WangWHX20, ; FanLCZG21, ) indicate that the high-frequency components of an image play a crucial role in the predictions of deep neural networks. Adversarial perturbations are more likely to be concealed in the high-frequency information of images. To capture more comprehensive features of adversarial examples, we propose an adversarial feature enhancement (AFE) module as the feature extraction component. AFE contains both a spatial feature extraction branch $\mathcal{S}$ and a frequency feature extraction branch $\mathcal{F}$. For frequency feature extraction, we design perturbation extraction filters (PEF), a plug-and-play operation based on the Spatial Rich Model (SRM) (SRM, ), to capture the subtle perturbation signals hidden in the adversarial examples.

SRM typically uses 30 basic kernels to capture textures and discontinuities of images, and has been employed in image forensics to detect subtle and irregular manipulation or hidden information. Subsequent research (SRM_detection, ) indicates that in image manipulation detection, employing only three kernels can achieve considerable detection performance, and using more basic kernels does not further enhance it. Essentially, these kernels are high-pass filters, which enhance the high-frequency signals and remove the low-frequency components of the inputs. Adversarial perturbations are typically hidden within the high-frequency information of an image. Therefore, our PEF uses the same three kernels to extract the subtle perturbation signals. As shown in Fig. 3, PEF consists of three filters implemented by convolution kernels with fixed parameters. We set the kernel size of PEF to $5\times 5\times 3$, and the number of output channels of PEF to 3. Both $\mathcal{S}$ and $\mathcal{F}$ employ $Enc$ from Sec. 3.2, aligned with the threat models, specifically ResNet-18 or VGG-16. Then, the enhanced adversarial features, denoted as $r(x)=[\mathcal{S}(x),\mathcal{F}(x)]$, are fed into the next module.
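Below is a sketch of how PEF can be realized as a convolution with fixed, non-trainable $5\times 5\times 3$ kernels. Only the standard KV SRM kernel is written out, and repeating it across the three output channels is purely illustrative; the actual PEF uses the three distinct SRM kernels shown in Fig. 3.

import torch
import torch.nn as nn

class PEF(nn.Module):
    # Perturbation extraction filters: fixed high-pass kernels applied to the RGB input.
    # The KV kernel below is the common 5x5 SRM filter; the other two SRM kernels
    # (second- and first-order) would be defined analogously (values omitted here).
    def __init__(self):
        super().__init__()
        kv = torch.tensor([[-1.,  2.,  -2.,  2., -1.],
                           [ 2., -6.,   8., -6.,  2.],
                           [-2.,  8., -12.,  8., -2.],
                           [ 2., -6.,   8., -6.,  2.],
                           [-1.,  2.,  -2.,  2., -1.]]) / 12.0
        # (out_channels=3, in_channels=3, 5, 5); repeating the same kernel is for illustration only
        weight = kv.expand(3, 3, 5, 5).clone()
        self.conv = nn.Conv2d(3, 3, kernel_size=5, padding=2, bias=False)
        self.conv.weight = nn.Parameter(weight, requires_grad=False)  # fixed, non-trainable parameters

    def forward(self, x):
        # x: (B, 3, H, W) image batch; output: high-pass residual of the same spatial size
        return self.conv(x)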

In the alignment component, we assign a specific network $Q_{i}$ to map each source domain $Src_{i}$ and the proxy domain $PD$ into a domain-specific feature space, and utilize MMD as the distance metric to align them. $L_{d}$ is used to align the features between the source domains and the proxy domain,

(9) $L_{d}=\frac{1}{K}\sum_{i=1}^{K}\text{MMD}(Q_{i}(r(Src_{i})),Q_{i}(r(PD))),$

where $r$ denotes the enhanced adversarial features after AFE, and $K$ is the number of source domains. Each source domain is associated with its corresponding network $Q_{i}$.

In the classification component, we utilize the cross-entropy loss $L_{cls}$ to ensure correct classification. Since the domain-specific classifiers $C_{i}$ are trained on their respective source domains, significant discrepancies arise among their predictions for the same proxy domain. To address this, $L_{disc}$ is used to minimize the differences among the predictions of the various classifiers,

(10) $L_{disc}=\frac{2}{K\times(K-1)}\sum_{j=1}^{K-1}\sum_{i=j+1}^{K}\mathbb{E}_{x\sim PD}|C_{i}(Q_{i}(r(x)))-C_{j}(Q_{j}(r(x)))|.$

Overall, the total loss is formulated as follows, where $\lambda$ and $\gamma$ are hyperparameters used to adjust the weights of $L_{d}$ and $L_{disc}$, respectively.

(11) $L_{total}=L_{cls}+\lambda L_{d}+\gamma L_{disc}.$
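A sketch of the PADA objective in Eqs. (9)-(11) is given below. It assumes the domain-specific mappers Q_i and classifiers C_i are PyTorch modules and an external MMD implementation is available; averaging L_cls over the K sources and computing L_disc on softmax outputs follow MFSAN-style implementations and are assumptions rather than the exact recipe.

import torch
import torch.nn.functional as F

def pada_losses(Q, C, r_src, y_src, r_pd, mmd, lam=1.0, gamma=1.0):
    # Q, C: lists of K domain-specific feature mappers and classifiers
    # r_src[i]: enhanced AFE features of the i-th source domain; y_src[i]: its benign/adversarial labels
    # r_pd: enhanced AFE features of the proxy domain; mmd: an MMD implementation (assumed given)
    K = len(Q)

    # classification loss on the labeled source domains (averaging over sources is an assumption)
    L_cls = sum(F.cross_entropy(C[i](Q[i](r_src[i])), y_src[i]) for i in range(K)) / K

    # Eq. (9): align each source with the proxy domain in its own feature space
    L_d = sum(mmd(Q[i](r_src[i]), Q[i](r_pd)) for i in range(K)) / K

    # Eq. (10): pairwise L1 discrepancy among classifier predictions on the proxy domain
    probs = [F.softmax(C[i](Q[i](r_pd)), dim=1) for i in range(K)]
    L_disc = 0.0
    for j in range(K - 1):
        for i in range(j + 1, K):
            L_disc = L_disc + (probs[i] - probs[j]).abs().mean()
    L_disc = 2.0 / (K * (K - 1)) * L_disc

    # Eq. (11): total loss
    return L_cls + lam * L_d + gamma * L_disc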

4. Experiments

4.1. Experimental Setups

4.1.1. DNN backbones

We evaluate the performance of the proposed method, by employing two widely used DNN architectures, i.e., ResNet-18  (HeZRS16, ) and VGG-16 (SimonyanZ14a, ), according to (SID, ).

4.1.2. Datasets

We evaluate the performance of the proposed method on three popular datasets, CIFAR-10 (cifar10, ), SVHN (SVHN, ) and ImageNet (imagenet, ). All the images in ImageNet are resized to $224\times 224\times 3$ via pre-processing.

As shown in Table 1, each of the three datasets is divided into three category-balanced and non-overlapping subsets: Train-acq-src, Train-clu-pro and Test. The training data for AD Acquisition in the PADI stage and the source data in the PADA stage are selected from Train-acq-src. Similarly, the training data for AD Clustering in the PADI stage and the proxy data in the PADA stage are selected from Train-clu-pro.

For CIFAR-10, we divide the official CIFAR-10 training set into two halves, each of which contains 25,000 images, to form Train-acq-src and Train-clu-pro. Besides, we form Test with all the 10,000 images in the official CIFAR-10 testing set. For SVHN, we set the sizes of Train-acq-src and Train-clu-pro to 20,000 instead of 25,000 to construct category-balanced training subsets, due to the category imbalance in the official SVHN training set. Test of SVHN also consists of 10,000 images, which are randomly selected from the official SVHN testing set. For ImageNet, we divide the official ImageNet (ILSVRC2012) validation set, which consists of 50,000 images, into two parts, 40,000 images for training and 10,000 images for Test. The 40,000 training images are then equally divided into two subsets, Train-acq-src and Train-clu-pro, each of which contains 20,000 images.

Ten types of training adversarial examples are created based on the benign examples from the Train-acq-src and Train-clu-pro subsets. Subsequently, another seven attacks are applied to generate the testing adversarial examples on the Test set. These specific adversarial attack methods are detailed in Sec. 4.1.4. Note that the attack methods for training and testing are entirely different, ensuring that the training data and testing data are mutually exclusive.

4.1.3. Data splitting strategy for the training stage

Table 2 presents the data splitting strategy for the training stage of our AED-PADA. AED-PADA contains two stages during training. In the PADI stage, to improve efficiency, we randomly select 10,000 images from each type of adversarial attack and their corresponding benign images from Train-acq-src for supervised contrastive learning. To alleviate the overfitting problem, we select 10,000 images per attack from Train-clu-pro for AD Clustering and PADs Selection.

In the PADA stage, the inputs of PADA are the images from the source domains and the unseen target domains. Each source domain contains 10,000 samples, with an equal split of 5,000 benign examples and 5,000 adversarial examples, which are all randomly selected from Train-acq-src. Since the data from unseen domains is unavailable, we can only utilize the training examples as a proxy, which are generated from all the 10 types of training attacks. Specifically, the proxy domain contains 10,000 samples, i.e., 5,000 benign examples and 5,000 adversarial examples, which are all randomly selected from the Train-clu-pro. Note that the adversarial examples in the proxy domain are obtained via 10 different training attacks, with each attack providing 500 examples. This specific arrangement has two benefits. Firstly, it ensures zero overlap between the source and proxy domains during training, to avoid data leakage. Secondly, it enhances the diversity of the training data, to further benefit the generalization of the detection model.

Table 1. Datasets for training and testing.
Dataset Train-acq-src Train-clu-pro Test
benign adv benign adv benign adv
CIFAR-10 25,000 10×25,000 25,000 10×25,000 10,000 7×10,000
SVHN 20,000 10×20,000 20,000 10×20,000 10,000 7×10,000
ImageNet 20,000 10×20,000 20,000 10×20,000 10,000 7×10,000
Table 2. Data splitting for the training stage of our AED-PADA.
Training stage benign examples adversarial examples
PADI AD Acquisition - 10,000/attack
AD Clustering - 10,000/attack
PADA source domain 5,000/source 5,000/source
proxy domain 5,000 5,000

4.1.4. Baseline adversarial attack methods

To evaluate the generalization capabilities of the detection methods, it is important to consider a diverse set of attack methods, including both earlier and recent techniques. Here, 10 earlier attack methods are utilized to generate adversarial examples for training, including FGSM (FGSM, ), BIM (BIM, ), C&W (CW, ), DeepFool (DeepFool, ), PGD (pgd, ), MI-FGSM (MIM, ), DIM (DIM, ), ILA (ILA, ), YA-ILA (ILA++, ) and SI-NI-FGSM (si-ni-fgsm, ). Seven SOTA attack methods are employed to generate adversarial examples for testing, including APGD (APGD, ), ILA-DA (ILA-DA, ), Jitter (jitter, ), SSA (SSA, ), TI-FGSM (TIM, ), VMI-FGSM (vmifgsm, ) and VNI-FGSM (vmifgsm, ). The adversarial attacks for training and testing are entirely distinct, so the adversarial detection results on the unseen testing attacks indicate the generalization performance of our proposed detection method.

4.1.5. Baseline adversarial detection methods

We compare the generalizability of our AED-PADA with five state-of-the-art adversarial detection methods, LID (lid, ), MD (md, ), Steg (steg, ), SID (SID, ) and SA (txt_advdetection, ). These detection methods differ from ours as they employ only a single attack method for training. To ensure a fair comparison with our AED-PADA, it is necessary to also consider the scenario of training with multiple attacks. Consequently, we consider the following two configurations for the SOTA detection methods: (1) For single-attack training, we utilize all data from Train-acq-src to ensure sufficient data volume for effective training. (2) For multiple-attack training, the methods are trained on the adversarial examples in the PADs, which are consistent with our AED-PADA. Each source domain contains 10,000 samples from Train-acq-src, evenly divided into 5,000 benign and 5,000 adversarial examples.

Table 3. Comparison of the generalization performances on CIFAR-10 between the state-of-the-art adversarial detection methods and our AED-PADA. Averaged Accuracy is the average result across seven unseen state-of-the-art testing adversarial attacks. The bolded and the underlined values represent the best and the second best results for each column, respectively.
Dataset (Backbone) Detector Accuracy on Unseen SOTA Adversarial Attacks (%) Averaged Accuracy (%)
APGD ILA-DA Jitter SSA TI-FGSM VMI-FGSM VNI-FGSM
CIFAR-10 (ResNet-18) LID (lid, ) 90.044 94.121 78.240 59.839 90.863 91.056 86.496 84.380
LID-PADs 78.903 86.068 70.045 57.330 82.278 82.333 79.978 76.705
MD (md, ) 65.976 96.657 61.764 54.225 69.282 69.653 69.726 69.612
MD-PADs 63.371 97.970 62.058 51.501 68.100 68.814 69.351 68.738
Steg (steg, ) 83.521 94.692 86.759 58.064 90.496 90.814 92.057 85.200
Steg-PADs 84.085 94.155 88.885 62.305 90.715 90.860 91.690 86.099
SID (SID, ) 86.113 60.999 74.629 53.732 87.323 87.330 82.100 76.032
SID-PADs 86.520 61.820 74.860 54.010 88.285 88.220 82.935 76.664
SA (txt_advdetection, ) 84.907 90.731 87.365 81.323 89.275 90.013 91.967 87.940
SA-PADs 89.125 93.485 91.925 88.885 94.760 94.875 95.595 92.664
AED-PADA 90.545 97.855 97.065 84.255 97.700 97.765 97.675 94.694
CIFAR-10 (VGG-16) LID (lid, ) 85.024 93.357 75.092 58.639 88.201 88.173 81.540 81.432
LID-PADs 70.493 89.403 65.190 58.588 75.553 75.408 73.500 72.590
MD (md, ) 51.322 72.013 50.904 52.012 51.809 51.799 50.963 54.403
MD-PADs 50.033 81.779 50.028 51.349 50.005 50.856 51.349 55.057
Steg (steg, ) 79.280 92.756 88.466 57.470 88.854 88.891 89.390 83.586
Steg-PADs 79.900 92.265 88.705 59.360 89.015 88.800 89.235 83.897
SID (SID, ) 83.235 66.047 73.260 53.373 84.743 84.582 76.210 74.493
SID-PADs 83.105 66.240 73.230 53.375 84.605 84.385 76.145 74.441
SA (txt_advdetection, ) 80.198 85.036 81.003 63.984 85.356 86.293 87.627 81.357
SA-PADs 79.720 83.825 77.985 63.335 83.765 84.475 85.755 79.873
AED-PADA 84.370 94.365 93.260 65.765 93.135 93.410 93.545 88.264
Table 4. Comparison of the generalization performances on SVHN between the state-of-the-art adversarial detection methods and our AED-PADA. Averaged Accuracy is the average result across seven unseen state-of-the-art testing adversarial attacks. The bolded and the underlined values represent the best and the second best results for each column, respectively.
Dataset (Backbone) Detector Accuracy on Unseen SOTA Adversarial Attacks (%) Averaged Accuracy (%)
APGD ILA-DA Jitter SSA TI-FGSM VMI-FGSM VNI-FGSM
SVHN (ResNet-18) LID (lid, ) 66.431 87.940 63.697 60.324 71.403 71.384 64.952 69.447
LID-PADs 61.849 86.846 60.551 58.515 69.581 69.639 64.182 67.309
MD (md, ) 61.167 66.439 59.906 58.270 67.625 67.609 60.870 63.127
MD-PADs 60.768 50.190 59.637 57.967 67.257 67.337 60.688 60.549
Steg (steg, ) 71.591 51.509 94.268 59.865 96.943 96.874 95.954 81.000
Steg-PADs 73.460 50.860 97.115 74.360 98.005 98.040 97.575 84.202
SID (SID, ) 68.859 92.207 63.763 56.571 74.074 73.995 64.150 70.517
SID-PADs 69.670 93.420 64.155 56.800 75.285 75.120 64.780 71.319
SA (txt_advdetection, ) 68.092 73.066 65.288 69.276 82.758 82.823 82.467 74.824
SA-PADs 69.490 73.100 63.610 66.550 78.155 78.220 77.785 72.416
AED-PADA 74.455 99.730 99.730 87.305 99.730 99.730 99.730 94.344
SVHN (VGG-16) LID (lid, ) 65.645 96.605 62.731 64.252 74.205 74.263 70.296 72.571
LID-PADs 69.710 97.154 63.626 63.955 76.258 76.306 70.119 73.875
MD (md, ) 51.511 58.333 51.233 51.859 51.870 52.268 51.465 52.648
MD-PADs 51.920 56.484 51.578 52.745 51.103 50.285 51.748 52.266
Steg (steg, ) 70.586 98.661 93.882 58.496 95.774 95.575 94.549 86.789
Steg-PADs 72.925 98.975 97.700 58.905 98.485 98.480 98.085 89.079
SID (SID, ) 68.159 69.813 61.320 60.722 74.468 74.454 67.213 68.021
SID-PADs 68.365 62.450 61.145 56.790 70.075 70.080 63.470 64.625
SA (txt_advdetection, ) 65.862 72.558 63.304 67.745 72.087 72.183 73.811 69.650
SA-PADs 65.060 69.765 59.365 62.605 66.795 66.955 67.385 65.419
AED-PADA 73.500 99.220 98.995 68.990 99.295 99.310 99.220 91.219
Table 5. Comparison of the generalization performances on ImageNet between the state-of-the-art adversarial detection methods and our AED-PADA. Averaged Accuracy is the average result across seven unseen state-of-the-art testing adversarial attacks. The bolded and the underlined values represent the best and the second best results for each column, respectively.
Dataset (Backbone) Detector Accuracy on Unseen SOTA Adversarial Attacks (%) Averaged Accuracy (%)
APGD ILA-DA Jitter SSA TI-FGSM VMI-FGSM VNI-FGSM
ImageNet (ResNet-18) LID (lid, ) 52.868 54.078 50.991 53.263 53.793 56.566 56.549 54.015
LID-PADs 57.340 63.910 55.195 58.150 58.890 60.900 61.245 59.376
MD (md, ) 53.760 52.031 51.728 51.603 53.533 59.438 58.536 54.376
MD-PADs 54.560 50.425 50.685 52.765 54.695 62.165 61.040 55.191
Steg (steg, ) 78.415 86.954 86.553 85.431 65.101 94.679 94.606 84.534
Steg-PADs 90.725 94.735 94.700 94.535 89.660 94.875 94.875 93.444
SID (SID, ) 51.885 56.475 51.745 51.425 55.285 55.335 54.995 53.878
SID-PADs 55.070 56.975 53.785 52.335 59.790 59.860 58.400 56.602
SA (txt_advdetection, ) 85.738 87.149 85.912 85.712 86.960 85.988 86.014 86.210
SA-PADs 66.955 69.230 67.665 66.930 67.385 68.490 68.550 67.886
AED-PADA 98.830 99.935 99.965 99.965 99.070 99.965 99.960 99.670
ImageNet (VGG-16) LID (lid, ) 53.144 53.672 52.029 51.694 52.056 57.081 56.173 53.693
LID-PADs 56.295 59.330 55.110 54.825 55.695 59.960 59.310 57.218
MD (md, ) 53.760 52.031 51.728 51.603 53.533 59.438 58.536 54.376
MD-PADs 53.470 50.705 51.960 50.970 52.430 57.385 56.585 53.358
Steg (steg, ) 78.931 90.915 85.852 84.279 66.300 94.695 94.969 85.134
Steg-PADs 90.585 94.185 94.115 94.165 90.375 94.530 94.535 93.213
SID (SID, ) 41.718 50.624 50.477 44.102 51.015 51.567 51.528 48.719
SID-PADs 43.238 46.165 53.794 50.002 49.125 53.257 46.382 48.852
SA (txt_advdetection, ) 87.747 93.025 89.939 88.176 86.256 92.133 92.255 89.933
SA-PADs 91.590 94.910 93.415 92.310 91.280 94.250 94.280 93.148
AED-PADA 98.960 99.930 99.935 99.890 98.225 99.935 99.930 99.544

4.1.6. Implementation details

In this paper, adversarial examples are generated by untargeted white-box attacks under the $l_{\infty}$ norm constraint. The perturbation of the training adversarial examples is constrained to a challenging scenario, where the maximum magnitude of the adversarial perturbation is set to 2. The step size and the number of iterations of the adversarial attacks are set to 1/255 and 10, respectively. During the PADA stage, we utilize MFSAN (aaai-mda, ) as the basic MUDA method, employ the same training strategy as MFSAN, and set the trade-off parameters $\lambda=\gamma=1.0$, which respectively control the importance of $L_{d}$ and $L_{disc}$. All the adversarial example detection methods are trained consistently for 100 epochs. To evaluate the performance of our proposed method, the widely used Accuracy is employed as the metric for the adversarial example detection task. The experiments are conducted on an NVIDIA GeForce RTX 3080Ti GPU.
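As a hedged sketch of this attack configuration, the snippet below assumes the torchattacks library is used for attack generation and that the maximum magnitude of 2 is expressed on the 0-255 pixel scale; model, images and labels are placeholders for the threat model and its inputs.

import torchattacks  # assumption: torchattacks is used to generate the attacks

# PGD under the L_inf constraint with the training configuration described above:
# maximum perturbation 2 (i.e., 2/255 on [0,1]-normalized inputs), step size 1/255, 10 iterations.
attack = torchattacks.PGD(model, eps=2/255, alpha=1/255, steps=10)
adv_images = attack(images, labels)  # untargeted white-box adversarial examples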

4.2. Performance Evaluations

Here, we compare our method with 5 SOTA detection methods, including LID (lid, ), MD (md, ), Steg (steg, ), SID (SID, ) and SA (txt_advdetection, ).

4.2.1. Generalization performances against unseen adversarial attacks

Tables 3, 4, and 5 present the cross-attack detection results across 7 unseen testing adversarial attacks on CIFAR-10, SVHN, and ImageNet, respectively. Existing adversarial detection methods, i.e., those whose names do not carry the '-PADs' suffix, are trained on a single attack with their original settings. Each of their results is the detection accuracy averaged over the 10 training attacks, utilized one after another in the training process. On the contrary, the methods with the '-PADs' suffix are trained on multiple attacks so that they can be fairly compared to our AED-PADA.

These results provide compelling evidence that our approach clearly outperforms these state-of-the-art adversarial detection methods in terms of generalization ability. Note that the performance of our approach (99.544%) surpasses that of SID (48.719%) by a significant margin, i.e., 50.825%, under the settings of ImageNet and VGG-16. Besides, our superiority is particularly evident in challenging scenarios, where the maximum magnitude of the adversarial perturbation is merely set to 2. In previous studies, by contrast, the maximum magnitude is typically set to 4 or 8. The reduction in the magnitude of the perturbations significantly increases the difficulty of detecting adversarial examples and reduces the detection performance. The experimental results in Tables 3, 4, and 5 demonstrate that our AED-PADA is effective in scenarios characterized by subtle adversarial perturbations.

As can be observed, our AED-PADA demonstrates better generalization ability on ImageNet compared to that on CIFAR-10 and SVHN, because the image resolution in ImageNet ($224\times 224$) is higher than that in CIFAR-10 and SVHN ($32\times 32$). Intuitively, a larger image tends to provide a larger space for adversarial perturbation generation, which makes the differences between the perturbations obtained from various attack methods more pronounced, i.e., reducing the overlaps between different adversarial attacks in the adversarial feature space. By selecting PADs, our AED-PADA is able to occupy a larger feature space, thus achieving better generalization ability.

Besides, PADs may not be suitable to be directly applied to the existing methods. As can be observed, if the SOTA methods directly adopt PADs for training (denoted as X-PADs), their performances may not always increase. For instance, the generalization performance of LID-PADs on CIFAR-10 is inferior to that of LID. We postulate that this discrepancy can be attributed partially to the reduction in the volume of the training data, and partially to the incompatibility between PADs and the framework of LID. This observation further verifies the effectiveness of our PADA in our AED-PADA framework.

4.2.2. Generalization performances across different backbones and datasets

To thoroughly evaluate the generalization ability of adversarial detection methods, it is also vital to assess their performances in the scenarios where the training and testing adversarial examples are from different backbones and datasets. A detection method with better cross-attack, cross-backbone, and cross-dataset capabilities tends to be better suited for practical applications in complex environments.

Table 6 displays the generalization performances of different detection methods across different backbones and datasets. The cross-backbone results are obtained on CIFAR-10, and the cross-dataset results are obtained when the backbone is ResNet-18. In general, the results from both scenarios demonstrate that our method exhibits the best generalization ability. Since the structures tend to vary significantly among different types of backbones, which results in different feature dimensions between training and testing, both LID and MD are not applicable in the cross-backbone experiments, i.e., they lack the capability to generalize across backbones, because they utilize the intermediate features of the backbones for detection.

Table 6. Comparison of generalization performances across different backbones and datasets. The bolded and the underlined values represent the best and the second best results for each column, respectively.
Detector Averaged Accuracy (%)
Cross-Backbone Cross-Dataset
ResNet-18\rightarrow VGG-16 VGG-16\rightarrow ResNet-18 CIFAR-10\rightarrow SVHN SVHN\rightarrow CIFAR-10
LID (lid, ) - - 71.619 80.233
MD (md, ) - - 62.437 61.429
Steg (steg, ) 82.894 85.157 53.249 78.155
SID (SID, ) 74.169 76.458 68.928 73.442
SA (txt_advdetection, ) 80.751 87.773 58.910 51.694
Ours 89.301 91.624 82.964 85.113
Table 7. Comparison of generalization performances across different maximum perturbation magnitudes on CIFAR-10 when the backbone is ResNet-18. The maximum magnitude of the testing perturbation is set to 1, 2, 4, 8, respectively.
Detector Known Unseen
2 1 4 8 avg
LID [13] 84.4 56.2 71.0 76.1 67.8
MD [14] 69.6 57.1 66.2 66.4 63.2
Steg [15] 85.2 67.8 90.0 90.4 82.7
SID [16] 76.0 55.2 68.0 64.6 62.6
SA [17] 87.9 75.7 89.8 88.9 84.8
Ours 94.7 89.9 97.0 95.2 94.0
Table 8. Comparison of generalization performances across different maximum perturbation magnitudes on SVHN when the backbone is ResNet-18. The maximum magnitude of the testing perturbation is set to 1, 2, 4, 8, respectively.
Detector Known Unseen
2 1 4 8 avg
LID [13] 69.4 55.0 69.9 75.0 66.6
MD [14] 63.1 55.4 63.4 65.2 61.3
Steg [15] 81.0 58.3 94.7 95.4 82.8
SID [16] 70.5 51.9 64.8 65.4 60.7
SA [17] 74.8 55.7 83.9 84.1 74.6
Ours 93.3 89.1 98.6 99.5 95.7

4.2.3. Generalization performances across different maximum perturbation magnitudes

Here, we evaluate the robustness of the detection models by focusing on a challenging scenario where the maximum perturbation magnitude of the adversarial examples to be detected is unknown. All the detection models are trained on adversarial examples with a maximum perturbation magnitude of 2, while the maximum magnitude of the testing adversarial perturbations is varied across 1, 2, 4 and 8. Tables 7 and 8 present the generalization performances across different maximum perturbation magnitudes on CIFAR-10 and SVHN, respectively. Based on these results, our AED-PADA consistently outperforms the state-of-the-art adversarial detection methods across all the testing scenarios, demonstrating the superior generalization ability of our method against adversarial perturbations with unseen magnitudes.

4.3. Effectiveness of AD clustering and PADs Selection

Selecting multiple adversarial attacks from the same cluster leads to redundancy, because such candidate attacks are quite similar and their combination covers only a small portion of the feature space, which causes poor generalization. Choosing attacks from different clusters effectively avoids this issue. Consequently, as shown in Table 9 and Table 10, we set up two groups of experiments, i.e., selecting ADs from the Same Cluster and from Cross Clusters, to verify the effectiveness of the AD Clustering and PADs Selection of AED-PADA against the 7 unseen attacks. ‘Same Cluster’ and ‘Cross Cluster’ indicate that the ADs employed as the source domains are selected from the same AD cluster and from different AD clusters, respectively.

Table 9 presents the generalization performances on CIFAR-10 and ResNet-18. The AD clustering result is {FGSM, PGD, DIM, MI-FGSM, SI-NI-FGSM}-{BIM, ILA, YA-ILA}-{C&W, DeepFool}. For ‘Same Cluster’, the best, worst and mean values of the averaged results are 93.494%, 90.444%, and 92.150%, respectively. For ‘Cross Cluster’, the best, worst, and mean values are 94.694%, 92.429%, and 93.892%, respectively. Similarly, Table 10 shows the results on CIFAR-10 and VGG-16. The AD clustering result is {BIM, ILA, YA-ILA, PGD}-{DIM, MI-FGSM, FGSM}-{C&W, DeepFool, SI-NI-FGSM}. For ‘Same Cluster’, the best, worst and mean values of the averaged results are 88.019%, 87.132%, and 87.461%, respectively. For ‘Cross Cluster’, the best, worst, and mean values are 88.264%, 87.406%, and 87.818%, respectively.

Table 9. Detection Results of different source domains on CIFAR-10 when the backbone is ResNet-18.
Origin Detector Averaged Accuracy (%) CEFS (↑)
Same Cluster FGSM+SI-NI-FGSM+PGD 90.444 -
BIM+ILA+YA-ILA 92.513 -
MI-FGSM+SI-NI-FGSM+DIM 93.494 -
Cross Cluster DIM+BIM+C&W 94.694 80.856
MI-FGSM+BIM+C&W 94.621 72.625
SI-NI-FGSM+BIM+C&W 94.624 58.649
FGSM+BIM+C&W 94.600 58.543
PGD+BIM+C&W 94.222 50.669
FGSM+BIM+DeepFool 93.407 50.142
FGSM+ILA+DeepFool 92.539 49.818
PGD+ILA+DeepFool 92.429 48.015
Table 10. Detection Results of different source domains on CIFAR-10 when the backbone is VGG-16.
Origin Detector Averaged Accuracy (%) CEFS (↑)
Same Cluster BIM+ILA+YA-ILA 87.389 -
BIM+ILA+PGD 87.153 -
BIM+YA-ILA+PGD 87.132 -
ILA+YA-ILA+PGD 87.439 -
DIM+MI-FGSM+FGSM 87.634 -
C&W+DeepFool+SI-NI-FGSM 88.019 -
Cross Cluster ILA+DIM+C&W 88.264 18.692
PGD+DIM+C&W 88.236 18.605
ILA+FGSM+C&W 87.969 18.532
YA-ILA+FGSM+C&W 87.897 18.527
YA-ILA+MI-FGSM+SI-NI-FGSM 87.670 18.518
BIM+DIM+SI-NI-FGSM 87.676 18.410
YA-ILA+FGSM+SI-NI-FGSM 87.425 18.388
YA-ILA+DIM+SI-NI-FGSM 87.406 18.385

Based on the results, we can make three observations. Firstly, the mean result of ‘Cross Cluster’ selection is higher than that of ‘Same Cluster’ selection, and the best result of ‘Cross Cluster’ selection is also superior. Notably, on CIFAR-10 with ResNet-18, the mean result of ‘Cross Cluster’ selection even exceeds the best result of ‘Same Cluster’ selection. This verifies the effectiveness of our AD Clustering, i.e., selecting source domains across clusters enhances the generalization ability of the detection method.

Secondly, although ‘Cross Cluster’ selection achieves better results in general, it cannot guarantee that randomly selected cross-cluster ADs always outperform ADs selected from the same cluster. When PADs with a very low CEFS score are employed as the source domains, the generalization performance may be worse than that of ‘Same Cluster’ selection. This observation further verifies the effectiveness of our PADs Selection.

Thirdly, as shown in the ‘Cross Cluster’ parts of Tables 9 and 10, for the majority of results, a higher CEFS value of the PADs induces a higher averaged accuracy of the proposed detector. This observation suggests that CEFS is a proper guide for selecting PADs, i.e., a higher CEFS score indicates that the corresponding PADs cover a larger proportion of the entire feature space, and the PADA stage, which uses these PADs as source domains, then yields better generalization performance.
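For clarity, the sketch below illustrates the cross-cluster selection guided by CEFS. The clustering result in the usage example is the one reported above; `cefs_score` is a hypothetical callable standing in for the CEFS computation defined earlier in the paper, so this illustrates the selection logic rather than the exact implementation.

```python
from itertools import product

def select_pads(clusters, cefs_score):
    """Pick one AD per cluster and keep the combination with the highest CEFS."""
    best_combo, best_score = None, float("-inf")
    for combo in product(*clusters):        # every cross-cluster combination
        score = cefs_score(combo)           # larger score -> larger feature-space coverage
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score

# Example with the CIFAR-10 / ResNet-18 clustering reported above (cefs_score is hypothetical):
# pads, score = select_pads(
#     [["FGSM", "PGD", "DIM", "MI-FGSM", "SI-NI-FGSM"],
#      ["BIM", "ILA", "YA-ILA"],
#      ["C&W", "DeepFool"]],
#     cefs_score=my_cefs)
```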

4.4. The effect of different numbers of clusters

Here, we apply AD Clustering to the 10 adversarial attack methods in the training set and study the impact of the number of clusters K (i.e., the number of adversarial attacks in the PADs). As depicted in Table 11, the detection performance increases as K increases, until a certain value of K is reached. This trend is attributed to the utilization of more ADs as the source domains, which provides a larger coverage of the entire feature space. Note that the values marked with † represent the generalization performance of our AED-PADA, in which the number of clusters is automatically determined based on the CH score, and the bolded values represent the best generalization performance.

As can be observed, our automatic CH-score-based method yields superior performance on SVHN with VGG-16 as the backbone, and gives performance comparable to the best result in the other settings. Although the best result obtained by manually selecting K exceeds our automatic method by 0.2%-0.3% in terms of the averaged accuracy, it requires 2.3× more parameters and runs 3× slower. Moreover, in real-world scenarios, the adversarial attacks in the training set may far exceed 10 types, and it is clearly impractical to test the detection performance of every clustering result individually. Consequently, our method achieves a better balance between the training costs and the generalization performance.
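A minimal sketch of the automatic selection of K via the CH score is given below; `ad_features` is a hypothetical matrix of adversarial-domain representations, and scikit-learn's KMeans merely stands in for the clustering algorithm used in PADI.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def choose_k_by_ch(ad_features, k_range=range(2, 8)):
    """Cluster the ADs for each candidate K and keep the K with the highest CH score."""
    best_k, best_ch, best_labels = None, -np.inf, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(ad_features)
        ch = calinski_harabasz_score(ad_features, labels)
        if ch > best_ch:
            best_k, best_ch, best_labels = k, ch, labels
    return best_k, best_labels
```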

Table 11. Detection performance across the seven unseen testing adversarial attacks with different numbers of clusters K.
Dataset (Backbone) Averaged Accuracy (%)
K=2 K=3 K=4 K=5 K=6 K=7
CIFAR-10 (ResNet-18) 91.936 94.694† 94.784 94.571 94.561 95.018
CIFAR-10 (VGG-16) 87.946 88.264† 87.964 87.991 88.104 88.498
SVHN (ResNet-18) 91.497 94.344† 91.257 91.726 94.999 94.801
SVHN (VGG-16) 89.183 89.269 91.219† 90.056 90.026 90.713
Table 12. Detection Performance of different feature extractions in PADA of our AED-PADA.
Dataset Feature Extraction Averaged Accuracy (%) Average
ResNet-18 VGG-16
CIFAR-10 spatial 93.411 88.251 90.831
freq 93.379 86.755 90.067
spatial + freq 94.694 88.264 91.479
SVHN spatial 94.612 89.721 92.166
freq 88.157 91.580 89.869
spatial + freq 94.344 91.219 92.781

4.5. Effectiveness of the adversarial feature enhancement in PADA

In this experiment, we compare three feature extraction schemes, i.e., spatial feature extraction, frequency feature extraction, and our adversarial feature enhancement (spatial + freq), within our framework. The same 7 unseen testing attacks are employed and the averaged results are reported in Table 12. As can be observed, on both datasets, the average results of our adversarial feature enhancement (spatial + freq) surpass those of both the spatial feature extraction and the frequency feature extraction, which indicates that adopting the adversarial feature enhancement as the feature extraction component of PADA is effective. Furthermore, the performance of single-domain feature extraction, whether spatial or frequency, is inconsistent across different backbones, as evidenced by the fluctuating averaged detection performance on SVHN. Our adversarial feature enhancement, which combines both the spatial and frequency aspects, captures the adversarial perturbation signals more comprehensively and thus yields superior adversarial detection performance.
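As a conceptual sketch (not the exact architecture), the adversarial feature enhancement can be viewed as a spatial branch operating on the image and a frequency branch operating on its amplitude spectrum, with the two feature vectors concatenated before the detector head; the branch modules below are placeholders.

```python
import torch
import torch.nn as nn

class SpatialFreqFeatures(nn.Module):
    """Concatenate spatial features with features of the FFT amplitude spectrum."""
    def __init__(self, spatial_branch: nn.Module, freq_branch: nn.Module):
        super().__init__()
        self.spatial_branch = spatial_branch    # e.g., a small CNN on RGB pixels
        self.freq_branch = freq_branch          # e.g., a small CNN on the spectrum

    def forward(self, x):                       # x: (B, C, H, W) in [0, 1]
        f_spatial = self.spatial_branch(x)
        spectrum = torch.fft.fft2(x, norm="ortho").abs()   # per-channel amplitude spectrum
        f_freq = self.freq_branch(spectrum)
        return torch.cat([f_spatial, f_freq], dim=1)       # enhanced adversarial feature
```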

4.6. The compatibility with different MUDA methods

To demonstrate the compatibility of our framework with existing Multi-source Unsupervised Domain Adaptation (MUDA) methods, we employ four widely used MUDA methods, i.e., M3DA (moment-mda, ), DARN (DARN, ), MDMN (mdmn, ) and MFSAN (aaai-mda, ). All four MUDA methods utilize the adversarial feature enhancement as the feature extraction component. Table 13 shows that our framework surpasses the existing state-of-the-art adversarial detection methods with any of the four MUDA methods, which verifies that our framework possesses excellent compatibility with existing MUDA methods. In other words, the first exploration of MUDA in adversarial example detection has proven successful, and Principal Adversarial Domain Adaptation can effectively transfer the knowledge from the PADs to the unseen target domain. Besides, since the average result of MFSAN outperforms those of the other MUDA methods, we select MFSAN as the basic MUDA component in our PADA.
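To illustrate how a MUDA method plugs into our framework, the schematic training step below follows the spirit of MFSAN: the adversarial feature enhancement is shared, while each PAD (source) owns its own classifier and alignment term. The modules, the `mmd` estimator and the exact losses are simplified placeholders, so this is a sketch of the interface rather than any specific method's implementation; swapping the MUDA component amounts to replacing the alignment and discrepancy terms while keeping the feature extractor and the PADs unchanged.

```python
import torch.nn.functional as F

def muda_step(feat_extractor, classifiers, source_batches, target_x, mmd):
    """One schematic optimization step over the PADs (sources) and the unlabeled target."""
    cls_loss, align_loss = 0.0, 0.0
    target_feat = feat_extractor(target_x)
    target_preds = []
    for (x_s, y_s), clf in zip(source_batches, classifiers):
        feat_s = feat_extractor(x_s)
        cls_loss += F.cross_entropy(clf(feat_s), y_s)    # supervised loss on each PAD
        align_loss += mmd(feat_s, target_feat)           # per-source domain alignment (L_d)
        target_preds.append(F.softmax(clf(target_feat), dim=1))
    # discrepancy term (L_disc): domain-specific classifiers should agree on the target
    disc_loss = sum((p - q).abs().mean()
                    for i, p in enumerate(target_preds)
                    for q in target_preds[i + 1:])
    return cls_loss, align_loss, disc_loss               # weighted as discussed in Section 4.7
```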

Table 13. Detection performance of our framework with different MUDA methods.
Dataset Backbone Averaged Accuracy of different MUDA methods (%)
M3DA (moment-mda, ) DARN (DARN, ) MDMN (mdmn, ) MFSAN (Ours) (aaai-mda, )
CIFAR-10 ResNet-18 94.976 93.948 93.832 94.694
CIFAR-10 VGG-16 87.819 87.624 91.463 88.864
SVHN ResNet-18 93.892 93.934 93.888 94.344
SVHN VGG-16 90.960 92.244 89.826 91.219
Average 91.912 91.938 92.252 92.280

4.7. Parameter sensitivity

We utilize MFSAN as the basic MUDA method in the PADA stage of our framework, where the hyper-parameters λ and γ control the importance of L_d and L_disc, respectively, and are both set to 1.0 by default. To study the parameter sensitivity, we sample the values of λ and γ from {0.5, 1.0, 2.0} and perform the experiments on CIFAR-10 with ResNet-18.
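For reference, the overall objective optimized in this setting can be written as the weighted combination below; this is a reconstruction implied by the description above, with L_cls denoting the classification loss on the PADs, and the precise form of each term follows the adopted MUDA method, MFSAN.

L_total = L_cls + λ · L_d + γ · L_disc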

Table 14. Sensitivity of the parameters λ and γ in the PADA stage.
λ γ Averaged Accuracy (%)
λ=0.5 γ=0.5 94.503
λ=0.5 γ=1.0 94.591
λ=0.5 γ=2.0 94.355
λ=1.0 γ=0.5 94.646
λ=1.0 γ=1.0 94.694
λ=1.0 γ=2.0 94.603
λ=2.0 γ=0.5 94.598
λ=2.0 γ=1.0 94.629
λ=2.0 γ=2.0 94.223

Table 14 indicates that our proposed method maintains consistently high performance under various parameter settings when MFSAN is employed as the basic MUDA method. The best averaged accuracy is 94.694% when λ = γ = 1.0. Variations of the parameters induce only minor fluctuations in the detection performance: the mean and the standard deviation of all the averaged accuracies are 94.538% and 0.154%, respectively. This indicates that our PADA framework is effective and robust, since it is insensitive to the parameter choices and consistently offers high detection accuracy.

4.8. The computational costs

Table 15. Comparison of the time and hardware costs. Batch size for each method is 1000.
Cost Detector
LID (lid, ) MD (md, ) Steg (steg, ) SID (SID, ) SA (txt_advdetection, ) Ours
Time (ms/image) 8.13 18.05 23.85 2.16 1.21 0.60
Hardware (MB) 7458 7248 1614 1592 8564 6124

Table 15 presents the time and hardware costs of deploying the state-of-the-art adversarial example detection methods and our AED-PADA. The results indicate that, under the same settings, our method incurs the lowest time cost and a moderate hardware (GPU memory) cost. All experiments are conducted on an NVIDIA GeForce RTX 3080Ti GPU with CIFAR-10 and ResNet-18.
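Latency and memory numbers of this kind can be obtained with a generic PyTorch benchmarking routine such as the sketch below (batch size 1000, as in Table 15); this illustrates the measurement procedure rather than our exact profiling script.

```python
import torch

@torch.no_grad()
def benchmark(detector, batch, warmup=10, iters=50):
    """Return the per-image latency (ms) and the peak GPU memory (MB) of a detector."""
    detector.eval()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):                 # warm-up iterations are excluded from timing
        detector(batch)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        detector(batch)
    end.record()
    torch.cuda.synchronize()
    ms_per_image = start.elapsed_time(end) / (iters * batch.size(0))
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    return ms_per_image, peak_mb
```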

Furthermore, our AED-PADA is well suited for deployment in real-world scenarios. Such scenarios demand that an adversarial example detection method not only performs real-time detection but also maintains robust performance in environments where the adversarial attacks, datasets, and backbones encountered during testing are entirely unseen. Firstly, our detector is trained only once and can be deployed without retraining for new unseen attacks, and it achieves the shortest inference time, i.e., the best real-time detection performance. Secondly, we train on earlier attacks and test on more recent attacks. Table 3 demonstrates that, under this real-world-aligned setting, our AED-PADA exhibits superior generalization performance and is expected to maintain considerable detection capability against future unseen attacks. Lastly, given its strong performance across various backbones and datasets in Table 6, our proposed method also performs better in environments with unseen datasets and backbones. Consequently, our AED-PADA is well suited for practical applications in complex real-world environments.

5. Conclusion

In this paper, we proposed a novel and effective adversarial example detection method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA), to improve the generalization ability of adversarial detection. Specifically, AED-PADA contains two stages, i.e., Principal Adversarial Domains Identification (PADI) and Principal Adversarial Domain Adaptation (PADA). In PADI, we acquired ADs from scratch and constructed PADs as the source domains for PADA. In PADA, we proposed an adversarial feature enhancement based Multi-source Unsupervised Domain Adaptation framework, which is compatible with various existing MUDA methods, to effectively leverage the PADs for adversarial example detection. Experimental results demonstrated the superiority of our work compared to the state-of-the-art detection methods.

Acknowledgements.
This work was supported in part by the National Natural Science Foundation of China under Grant 62272020, U20B2069 and 62176253, in part by the State Key Laboratory of Complex & Critical Software Environment under Grant SKLSDE2023ZX-16, and in part by the Fundamental Research Funds for the Central Universities.

References

  • [1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • [2] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
  • [3] Mengte Shi, Sheng Li, Zhaoxia Yin, Xinpeng Zhang, and Zhenxing Qian. On generating jpeg adversarial images. In International Conference on Multimedia and Expo, pages 1–6, 2021.
  • [4] Cheng Luo, Qinliang Lin, Weicheng Xie, Bizhu Wu, Jinheng Xie, and Linlin Shen. Frequency-driven imperceptible adversarial attack on semantic similarity. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15294–15303, 2022.
  • [5] Chiu Wai Yan, Tsz-Him Cheung, and Dit-Yan Yeung. ILA-DA: improving transferability of intermediate level attack with data augmentation. In International Conference on Learning Representations, 2023.
  • [6] Yikun Xu, Xingxing Wei, Pengwen Dai, and Xiaochun Cao. A2sc: Adversarial attacks on subspace clustering. ACM Transactions on Multimedia Computing, Communications and Applications, 19(6):1–23, 2023.
  • [7] Prasanth Buddareddygari, Travis Zhang, Yezhou Yang, and Yi Ren. Targeted attack on deep rl-based autonomous driving with learned visual patterns. In International Conference on Robotics and Automation, pages 10571–10577, 2022.
  • [8] Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James Bailey, and Feng Lu. Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recognition, 110:107332, 2021.
  • [9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • [10] Pengcheng Li, Jinfeng Yi, Bowen Zhou, and Lijun Zhang. Improving the robustness of deep neural networks via adversarial training with triplet loss. In International Joint Conference on Artificial Intelligence, pages 2909–2915, 2019.
  • [11] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. Adversarial examples improve image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 816–825, 2020.
  • [12] Shangxi Wu, Jitao Sang, Kaiyan Xu, Guanhua Zheng, and Changsheng Xu. Adaptive adversarial logits pairing. ACM Transactions on Multimedia Computing, Communications and Applications, 20(2):56:1–56:16, 2024.
  • [13] Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018.
  • [14] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
  • [15] Jiayang Liu, Weiming Zhang, Yiwei Zhang, Dongdong Hou, Yujia Liu, Hongyue Zha, and Nenghai Yu. Detection based defense against adversarial examples from the steganalysis point of view. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4825–4834, 2019.
  • [16] Jinyu Tian, Jiantao Zhou, Yuanman Li, and Jia Duan. Detecting adversarial examples from sensitivity inconsistency of spatial-transform domain. In AAAI Conference on Artificial Intelligence, pages 9877–9885, 2021.
  • [17] Yulong Wang, Tianxiang Li, Shenghong Li, Xin Yuan, and Wei Ni. New adversarial image detection based on sentiment analysis. IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2023.
  • [18] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations Workshop, 2017.
  • [19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • [20] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
  • [21] Leo Schwinn, René Raab, An Nguyen, Dario Zanca, and Bjoern Eskofier. Exploring misclassifications of robust neural networks to enhance adversarial attacks. Applied Intelligence, pages 1–17, 2023.
  • [22] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
  • [23] Dongyu Meng and Hao Chen. Magnet: A two-pronged defense against adversarial examples. In ACM SIGSAC Conference on Computer and Communications Security, pages 135–147, 2017.
  • [24] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging Generative Models to Understand and Defend Against Adversarial Examples. In International Conference on Learning Representations, 2018.
  • [25] Jonathan Aigrain and Marcin Detyniecki. Detecting adversarial examples and other misclassifications in neural networks by introspection. In International Conference on Learning Representations, 2019.
  • [26] Philip Sperl, Ching-Yu Kao, Peng Chen, and Konstantin Böttinger. Dla: Dense-layer-analysis for adversarial example detection. In IEEE European Symposium on Security and Privacy, pages 198–215, 2019.
  • [27] Fei Zuo and Qiang Zeng. Exploiting the sensitivity of L2 adversarial examples to erase-and-restore. In ACM Asia Conference on Computer and Communications Security, pages 40–51, 2021.
  • [28] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (Statistical) Detection of Adversarial Examples. arXiv preprint arXiv:1702.06280, 2017.
  • [29] Xin Li and Fuxin Li. Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics. In IEEE/CVF International Conference on Computer Vision, pages 5775–5783, 2017.
  • [30] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting Adversarial Samples from Artifacts. arXiv preprint arXiv:1703.00410, 2017.
  • [31] Liang Zhao, Zhikui Chen, Laurence T. Yang, M. Jamal Deen, and Z. Jane Wang. Deep semantic mapping for heterogeneous multimedia transfer learning using co-occurrence data. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1s):9:1–9:21, 2019.
  • [32] Yang Yang, Yi Yang, and Heng Tao Shen. Effective transfer tagging from image to video. ACM Transactions on Multimedia Computing, Communications, and Applications, 9(2):14:1–14:20, 2013.
  • [33] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In International Conference on Machine Learning, volume 27, pages 17–36, 2012.
  • [34] Feng Liu, Guangquan Zhang, and Jie Lu. Heterogeneous domain adaptation: An unsupervised approach. IEEE Transactions on Neural Networks and Learning Systems, 31(12):5588–5602, 2020.
  • [35] Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9471–9480, 2019.
  • [36] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801–814, 2019.
  • [37] Minghao Chen, Shuai Zhao, Haifeng Liu, and Deng Cai. Adversarial-learned loss for domain adaptation. In AAAI Conference on Artificial Intelligence, pages 3521–3528, 2020.
  • [38] Yongchun Zhu, Fuzhen Zhuang, and Deqing Wang. Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources. In AAAI Conference on Artificial Intelligence, pages 5989–5996, 2019.
  • [39] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  • [40] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3964–3973, 2018.
  • [41] Junfeng Wen, Russell Greiner, and Dale Schuurmans. Domain aggregation networks for multi-source domain adaptation. In International Conference on Machine Learning, pages 10214–10224, 2020.
  • [42] Geon Yeong Park and Sang Wan Lee. Information-theoretic regularization for multi-source domain adaptation. In IEEE/CVF International Conference on Computer Vision, pages 9194–9203, 2021.
  • [43] Yitong Li, Michael Murias, Geraldine Dawson, and David E. Carlson. Extracting relationships by multi-domain matching. In Advances in Neural Information Processing Systems, 2018.
  • [44] Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. Multi-source distilling domain adaptation. In AAAI Conference on Artificial Intelligence, volume 34, pages 12975–12983, 2020.
  • [45] Baoyao Yang and Pong C Yuen. Cross-domain visual representations via unsupervised graph alignment. In AAAI Conference on Artificial Intelligence, volume 33, pages 5613–5620, 2019.
  • [46] Hang Wang, Minghao Xu, Bingbing Ni, and Wenjun Zhang. Learning to combine: Knowledge aggregation for multi-source domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), volume 12353, pages 727–744, 2020.
  • [47] Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, pages 18661–18673, 2020.
  • [48] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018.
  • [49] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan L. Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018.
  • [50] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
  • [51] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2001.
  • [52] Tim van Erven and Peter Harremoës. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • [53] Supreet Kaur Mann and Sonal Chawla. A proposed hybrid clustering algorithm using k-means and birch for cluster based cab recommender system (cbcrs). International Journal of Information Technology, 15(1):219–227, 2023.
  • [54] Solomon Kullback and Richard A Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
  • [55] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13:723–773, 2012.
  • [56] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
  • [57] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P. Xing. High-frequency component helps explain the generalization of convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8681–8691, 2020.
  • [58] Lijie Fan, Sijia Liu, Pin-Yu Chen, Gaoyuan Zhang, and Chuang Gan. When does contrastive learning preserve adversarial robustness from pretraining to finetuning? In Advances in Neural Information Processing Systems, pages 21480–21492, 2021.
  • [59] Jessica Fridrich and Jan Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, 2012.
  • [60] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. Learning rich features for image manipulation detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [62] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [63] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [64] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [65] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [66] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A Simple and Accurate Method to Fool Deep Neural Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
  • [67] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9185–9193, 2018.
  • [68] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L. Yuille. Improving transferability of adversarial examples with input diversity. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2019.
  • [69] Qian Huang, Isay Katsman, Zeqi Gu, Horace He, Serge J. Belongie, and Ser-Nam Lim. Enhancing adversarial example transferability with an intermediate level attack. In IEEE/CVF International Conference on Computer Vision, pages 4732–4741, 2019.
  • [70] Qizhang Li, Yiwen Guo, and Hao Chen. Yet another intermediate-level attack. In Proceedings of the European Conference on Computer Vision (ECCV), pages 241–257, 2020.
  • [71] Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. Nesterov accelerated gradient and scale invariance for adversarial attacks. In International Conference on Learning Representations, 2020.
  • [72] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216, 2020.
  • [73] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4312–4321, 2019.
  • [74] Xiaosen Wang and Kun He. Enhancing the transferability of adversarial attacks through variance tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1924–1933, 2021.