
Removing Undesirable Feature Contributions Using Out-of-Distribution Data

Saehyung Lee, Changhwa Park, Hyungyu Lee, Jihun Yi, Jonghyun Lee, Sungroh Yoon
Electrical and Computer Engineering, AIIS, ASRI, INMC, and Institute of Engineering Research
Seoul National University
Seoul 08826, South Korea
{halo8218,omega6464,rucy74,t080205,leejh9611,sryoon}@snu.ac.kr
Correspondence to: Sungroh Yoon [email protected].
Abstract

Several data augmentation methods deploy unlabeled-in-distribution (UID) data to bridge the gap between the training and inference of neural networks. However, these methods have clear limitations in terms of the availability of UID data and the dependence of the algorithms on pseudo-labels. Herein, we propose a data augmentation method that improves generalization in both adversarial and standard learning by using out-of-distribution (OOD) data, an approach that avoids the abovementioned issues. We theoretically show how OOD data can improve generalization in each learning scenario and complement our analysis with experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet. The results indicate that undesirable features are shared even among image data that seem to have little correlation from a human point of view. We also present the advantages of the proposed method through comparison with other data augmentation methods, which can be used in the absence of UID data. Furthermore, we demonstrate that the proposed method can further improve the existing state-of-the-art adversarial training.

1 Introduction

The power of the enormous amount of data suggested by the empirical risk minimization (ERM) principle (Vapnik & Vapnik, 1998) has allowed deep neural networks (DNNs) to perform outstandingly on many tasks, including computer vision (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). However, most of the practical problems encountered by DNNs have high-dimensional input spaces, and nontrivial generalization errors arise owing to the curse of dimensionality (Bellman, 1961). Moreover, neural networks have been found to be easily deceived by adversarial perturbations with a high degree of confidence (Szegedy et al., 2013). Several studies (Goodfellow et al., 2014; Krizhevsky et al., 2012) have been conducted to address these generalization problems resulting from ERM, and most of them handle the problems by extending the training distribution (Madry et al., 2017; Lee et al., 2020). Nevertheless, it has been demonstrated that more data are needed to achieve better generalization (Schmidt et al., 2018). Recent methods (Carmon et al., 2019; Xie et al., 2019) introduced unlabeled-in-distribution (UID) data to compensate for the lack of training samples. However, these methods have two limitations: first, obtaining suitable UID data for the selected classes is challenging; second, when applying supervised learning methods to pseudo-labeled data, the effect of data augmentation depends heavily on the accuracy of the pseudo-label generator.

In our study, to overcome the limitations outlined above, we propose an approach that promotes robust and standard generalization using out-of-distribution (OOD) data. In particular, motivated by previous studies demonstrating the existence of a common adversarial space among different images and even datasets (Naseer et al., 2019; Poursaeed et al., 2018), we show that OOD data can be leveraged for adversarial learning. Likewise, if the OOD data share the same undesirable features as the in-distribution data in terms of standard generalization, they can be leveraged for standard learning. By definition, in this work, the classes of the OOD data differ from those of the in-distribution data, and our method does not use the label information of the OOD data. Therefore, the proposed method is free from the previously mentioned problems caused by UID data. We present a theoretical model that demonstrates how to improve generalization using OOD data in both adversarial and standard learning. In this model, we separate desirable and undesirable features and show how training on OOD data that share undesirable features with the in-distribution data changes the weight values of the classifier. Based on this analysis, we introduce out-of-distribution data augmented training (OAT), which assigns a uniform distribution label to all the OOD data samples to remove the influence of undesirable features in adversarial and standard learning. In the proposed method, each batch is composed of training data and OOD data, and the OOD data regularize the training so that only features strongly correlated with the class labels are learned. We complement our theoretical findings with experiments on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and a subset of ImageNet (Deng et al., 2009). In addition, we present empirical evidence for the transferability of undesirable features through further studies on various datasets, including Simpson Characters (Attia, 2018), Fashion Product Images (Aggarwal, 2018), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2017), and VisDA-17 (Peng et al., 2017).

Contributions

(i) We propose a simple method, out-of-distribution data augmented training (OAT), to leverage OOD data for adversarial and standard learning, and our theoretical analyses demonstrate how the proposed method can improve robust and standard generalization. (ii) The results of experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet suggest that OAT can help reduce the generalization gap in adversarial and standard learning. (iii) By applying OAT using various OOD datasets, we show that undesirable features are shared among diverse image datasets. We also demonstrate that OAT can effectively extend the training distribution by comparing it with other data augmentation methods that can be employed in the absence of UID data. (iv) The state-of-the-art adversarial training method using UID data is found to improve further when combined with the proposed method of leveraging OOD data.

2 Background

Undesirable features in adversarial learning

Tsipras et al. (2018) demonstrated the existence of a trade-off between standard accuracy and adversarial robustness based on the distinction between robust and non-robust features. They showed that adversarial robustness can be incompatible with standard accuracy by constructing a binary classification task in which the data consist of input-label pairs $(\bm{x}, y) \in \mathbb{R}^{d+1} \times \{\pm 1\}$ sampled from the following distribution:

$$y \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad x_1 = \begin{cases} +y & \text{w.p. } p \\ -y & \text{w.p. } 1-p \end{cases}, \quad x_2,\dots,x_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\epsilon y, 1). \qquad (1)$$

Here, $x_1$ is a robust feature that is strongly correlated with the label, and the other features $x_2,\dots,x_{d+1}$ are non-robust features that are weakly correlated with the label. $\epsilon$ is small but sufficiently large that a simple classifier attains high standard accuracy, and $p \geq 0.5$. To characterize adversarial robustness, the expected standard loss $\beta_s$ and the expected adversarial loss $\beta_a$ for a data distribution $D$ are defined as follows:

$$\beta_s = \mathbb{E}_{(\bm{x},y)\sim D}\left[\mathcal{L}(\bm{x},y;\theta)\right], \quad \beta_a = \mathbb{E}_{(\bm{x},y)\sim D}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{x}+\bm{\delta},y;\theta)\right]. \qquad (2)$$

Here, $\mathcal{L}(\cdot;\theta)$ is the loss function of the model, and $S$ is the set of perturbations that the adversary can apply to deceive the model. For Equation (1), Tsipras et al. (2018) showed that the following classifier yields a small expected standard loss:

$$f_{\text{avg}}(\bm{x}) = \text{sign}(\bm{w}_{\text{unif}}^{\top}\bm{x}), \quad \text{where} \; \bm{w}_{\text{unif}} = \left[0, \frac{1}{d}, \dots, \frac{1}{d}\right]. \qquad (3)$$

They also proved that the classifier is vulnerable to adversarial perturbations, and that adversarial training results in a classifier that assigns zero weight values to non-robust features.
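To make this concrete, here is a small numerical sketch (ours, not from the paper; the sample size, dimension, and $\epsilon = 2/\sqrt{d}$ are illustrative choices) showing that the classifier in Equation (3) attains high standard accuracy on the distribution in Equation (1) yet collapses under an $\ell_\infty$ perturbation of budget $2\epsilon$ applied to the non-robust coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100_000, 100, 0.95
eps = 2.0 / np.sqrt(d)                                 # illustrative choice of epsilon

y = rng.choice([-1.0, 1.0], size=n)                    # labels, u.a.r. in {-1, +1}
x1 = np.where(rng.random(n) < p, y, -y)                # robust feature x_1
x_rest = rng.normal(eps * y[:, None], 1.0, (n, d))     # non-robust features x_2..x_{d+1}
X = np.column_stack([x1, x_rest])

w_unif = np.concatenate([[0.0], np.full(d, 1.0 / d)])  # classifier of Equation (3)
clean_acc = np.mean(np.sign(X @ w_unif) == y)

X_adv = X.copy()
X_adv[:, 1:] -= 2 * eps * y[:, None]                   # worst-case shift within |delta| <= 2*eps
adv_acc = np.mean(np.sign(X_adv @ w_unif) == y)
print(f"standard accuracy ~{clean_acc:.3f}, adversarial accuracy ~{adv_acc:.3f}")
```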

Transferability of adversarial perturbations

Naseer et al. (2019) produced domain-agnostic adversarial perturbations, thereby showing a common adversarial space among different datasets. They showed that an adversarial function trained on Paintings, Cartoons, or Medical data can deceive a classifier on ImageNet data with a high success rate. These findings indicate that even datasets from considerably different domains share non-robust features. Therefore, a method for supplementing the data needed for adversarial training is presented herein.

Undesirable features in standard learning

Wang et al. (2020) noted that convolutional neural networks (CNNs) can capture high-frequency components in images that are almost imperceptible to humans. This ability is thought to be closely related to the generalization behaviors of CNNs, especially their capacity to memorize random labels. Several studies (Geirhos et al., 2018; Bahng et al., 2019) reported that CNNs are biased towards local image features and that generalization performance can be improved by regularizing this bias. In this context, we propose a method of regularizing undesirable feature contributions using OOD data, assuming that undesirable features arise from the bias of CNNs or insufficient training data and are widely distributed in the input space.

3 Methods

3.1 Theoretical motivation

In this section, we theoretically analyze how OOD data can compensate for insufficient training samples in adversarial training, based on the dichotomy between robust and non-robust features. The theoretical motivation for using OOD data to reduce the contribution of undesirable features in standard learning can be found in Appendix B.

Setup and overview

We refer to in-distribution data as target data. Given a target dataset $\{(\tilde{\bm{x}}^i, y^i)\}_{i=1}^n \subset \mathcal{X} \times \{\pm 1\}$ sampled from a data distribution $\tilde{D}$, where $\mathcal{X}$ is the input space, we suppose that a feature extractor $\Phi: \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}^d$ and a linear classification model are trained on $\{(\tilde{\bm{x}}^i, y^i)\}_{i=1}^n$ and the target feature-label pairs $\{(\Phi(\tilde{\bm{x}}^i), y^i)\}_{i=1}^n$, respectively, so that the classification model attains a small expected standard loss. We then define an OOD dataset $\{\hat{\bm{x}}^i\}_{i=1}^m \subset \mathcal{X}$ sampled from a data distribution $\hat{D}$ that has the same distribution of non-robust features as $\tilde{D}$, following the preceding studies (Naseer et al., 2019; Moosavi-Dezfooli et al., 2017). After fixing $\Phi$ to facilitate the theoretical analysis of this framework, we demonstrate how adversarial training on the OOD dataset affects the weight values of our classifier.

Our data model

The feature extractor $\Phi$ can be considered to consist of several feature extractors $\phi: \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}$. Hence, we can set the distributions of the target feature-label pair $(\Phi(\tilde{\bm{x}}), y) = (\tilde{\bm{z}}, y) \in \mathbb{R}^{d}\times\{\pm 1\}$ and the OOD feature vector $\Phi(\hat{\bm{x}}) = \hat{\bm{z}} \in \mathbb{R}^{d}$ as follows:

$$\begin{gathered} y \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad \tilde{z}_1 \sim \mathcal{N}(y, u^2), \quad \tilde{z}_2,\dots,\tilde{z}_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\eta y, 1), \\ q \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad \hat{z}_1 \sim \mathcal{N}(0, v^2), \quad \hat{z}_2,\dots,\hat{z}_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\eta q, 1), \end{gathered} \qquad (4)$$

where u.a.r. stands for uniformly at random. From here on, we deal only with the OOD data, so the accents (tilde and caret) that distinguish the target data from the OOD data are omitted. In Equation (4), the feature $z_1 = \phi_1(\bm{x})$ is the output of a robust feature extractor $\phi_1$, and the other features $z_2,\dots,z_{d+1}$ are the outputs of non-robust feature extractors $\phi_2,\dots,\phi_{d+1}$. Since the OOD input vectors do not have the same robust features as the target input vectors, $z_1$ has zero mean and a small variance. Furthermore, because the OOD data have the same distribution of non-robust features as the target data, $z_2,\dots,z_{d+1}$ have a non-zero mean and a larger variance than the robust feature. In addition, $q$ represents the unknown label associated with the non-robust features, and $\eta$ is a non-negative constant that represents the degree of correlation between the non-robust features and the unknown label. Note that the input space of our classifier is the output space of $\Phi$ in our data model. Therefore, $\eta$ is not limited to a small value even in the context of an $\ell_p$-bounded adversary. Rather, the high degree of confidence that DNNs show for adversarial examples (Goodfellow et al., 2014) suggests that $\eta$ is large.

Our linear classification model

According to Section 2, we know that our linear classification model (logistic regression), defined as follows, yields a low expected standard loss while demonstrating high adversarial vulnerability.

$$p(y=+1 \mid \bm{z}) = \sigma(\bm{w}^{\top}\bm{z}), \quad p(y=-1 \mid \bm{z}) = 1 - \sigma(\bm{w}^{\top}\bm{z}), \quad \text{where} \; \bm{w} = \left[0, \frac{1}{d}, \dots, \frac{1}{d}\right]. \qquad (5)$$

To observe the effect of adversarial training on the OOD dataset, we train our classifier by applying stochastic gradient descent to the cross-entropy loss function $\mathcal{L}(\cdot;\bm{w})$.
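For reference, a small numpy sketch (ours) of the per-example gradient of this cross-entropy loss, $\partial\mathcal{L}(\bm{z},t;\bm{w})/\partial\bm{w} = \bm{z}\,(\sigma(\bm{w}^{\top}\bm{z}) - t)$, which, with $t = 0.5$, is the form used in the Appendix A proofs:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ce_grad_w(z, t, w):
    """Gradient of the binary cross-entropy loss L(z, t; w) of the logistic model
    in Equation (5) with respect to w: dL/dw = z * (sigma(w^T z) - t)."""
    return z * (sigmoid(w @ z) - t)
```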

First, we construct the adversarial feature vector $\bar{\bm{z}} = \Phi(\bm{x}+\bm{\delta})$, $\bm{\delta}\in S$, against our classifier for adversarial training.

Theorem 1.

Let $t \in [0,1]$ be the given target value of the feature vector $\bm{z}$ in our classification model, and let $\lambda$ be a non-negative constant. Then, when $t = 0.5$, the expectation of the adversarial feature vector is

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1\right], \quad \mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q, \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (6)$$

(All proofs of the theorems in this paper can be found in Appendix A.) Here, we assume an $\ell_\infty$-bounded adversary. Theorem 1 shows that the adversary pushes the non-robust features farther in the direction of the unknown label $q$, which coincides with our intuition. When the given target value $t$ is $0.5$, the adversary drives our classification model output towards zero or one to yield a large loss.

Our classification model is trained on the adversarial features shown in Theorem 1.

Theorem 2.

When $t = 0.5$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx 0, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (7)$$

Thus, adversarial training with $t = 0.5$ on the OOD dataset drives the weight values corresponding to the non-robust features toward zero while leaving $w_1$ unaffected by the gradient update. This shows that we can reduce the impact of non-robust features using the OOD dataset. However, we should not only reduce the influence of the non-robust features but also improve classification accuracy using the robust feature, which can be achieved through adversarial training on the target dataset. Accordingly, we show the effect of adversarial training on the OOD dataset when $w_1 > 0$ in our example.

Theorem 3.

When $t = 0.5$ and $w_1 > 0$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx \frac{1}{2}\lambda, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (8)$$

Theorem 3 shows that when $w_1 > 0$, adversarial training with $t = 0.5$ on the OOD dataset reduces the influence of all the features in $\mathcal{Z}$. However, the expected gradients for the weights associated with the non-robust features are always greater than the expected gradient for the weight associated with the robust feature. In addition, the greater the value of $\eta$, the faster the associated weight value converges to zero, which means that the contribution of highly influential non-robust features decreases rapidly. In the case of multiclass classification, $t = 0.5$ corresponds to the uniform distribution label $t_{\text{unif}} = [\frac{1}{c},\dots,\frac{1}{c}]$, where $c$ is the number of classes. Intuitively, the pair $(\bm{x}, t_{\text{unif}})$ means that the input $\bm{x}$ lies on the decision boundary. To reduce the training loss for $(\bm{x}, t_{\text{unif}})$, the classifier will learn that the features of $\bm{x}$ do not contribute to a specific class, which can be understood as removing the contributions of the features of $\bm{x}$.

3.2 Out-of-distribution data augmented training

Based on our theoretical analysis, we introduce Out-of-distribution data Augmented Training (OAT). OAT is training on the union of the target dataset $\mathcal{D}_t$ and the OOD dataset $\mathcal{D}_o$. When applying the proposed method, two points must be considered: 1) temporary labels associated with the OOD data samples are required for supervised learning, and 2) the loss functions corresponding to $\mathcal{D}_t$ and $\mathcal{D}_o$ should be properly combined.

First, we assign the uniform distribution label $t_{\text{unif}}$ to all the OOD data samples, as justified by our theoretical analysis. This labeling method enables us to leverage OOD data for supervised learning at no extra cost. Moreover, it means that our method is completely free from the limitations of the methods using UID data (see Section 1).

Second, although OOD data can be used to improve the standard and robust generalization of neural networks, training on target data is essential to enhance the classification accuracy of neural networks. In addition, according to Theorem 3, adversarial training on pairs of OOD data samples and $t_{\text{unif}}$ affects the weight for robust features as well as that for non-robust features. Hence, the balance between the losses from $\mathcal{D}_t$ and $\mathcal{D}_o$ is important in OAT. For this reason, we introduce a hyperparameter $\alpha \in \mathbb{R}^{+}$ into our proposed method and train neural networks as follows:

$$\begin{gathered} \text{OAT-A:} \quad \min_{\theta} \mathbb{E}_{(\bm{x}_t, y) \in \mathcal{D}_t}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{x}_t+\bm{\delta}, y; \theta)\right] + \alpha \, \mathbb{E}_{\bm{x}_o \in \mathcal{D}_o}\left[\max_{\bm{\epsilon}\in S}\mathcal{L}(\bm{x}_o+\bm{\epsilon}, t_{\text{unif}}; \theta)\right], \\ \text{OAT-S:} \quad \min_{\theta} \mathbb{E}_{(\bm{x}_t, y) \in \mathcal{D}_t}\left[\mathcal{L}(\bm{x}_t, y; \theta)\right] + \alpha \, \mathbb{E}_{\bm{x}_o \in \mathcal{D}_o}\left[\mathcal{L}(\bm{x}_o, t_{\text{unif}}; \theta)\right]. \end{gathered} \qquad (9)$$

Here, OAT-A and OAT-S represent OOD data augmented adversarial and standard learning, respectively. The pseudo-code for the overall procedure of our method is presented in Algorithm 1.

Algorithm 1 Out-of-distribution data Augmented Training (OAT)
0:  Target dataset $\mathcal{D}_t$, OOD dataset $\mathcal{D}_o$, uniform distribution label $t_{\text{unif}}$, batch size $n$, training iterations $T$, learning rate $\tau$, hyperparameter $\alpha$, adversarial attack function $\mathcal{G}$
1:  for $t=1$ to $T$ do
2:     $(X_t, Y) = \text{SAMPLE}(\text{dataset}=\mathcal{D}_t, \text{size}=\frac{n}{2})$
3:     $X_o = \text{SAMPLE}(\text{dataset}=\mathcal{D}_o, \text{size}=\frac{n}{2})$
4:     if Adversarial Learning then
5:        $[\bar{X}_t, \bar{X}_o] \leftarrow \mathcal{G}([X_t, X_o], [Y, t_{\text{unif}}]; \bm{\theta})$
6:     else if Standard Learning then
7:        $[\bar{X}_t, \bar{X}_o] \leftarrow [X_t, X_o]$
8:     end if
9:     model update:
10:     $\bm{\theta} \leftarrow \bm{\theta} - \tau \cdot \nabla_{\theta}\,\text{AVERAGE}\!\left(\frac{1}{2}\mathcal{L}(\bar{X}_t, Y; \bm{\theta}) + \frac{\alpha}{2}\mathcal{L}(\bar{X}_o, t_{\text{unif}}; \bm{\theta})\right)$
11:  end for
12:  Output: trained model parameter $\bm{\theta}$
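As a concrete illustration, the following PyTorch-style sketch implements one update of Algorithm 1. This is our illustrative code, not the authors' released implementation; `attack_fn` stands in for the attack function $\mathcal{G}$ (e.g., a PGD attack whose loss accepts probabilistic targets), and with `attack_fn=None` the step reduces to OAT-S.

```python
import torch
import torch.nn.functional as F

def oat_training_step(model, optimizer, x_t, y_t, x_o, alpha,
                      attack_fn=None, num_classes=10):
    """One OAT update following Algorithm 1 (illustrative sketch)."""
    y_t_soft = F.one_hot(y_t, num_classes).float()
    t_unif = torch.full((x_o.size(0), num_classes), 1.0 / num_classes,
                        device=x_o.device)
    if attack_fn is not None:                        # OAT-A branch (lines 4-5)
        x_t = attack_fn(model, x_t, y_t_soft)
        x_o = attack_fn(model, x_o, t_unif)

    # soft-label cross-entropy: -sum_c target_c * log p_c
    loss_t = -(y_t_soft * F.log_softmax(model(x_t), dim=1)).sum(dim=1).mean()
    loss_o = -(t_unif * F.log_softmax(model(x_o), dim=1)).sum(dim=1).mean()

    loss = 0.5 * loss_t + 0.5 * alpha * loss_o       # line 10 of Algorithm 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```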

4 Related studies

Adversarial examples

Many adversarial attack methods have been proposed, including the projected gradient descent (PGD) and Carlini & Wagner (CW) attacks (Madry et al., 2017; Carlini & Wagner, 2017). The PGD attack iterates the fast gradient sign method (FGSM) (Goodfellow et al., 2014) to find worst-case examples that maximize the training loss. The CW attack finds adversarial examples using the CW loss instead of the cross-entropy loss. Recently, Croce & Hein (2020) proposed AutoAttack (AA), a powerful ensemble attack comprising two extensions of the PGD attack and two existing attacks (Croce & Hein, 2019; Andriushchenko et al., 2019). To defend against these adversarial attacks, various adversarial defense methods have been developed (Goodfellow et al., 2014; Kannan et al., 2018). Madry et al. (2017) introduced adversarial training, which uses adversarial examples as training data, and Zhang et al. (2019) proposed TRADES, which optimizes a surrogate loss that is a sum of the natural error and the boundary error.
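For illustration, a minimal $\ell_\infty$ PGD sketch (our code; the $8/255$ budget, step size, and iteration count are common settings rather than the exact configurations of the cited papers):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=20):
    """Iterated FGSM steps with projection onto the l_inf ball of radius eps
    around x (inputs assumed to lie in [0, 1])."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv.detach()
```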

OOD detection

Lee et al. (2017) and Hendrycks et al. (2018) dealt with the overconfidence problem of confidence score-based OOD detectors. They used uniform distribution labels, as in our method, to resolve the overconfidence issue. In particular, Augustin et al. (2020) addressed the overconfidence problem in adversarial settings and proposed a method that is practically identical to OAT-A by combining adversarial training with ACET (Hein et al., 2019). They did not, however, address the generalization problems of neural networks. Specifically, our theoretical results explain the classification performance improvements that were considered only secondary effects in the abovementioned studies. Further related work can be found in Appendix C.

5 Experimental results and discussion

5.1 Experimental setup

OOD datasets

We created OOD datasets from the 80 Million Tiny Images dataset (Torralba et al., 2008) (80M-TI), following the work of Carmon et al. (2019), for CIFAR-10 and CIFAR-100, respectively. In addition, we resized ImageNet (using bilinear interpolation) to dimensions of $64\times 64$ and $160\times 160$ and divided it into datasets containing 10 and 990 classes, called ImgNet10 and ImgNet990, respectively. Furthermore, we resized Places365 and VisDA-17 for the experiments on ImgNet10 and cropped the Simpson Characters (Simpson) and Fashion Product (Fashion) datasets to dimensions of $32\times 32$ for the experiments on CIFAR-10 and CIFAR-100. Details on sourcing the OOD datasets can be found in Appendix D.
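As an example, preprocessing along these lines can be expressed with torchvision transforms (a sketch assuming a recent torchvision; the exact pipelines are specified in Appendix D):

```python
from torchvision import transforms

# Downsample ImageNet images with bilinear interpolation for the ImgNet10 /
# ImgNet990 splits, and resize-then-crop auxiliary OOD images to 32x32 for the
# CIFAR experiments (illustrative settings, not the exact Appendix D pipeline).
imgnet64 = transforms.Compose([
    transforms.Resize((64, 64),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])
ood_for_cifar = transforms.Compose([
    transforms.Resize(32),
    transforms.CenterCrop(32),
    transforms.ToTensor(),
])
```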

Implementation details

Implementation details, including the hyperparameter $\alpha$, architectures, batch sizes, and training iterations, are summarized in Appendix E. All models compared on the same target dataset are trained with the same training batch size and number of iterations to ensure a fair comparison. In other words, OAT uses a batch of $\frac{n}{2}$ target samples and $\frac{n}{2}$ OOD samples, which is compared with a model normally trained with a batch size of $n$. To evaluate the adversarial robustness of the models in our experiments, we apply several adversarial attacks including PGD, CW, and AA. Note that we denote PGD and CW attacks with $T$ iterative steps as PGD$T$ and CW$T$, respectively, and the original test set as Clean. We compare the following models in our experiments (our code is available at https://github.com/Saehyung-Lee/OAT):

  1. Standard: The model normally trained on the target dataset.

  2. PGD: The model trained using PGD-based adversarial training on the target dataset.

  3. TRADES: The model trained using TRADES on the target dataset.

  4. OAT$_{\text{PGD}}$: The model adversarially trained with OAT based on PGD.

  5. OAT$_{\text{TRADES}}$: The model adversarially trained with OAT based on TRADES.

  6. OAT$_{\mathcal{D}_o}$: The model normally trained with OAT using the OOD dataset $\mathcal{D}_o$.
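As noted in the implementation details above, each OAT training batch holds $\frac{n}{2}$ target samples and $\frac{n}{2}$ OOD samples, while baselines use full batches of size $n$. A minimal sketch of this batch composition (ours, with random tensors standing in for the real datasets):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

n = 128  # total batch size; the baseline is trained with full batches of size n
target_set = TensorDataset(torch.rand(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
ood_set = TensorDataset(torch.rand(1024, 3, 32, 32))

target_loader = DataLoader(target_set, batch_size=n // 2, shuffle=True, drop_last=True)
ood_loader = DataLoader(ood_set, batch_size=n // 2, shuffle=True, drop_last=True)

for (x_t, y_t), (x_o,) in zip(target_loader, ood_loader):
    pass  # one OAT update per mixed batch, as in Algorithm 1
```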

5.2 Study on the effectiveness of OAT in adversarial learning

Table 1: Accuracy (%) comparison of the OAT model with Standard, PGD, and TRADES on CIFAR10, CIFAR100, and ImgNet10 (64×64) under different threat models. We show the improved results compared to the counterpart of each model in bold.

Model       | Target           | OOD       | Clean | PGD100 | CW100 | AA
Standard    | CIFAR10          | -         | 95.48 |  0.00  |  0.00 |  0.00
PGD         | CIFAR10          | -         | 87.48 | 49.92  | 50.80 | 48.29
PGD+CutMix  | CIFAR10          | -         | 89.35 | 53.39  | 52.35 | 49.05
TRADES      | CIFAR10          | -         | 85.24 | 55.69  | 54.04 | 52.83
OAT_PGD     | CIFAR10          | 80M-TI    | 86.63 | 56.77  | 52.38 | 49.98
OAT_TRADES  | CIFAR10          | 80M-TI    | 86.76 | 59.66  | 55.71 | 54.63
Standard    | CIFAR100         | -         | 78.57 |  0.02  |  0.00 |  0.00
PGD         | CIFAR100         | -         | 61.37 | 24.66  | 24.68 | 22.76
TRADES      | CIFAR100         | -         | 58.84 | 30.24  | 27.97 | 26.91
OAT_PGD     | CIFAR100         | 80M-TI    | 61.54 | 30.02  | 27.85 | 25.36
OAT_TRADES  | CIFAR100         | 80M-TI    | 63.07 | 34.23  | 29.02 | 27.83
Standard    | ImgNet10 (64×64) | -         | 86.03 |  0.11  |  0.06 |  0.00
PGD         | ImgNet10 (64×64) | -         | 82.80 | 48.77  | 48.86 | 48.34
OAT_PGD     | ImgNet10 (64×64) | ImgNet990 | 81.91 | 59.03  | 54.69 | 53.83
Table 2: Comparison of OAT using various OOD datasets for improving robust generalization on CIFAR10 and ImgNet10 (64×64). None represents the baseline model (PGD).

CIFAR10:
OOD    | None  | SVHN  | Simpson | Fashion
Clean  | 87.48 | 86.16 | 86.79   | 85.84
PGD20  | 50.41 | 53.70 | 53.88   | 53.27
CW20   | 51.11 | 52.21 | 52.15   | 51.70

ImgNet10 (64×64):
OOD    | None  | Places365 | VisDA17
Clean  | 82.80 | 82.37     | 82.46
PGD20  | 49.00 | 59.86     | 55.34
CW20   | 48.91 | 56.23     | 53.80

Evaluating adversarial robustness

The improvements in the robust generalization of the PGD and TRADES models through the application of OAT are evaluated against PGD100, CW100, and AA with $\ell_\infty$-bounds of $\frac{8}{255}$ and $0.031$. The results are summarized in Table 1 and indicate that OAT improves the robust generalization of all adversarial training methods tested, regardless of the target dataset. In particular, the results against AA show that the effectiveness of OAT does not rely on obfuscated gradients (Athalye et al., 2018), because AA removes the possibility of gradient masking through a combination of strong adaptive attacks (Croce & Hein, 2019; 2020) and a black-box attack (Andriushchenko et al., 2019).

However, while OAT brings a significant improvement in robustness against PGD attacks, it is relatively less effective against CW attacks. According to our theoretical analysis, these results imply that the OOD data contain relatively few of the non-robust features used by CW attacks. In other words, powerful targeted attacks such as CW attacks exploit gradient information more selectively than untargeted attacks such as PGD attacks. To gain insight into this phenomenon, we train models with various values of the hyperparameter $\alpha$. The results suggest that the greater the influence of OOD data on the training process, the higher the robustness against PGD attacks and the lower the robustness against CW attacks (see Appendix G for more details).

In the absence of UID data, various data augmentation methods other than OAT can be employed. CutMix (Yun et al., 2019), a method that amplifies training data by cutting and pasting patches between training images, was recently proposed and exhibited excellent performance in classification and transfer learning tasks. By applying CutMix to adversarial training, the advantages of OAT over other data augmentation methods are assessed. As deduced from Table 1, OAT brings a higher level of robustness than CutMix. Because adversarial examples are closely related to the high-frequency components of images (Wang et al., 2020) and adversarial training reduces the model's sensitivity to these components, CutMix is a relatively ineffective augmentation for adversarial training: although it extends the global feature distribution, it makes no significant difference to the local feature distribution. In contrast, OAT can effectively regularize the classifier by exposing it to the various undesirable features held by the additional data during the learning process.

OAT with diverse OOD datasets

Non-robust features are widely shared among different datasets, as demonstrated by applying OAT using various OOD datasets. Table 2 indicates that OAT improves robust generalization for all OOD datasets, including those that have little correlation with the target dataset from a human perspective. In addition, given that relatively simple datasets with no background (Fashion and VisDA-17) are less effective than the others, we conjecture that non-robust features arise from the high complexity (Snodgrass & Vanderwart, 1980) of images in natural image datasets such as CIFAR and ImageNet.

When UID data are available

In adversarial training, in-distribution data reduce the sensitivity of neural networks to non-robust features and provide robustly generalizable features. On the other hand, the proposed theory shows that OAT obtains the data amplification effect only for non-robust features using OOD data. Therefore, the effectiveness of OAT is expected to decrease as more in-distribution data are augmented. To observe this trend empirically, the effect of OAT is investigated as a function of the additional in-distribution data size using previously published pseudo-labeled data (Carmon et al., 2019).

Figure 1: (a) The effectiveness of OAT with increasing pseudo-labeled data size under PGD100 and (b) the results of the state-of-the-art model improved by OAT under PG$_{\text{TRADES}}$, PG$_{\text{Madry}}$, PG$_{\text{RST}}$ (Carmon et al., 2019), and AA on CIFAR-10.

Surprisingly, Figure 1(a) shows that OAT still improves robust generalization even when many pseudo-labeled data are used. OAT is also combined with RST (Carmon et al., 2019), which recently achieved state-of-the-art adversarial robustness using UID data; Figure 1(b) demonstrates that OAT can further improve this state-of-the-art adversarial training method. This is presumably because the additional data include noisily labeled samples, which induce memorization in the learning process and thus impair generalization performance (Zhang et al., 2016). OAT appears to achieve a higher level of robust generalization by effectively suppressing the effects of such noisy data. Additional details on the experiments illustrated in Figure 1 are given in Appendix H.

5.3 Study on the effectiveness of OAT in standard learning

Randomization test

The effect of OAT is analyzed based on the randomization test (Zhang et al., 2016). The randomization test is an experiment aiming to observe the effective capacity of neural networks and the effect of regularization by training the model on a copy of the data where the true labels are replaced by random labels.
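A small sketch of the label-randomization step (ours; Zhang et al. (2016) describe the full protocol):

```python
import numpy as np

def randomize_labels(labels, num_classes=10, seed=0):
    """Replace the true labels with labels drawn uniformly at random; a model
    that still drives the training error to zero on such data is memorizing
    rather than learning label-correlated features."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=len(labels))
```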

Figure 2: (a) The results of the randomization test and (b) the test error curves for the Standard and OAT models along their optimization trajectories on CIFAR-10.

In Figure 2(a), it can be seen that the Standard model memorizes all training samples to obtain a low training error, whereas the OAT model maintains a high training error. These results show that OAT effectively regularizes neural networks so that they learn only features strongly correlated with the class labels. Figure 2(b) indicates that the OAT model learns more slowly than the Standard model owing to the influence of a strong regularizer in the early stages of training; however, it achieves better generalization performance at the end of training.

Evaluating classification performance

The effect of OAT on classification accuracy is shown in Table 3. When the number of training samples ($N$) is small, the influence of undesirable features is expected to be large. Therefore, the experiments are stratified by the number of training samples ($N$).

Table 3: Accuracy (%, mean over 5 runs) comparison of the OAT model using various datasets with the baseline models. We show the improved results compared to the counterpart of each model in bold. Pseudo-label is the model trained using pseudo-labeled data (from 80M-TI) in conjunction with the target dataset. Fusion is the model that applies OAT$_{\text{80M-TI}}$ to the Pseudo-label model. A detailed description can be found in Appendix D.
Dataset       | CIFAR10        | CIFAR100
N             | 2,500 / Full   | 2,500 / Full
Standard      | 65.44 / 94.46  | 24.41 / 74.87
OAT_SVHN      | 68.56 / 94.45  | 24.82 / 75.65
OAT_Simpson   | 70.08 / 94.43  | 27.04 / 76.03
OAT_80M-TI    | 72.49 / 95.20  | 26.13 / 76.30
Pseudo-label  |   -   / 95.28  |   -   / 77.24
Fusion        |   -   / 95.53  |   -   / 77.36

Dataset        | ImgNet10 (64×64) | ImgNet10 (160×160)
N              | 100 / Full       | 100 / Full
Standard       | 37.90 / 86.93    | 33.36 / 90.91
OAT_VisDA17    | 36.21 / 86.71    | 35.93 / 91.23
OAT_Places365  | 41.84 / 88.37    | 40.11 / 91.42
OAT_ImgNet990  | 42.18 / 87.88    | 40.41 / 91.87

Table 3 indicates that the effect of OAT is large when the amount of training data is small, as predicted. Additionally, OAT enhances the generalization performance more when using 80M-TI, ImgNet990, and Places365, whose input distributions are similar to those of the target datasets, than when using other OOD datasets. This empirically supports our theoretical analysis: OOD data that follow the same undesirable feature distribution as the target data can improve generalization through OAT. In addition, the results of the Pseudo-label and Fusion models show that even when pseudo-labeled data are available, OOD data can be leveraged to further improve standard generalization performance. Moreover, the proposed method, which requires no complex operations and is very simple to implement, can lead to higher performance when combined with existing data augmentation methods; as an example, the effectiveness of Mixup (Zhang et al., 2017) is enhanced by applying OAT (the experimental results are provided in Appendix I).

Finally, OAT in a standard learning scheme using the entire target dataset is generally less effective than OAT in an adversarial training scheme (see Appendix F for more details). Therefore, it can be inferred that the transferability of undesirable features is greater in adversarial settings than in standard settings. In other words, these results experimentally show that the number of training samples required for robust generalization is large compared to that required for standard generalization.

6 Conclusions and future directions

In this study, a method is proposed to compensate for the insufficient training data by using OOD data, which are less restrictive than UID data. It is theoretically demonstrated that training with OOD data can remove undesirable feature contributions in a simple Gaussian model. Experiments are performed on various OOD datasets, which surprisingly demonstrate that even OOD datasets that apparently have little correlation with the target dataset from the human perspective can help standard and robust generalization through the proposed method. These results imply that a common undesirable feature space exists among diverse datasets. In addition, the effectiveness of the proposed method is evaluated when extra UID data are available, and the results indicate that OAT can improve the generalization performance even when substantial pseudo-labeled data are used.

Nevertheless, some limitations need to be acknowledged. First, it is challenging to predict the effectiveness of the proposed method before applying it to a specific target-OOD dataset pair. Second, our method is less effective against strong targeted adversarial attacks, such as CW attacks, because it is difficult to generate deliberate adversarial attacks on in-distribution data in the process of OAT. Therefore, as a future research direction, we aim to quantify the degree to which undesirable features are shared between the target and OOD datasets and to construct strong adversarial attacks using OOD data.

Acknowledgements:

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2018R1A2B3001628], the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2020, and AIR Lab (AI Research Lab) in Hyundai & Kia Motor Company through HKMC-SNU AI Consortium Fund.

References

  • Aggarwal (2018) Param Aggarwal. Fashion product images (small), 2018. data retrieved from Kaggle, https://www.kaggle.com/paramaggarwal/fashion-product-images-small.
  • Andriushchenko et al. (2019) Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049, 2019.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
  • Attia (2018) Alexandre Attia. The simpsons characters data, 2018. data retrieved from Kaggle, https://www.kaggle.com/alexattia/the-simpsons-characters-dataset.
  • Augustin et al. (2020) Maximilian Augustin, Alexander Meinke, and Matthias Hein. Adversarial robustness on in-and out-distribution improves explainability. In European Conference on Computer Vision, pp.  228–245. Springer, 2020.
  • Bahng et al. (2019) Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. arXiv preprint arXiv:1910.02806, 2019.
  • Bellman (1961) Robert Bellman. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 3:2, 1961.
  • Ben-Tal et al. (2013) Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
  • Carmon et al. (2019) Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pp. 11190–11201, 2019.
  • Chan et al. (2020) Alvin Chan, Yi Tay, and Yew-Soon Ong. What it thinks is important is important: Robustness transfers through input gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Croce & Hein (2019) Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. arXiv preprint arXiv:1907.02044, 2019.
  • Croce & Hein (2020) Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690, 2020.
  • Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in neural information processing systems, pp. 6510–6520, 2017.
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Geirhos et al. (2018) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pp.  630–645. Springer, 2016.
  • Hein et al. (2019) Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  41–50, 2019.
  • Hendrycks et al. (2018) Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
  • Hendrycks et al. (2019) Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. arXiv preprint arXiv:1901.09960, 2019.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
  • Howard. Jeremy Howard. Imagenette. URL https://github.com/fastai/imagenette/.
  • Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Lee et al. (2017) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
  • Lee et al. (2020) Saehyung Lee, Hyungyu Lee, and Sungroh Yoon. Adversarial vertex mixup: Toward better adversarially robust generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Moosavi-Dezfooli et al. (2017) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1765–1773, 2017.
  • Najafi et al. (2019) Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems, pp. 5542–5552, 2019.
  • Naseer et al. (2019) Muhammad Muzammal Naseer, Salman H Khan, Muhammad Haris Khan, Fahad Shahbaz Khan, and Fatih Porikli. Cross-domain transferability of adversarial perturbations. In Advances in Neural Information Processing Systems, pp. 12885–12895, 2019.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge, 2017.
  • Poursaeed et al. (2018) Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4422–4431, 2018.
  • Schmidt et al. (2018) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026, 2018.
  • Snodgrass & Vanderwart (1980) Joan G Snodgrass and Mary Vanderwart. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. Journal of experimental psychology: Human learning and memory, 6(2):174, 1980.
  • Stanforth et al. (2019) Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli, et al. Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725, 2019.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Torralba et al. (2008) Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.
  • Tsipras et al. (2018) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
  • Vapnik & Vapnik (1998) Vladimir Vapnik and Vlamimir Vapnik. Statistical learning theory wiley. New York, 1, 1998.
  • Wang et al. (2020) Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P. Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Xie et al. (2019) Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252, 2019.
  • Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp.  6023–6032, 2019.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7472–7482, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/zhang19p.html. https://github.com/yaodongyu/TRADES.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhou et al. (2017) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Appendix A Proofs

Theorem 1.

Let $t \in [0,1]$ be the given target value of the feature vector $\bm{z}$ in our classification model, and let $\lambda$ be a non-negative constant. Then, when $t = 0.5$, the expectation of the adversarial feature vector is

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1\right], \quad \mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q, \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (10)$$
Proof.

Let $\bm{\delta}_{\bm{x}}$ be the adversarial perturbation in the input space. Then,

$$\begin{gathered} \bm{\delta}_{\bm{x}} = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\Phi(\bm{x}+\bm{\delta}), t; \bm{w}) \approx \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\Phi(\bm{x}) + \bm{\delta}^{\top}\nabla_{\bm{x}}\Phi, t; \bm{w}) \\ = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\bm{z} + \bm{\delta}^{\top}\nabla_{\bm{x}}\bm{z}, t; \bm{w}) = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\bm{z} + \bm{\delta}_{\bm{z}}, t; \bm{w}), \quad \text{where} \; \bm{\delta}_{\bm{z}} = \bm{\delta}^{\top}\nabla_{\bm{x}}\bm{z}. \end{gathered} \qquad (11)$$

Equation (11) suggests that constructing the adversarial perturbation $\bm{\delta}_{\bm{x}}$ in the input space can be approximated by finding the adversarial perturbation $\bm{\delta}_{\bm{z}}$ in the feature space; by linear approximation, the two are related by $\bm{\delta}_{\bm{z}} = \bm{\delta}_{\bm{x}}^{\top}\nabla_{\bm{x}}\bm{z}$. Hence, without loss of generality, the perturbed feature vector $\bar{\bm{z}}$ can be approximated by $\bm{z} + \lambda \cdot \text{sign}(\nabla_{\bm{z}}\mathcal{L}(\bm{z}, t; \bm{w}))$ with a $\lambda > 0$. Then,

$$\begin{gathered} \mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1 + \lambda \cdot \text{sign}(\nabla_{z_1}\mathcal{L}(\bm{z}, t; \bm{w}))\right] = \mathbb{E}_{\bm{z}}\left[z_1 + \lambda \cdot \text{sign}\!\left(w_1\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right] \\ = \mathbb{E}_{\bm{z}}\left[z_1\right] + \lambda\,\mathbb{E}_{\bm{z}}\left[\text{sign}(0)\right] = \mathbb{E}_{\bm{z}}\left[z_1\right]. \end{gathered} \qquad (12)$$

Because our classification model was trained to minimize the expected standard loss, $\mathbb{E}_{\bm{z}}\left[\text{sign}\!\left(w_k\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right]$ can be approximated by $q$. Then,

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] = \mathbb{E}_{\bm{z}}\left[z_k + \lambda \cdot \text{sign}(\nabla_{z_k}\mathcal{L}(\bm{z}, t; \bm{w}))\right] = \mathbb{E}_{\bm{z}}\left[z_k + \lambda \cdot \text{sign}\!\left(w_k\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q. \qquad (13)$$
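A Monte Carlo sanity check of Theorem 1 under the OOD feature model of Equation (4) (our sketch; the dimension and the values of $\eta$, $\lambda$, and $v$ are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, lam, v, n = 100, 1.0, 0.2, 0.1, 200_000   # illustrative parameter values

q = rng.choice([-1.0, 1.0], size=n)               # unknown label of the OOD samples
z1 = rng.normal(0.0, v, size=n)                   # robust feature, Equation (4)
zk = rng.normal(eta * q[:, None], 1.0, (n, d))    # non-robust features z_2..z_{d+1}

w = np.full(d, 1.0 / d)                           # weights on z_2..z_{d+1}; w_1 = 0
s = np.sign(zk @ w)                               # sign of w_k*(sigma(w^T z) - 1/2), k >= 2

z1_bar = z1                                       # w_1 = 0: gradient w.r.t. z_1 vanishes
zk_bar = zk + lam * s[:, None]                    # FGSM-style step of size lam

print(np.mean(z1_bar - z1))                       # ~ 0, matching E[z1_bar] = E[z1]
print(np.mean(zk_bar[q > 0] - zk[q > 0]),
      np.mean(zk_bar[q < 0] - zk[q < 0]))         # ~ +lam and -lam, i.e. E[zk_bar] ~ E[zk] + lam*q
```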

Theorem 2.

When $t = 0.5$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx 0, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (14)$$
Proof.

Based on the adversarial vulnerability of our classifier, $\sigma(\bm{w}^{\top}\bar{\bm{z}})$ can be approximated by $\frac{1}{2}(1+q)$. Therefore,

$$\begin{gathered} \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\left(\sigma(\bm{w}^{\top}\bar{\bm{z}}) - \tfrac{1}{2}\right)\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1 \cdot \sigma(\bm{w}^{\top}\bar{\bm{z}})\right] - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] \approx \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right](1+q) - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] = \tfrac{q}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] = 0, \\ \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\left(\sigma(\bm{w}^{\top}\bar{\bm{z}}) - \tfrac{1}{2}\right)\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k \cdot \sigma(\bm{w}^{\top}\bar{\bm{z}})\right] - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] \approx \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right](1+q) - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] = \tfrac{q}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] = \tfrac{1}{2}(\eta+\lambda). \end{gathered} \qquad (15)$$

Theorem 3.

When t=0.5\mathnormal{t}=0.5 and w1>0\mathnormal{w}_{1}>0, the expected gradient of the loss function (𝐳¯,t;𝐰)\mathcal{L}(\bar{\bm{\mathnormal{z}}},\mathnormal{t};\bm{\mathnormal{w}}) with respect to the weight vector 𝐰\bm{\mathnormal{w}} of our classification model is

𝔼𝒛¯[w1]12λ,𝔼𝒛¯[wk]12(η+λ),wherek{2,,d+1}.\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]\approx\frac{1}{2}\lambda,\quad\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{k}}\right]\approx\frac{1}{2}(\eta+\lambda),\quad\textup{where}\;k\in\{2,\dots,d+1\}. (16)
Proof.
𝔼𝒛[z¯1]=𝔼𝒛[z1+λsign(z1(𝒛,t;𝒘))]=𝔼𝒛[z1+λsign(w1(σ(𝒘𝒛)12))]𝔼𝒛[z1]+λq,𝔼𝒛¯[w1]=𝔼𝒛¯[z¯1(σ(𝒘𝒛¯)12)]=𝔼𝒛¯[z¯1σ(𝒘𝒛¯)]12𝔼𝒛¯[z¯1]12𝔼𝒛¯[z¯1](1+q)12𝔼𝒛¯[z¯1]=q2𝔼𝒛¯[z¯1]=12λ.\begin{gathered}\mathbb{E}_{\bm{\mathnormal{z}}}\left[\bar{z}_{1}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}+\lambda\cdot\textup{sign}(\nabla_{\mathnormal{z}_{1}}\mathcal{L}(\bm{\mathnormal{z}},\mathnormal{t};\bm{\mathnormal{w}}))\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}+\lambda\cdot\textup{sign}(\mathnormal{w}_{1}(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\frac{1}{2}))\right]\\ \approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}\right]+\lambda\cdot q,\\ \mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]=\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}(\sigma(\bm{\mathnormal{w}}^{\top}\bar{\bm{\mathnormal{z}}})-\frac{1}{2})\right]=\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\cdot\sigma(\bm{\mathnormal{w}}^{\top}\bar{\bm{\mathnormal{z}}})\right]-\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]\\ \approx\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right](1+q)-\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]=\frac{q}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]=\frac{1}{2}\lambda.\end{gathered} (17)

Appendix B The Theoretical Motivation of OAT in Standard Learning

Based on the same setup described in Section 3 of the main manuscript, we construct the data model for OAT in the following manner:

qu.a.r{1,+1},z1𝒩(0,σ12),z2𝒩(κq,σ22),whereκ+,σ12σ22.\mathnormal{q}\stackrel{{\scriptstyle u.a.r}}{{\sim}}\{-1,+1\},\quad\mathnormal{z}_{1}\stackrel{{\scriptstyle}}{{\sim}}\mathcal{N}(0,\sigma_{1}^{2}),\quad\mathnormal{z}_{2}\stackrel{{\scriptstyle}}{{\sim}}\mathcal{N}(\kappa q,\sigma_{2}^{2}),\quad\textup{where}\;\kappa\in\mathbb{R}^{+},\;\sigma_{1}^{2}\ll\sigma_{2}^{2}. (18)

For simplicity, we consider only one desirable feature extractor ϕ1 and one undesirable feature extractor ϕ2. In Equation (18), the features z1 = ϕ1(𝒙) and z2 = ϕ2(𝒙) denote the outputs of the feature extractors ϕ1 and ϕ2, respectively. As the OOD input vectors do not have the same desirable features as the target input vectors, the mean of z1 is zero. On the other hand, because the OOD data share the same distribution of the undesirable feature as the target data, the mean of z2 is non-zero. In addition, considering that the feature extractors are trained to respond sensitively to the target input vectors, the output variance of the undesirable feature extractor is much larger than that of the desirable feature extractor for OOD data, especially in the high-dimensional case. q represents the unknown label associated with the undesirable feature, and κ is a positive constant that represents the degree of correlation between the undesirable feature and the unknown label.

We can prove the following theorems for a logistic regression model that is not strongly dependent on the desirable feature.

Theorem 4.

The OOD data barely affect the gradient update of the weight associated with the desirable feature.

Proof.

Because we assumed a linear classifier that is not strongly dependent on the desirable feature, owing to the bias of CNNs or insufficient training data, 𝒘⊤𝒛 = w1z1 + w2z2 can be approximated by w2z2. Then, for any target value t, the expected gradient of the loss with respect to w1 is

𝔼𝒛[w1]=𝔼𝒛[(σ(𝒘𝒛)t)z1]𝔼𝒛[(σ(w2z2)t)z1]=𝔼𝒛[σ(w2z2)t]𝔼𝒛[z1]=0.\begin{gathered}\mathbb{E}_{\bm{\mathnormal{z}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\mathnormal{t})\mathnormal{z}_{1}\right]\\ \approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\mathnormal{t})\mathnormal{z}_{1}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\mathnormal{t}\right]\mathbb{E}_{\bm{\mathnormal{z}}}\left[\mathnormal{z}_{1}\right]=0.\end{gathered} (19)

Note that in Equation (19), the gradient of the loss function with respect to the weight value w1\mathnormal{w}_{1} is zero regardless of t\mathnormal{t}. The optimal tt is then a value that makes w2\mathnormal{w}_{2} converge to zero, thereby removing the influence of the undesirable feature.

Theorem 5.

When t=0.5, standard learning on the OOD data causes the weight value corresponding to the undesirable feature to converge to 0.

Proof.
𝔼𝒛[w2]=𝔼𝒛[(σ(𝒘𝒛)12)z2]𝔼𝒛[(σ(w2z2)12)z2].\mathbb{E}_{\bm{\mathnormal{z}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{2}}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\frac{1}{2})\mathnormal{z}_{2}\right]\approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]. (20)

When w2>0\mathnormal{w}_{2}>0, z2σ(w2z2)>12z2\mathnormal{z}_{2}\cdot\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})>\frac{1}{2}\mathnormal{z}_{2} with high probability. Hence,

𝔼𝒛[(σ(w2z2)12)z2]>0,wherew2>0.\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]>0,\quad\textup{where}\;\mathnormal{w}_{2}>0. (21)

Similarly,

𝔼𝒛[(σ(w2z2)12)z2]<0,wherew2<0.\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]<0,\quad\textup{where}\;\mathnormal{w}_{2}<0. (22)

In both cases, the expected gradient has the same sign as w2; because gradient descent updates w2 in the opposite direction of the gradient, w2 converges to 0.
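To illustrate Theorems 4 and 5 numerically, the following NumPy sketch samples from the data model in Equation (18) and runs gradient descent on the logistic loss with the uniform target t = 0.5. The constants below (κ, σ1, σ2, the step size, and the initial weights) are illustrative assumptions rather than values used in our experiments.

```python
import numpy as np

# Minimal simulation of the Appendix B data model (Equation 18).
# All constants (kappa, sigma1, sigma2, step size, initial weights) are
# illustrative assumptions chosen only to visualize the trend.
rng = np.random.default_rng(0)
n, kappa, sigma1, sigma2 = 100_000, 1.0, 0.1, 3.0

q = rng.choice([-1.0, 1.0], size=n)          # unknown label of the undesirable feature
z1 = rng.normal(0.0, sigma1, size=n)         # desirable feature: zero mean on OOD data
z2 = rng.normal(kappa * q, sigma2, size=n)   # undesirable feature: shared with target data

w = np.array([0.05, 1.0])                    # classifier weakly tied to z1, strongly to z2
t, lr = 0.5, 0.1                             # uniform-distribution target and step size

for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(w[0] * z1 + w[1] * z2)))   # sigmoid(w^T z)
    grad = np.array([np.mean((p - t) * z1),               # dL/dw1 (Theorem 4: ~0)
                     np.mean((p - t) * z2)])              # dL/dw2 (Theorem 5: same sign as w2)
    w -= lr * grad

print(w)   # w[1] shrinks toward 0 while w[0] is left essentially unchanged
```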

Appendix C Further Related Works

Using Unlabeled Data to Improve Adversarial Robustness

Stanforth et al. (2019) and Carmon et al. (2019) analyzed the larger sample complexity required for adversarially robust generalization (Schmidt et al., 2018). Based on a model described in that prior work, they theoretically proved that unlabeled data can alleviate the need for the large sample complexity of robust generalization. Accordingly, they proposed semi-supervised learning techniques that augment the training dataset with extra unlabeled data, using pseudo-labels obtained from a model trained on the existing training dataset. They experimented on CIFAR-10 with the 80 Million Tiny Images dataset (Torralba et al., 2008) (80M-TI) as an extra training dataset, improving the adversarial robustness of the model. Najafi et al. (2019) also used unlabeled data for adversarial robustness by extending distributionally robust learning (Ben-Tal et al., 2013) to semi-supervised scenarios. Instead of pseudo-labels, they used soft labels, which are chosen from the label set and softened according to the loss values of the data. Their results include experiments on the MNIST, CIFAR-10, and SVHN datasets.

Augustin et al. (2020)

Comparison with Bad GAN

Dai et al. (2017) theoretically showed that a perfect generator cannot enhance generalization performance and that good semi-supervised learning actually requires a bad generator. Based on this analysis, they proposed an empirical formulation to generate low-density samples in the input space. There are two main differences between Bad GAN and OAT. First, Dai et al. (2017) considered only semi-supervised learning settings, whereas OAT can also be applied to adversarial settings. Second, data augmentation using a GAN has clear limitations: Dai et al. (2017) penalized high-density samples to generate low-density samples in the input space, but this approach is inefficient in a high-dimensional input space. In contrast, OAT directly uses diverse real OOD data to regularize the model, and a similar shift from synthetic to real auxiliary data can also be found in OOD detection (Hendrycks et al., 2018).

Appendix D Sourcing OOD Datasets

We created OOD datasets from 80M-TI for CIFAR-10 and CIFAR-100 by following the data-sourcing procedure of Carmon et al. (2019). In other words, we trained an 11-way classifier (one extra class for OOD) on a training set consisting of the CIFAR-10 dataset and 1M images randomly sampled from 80M-TI with keywords that do not appear in CIFAR-10. We then applied the classifier to 80M-TI and sorted the images by their confidence in the OOD class. The 1M and 5M images with the highest confidence were used for OAT-A and OAT-S, respectively. In Table 3, Fusion has a batch size of target n/2 + pseudo-labeled n/4 + OOD n/4, which is compared with the Pseudo-label model trained with a batch size of target n/2 + pseudo-labeled n/2. In addition, we resized ImageNet (using bilinear interpolation) to dimensions of 64×64 and 160×160 and divided it into datasets containing 10 and 990 classes, called ImgNet10 (train set size = 9894, test set size = 3500) and ImgNet990, respectively. The classes were divided based on the Imagenette dataset (Howard, ), and the experimental results for a differently divided dataset (Imagewoof) are given in Appendix F. We increased the number of ImgNet990 images by a factor of 10 through random cropping and created the OOD datasets through the same process as for 80M-TI. Furthermore, we resized (bilinear interpolation) Places365 (Zhou et al., 2017) and VisDA17 (Peng et al., 2017) for the experiments on ImgNet10 and cropped Simpson Characters (Simpson) (Attia, 2018) and Fashion Product Images (Fashion) (Aggarwal, 2018) to dimensions of 32×32 for the experiments on CIFAR.
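As an illustration of the sorting step above, the following PyTorch sketch ranks a pool of candidate images by the softmax confidence assigned to the extra OOD class and keeps the top-k. The function and loader names are placeholders; this is a minimal sketch rather than the pipeline used to build our datasets.

```python
import torch

@torch.no_grad()
def select_ood_images(ood_classifier, candidate_loader, ood_class_idx, k, device="cuda"):
    """Rank candidate images by the softmax confidence of the extra OOD class and
    return the dataset indices of the top-k most OOD-like images (a rough sketch;
    candidate_loader is assumed to yield (images, dataset_indices) batches)."""
    ood_classifier.eval().to(device)
    confidences, indices = [], []
    for images, idx in candidate_loader:        # candidate pool, e.g. images from 80M-TI
        probs = torch.softmax(ood_classifier(images.to(device)), dim=1)
        confidences.append(probs[:, ood_class_idx].cpu())
        indices.append(idx)
    confidences, indices = torch.cat(confidences), torch.cat(indices)
    order = torch.argsort(confidences, descending=True)   # most confidently OOD first
    return indices[order[:k]]                              # e.g. k = 1M (OAT-A) or 5M (OAT-S)
```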

Table 4: Implementation details for the experiments of OAT-A
Target | Architecture | α | Training steps | Batch size
CIFAR | WRN-34-10 (Zagoruyko & Komodakis, 2016) | 1.0 | 80K | 128
ImgNet10 | ResNet18 (He et al., 2016) | 1.0 | 15.4K | 128
Table 5: Implementation details for the experiments of OAT-S
Target | N | Architecture | α | Training steps | Batch size
CIFAR | 2500 | ResNet18 | 1.0 | 4000 | 128
CIFAR | 50K | ResNet18 | 1.0 | 78K | 128
ImgNet10 (64 x 64) | 100 | WRN-22-10 | 1.0 | 200 | 100
ImgNet10 (64 x 64) | 9894 | WRN-22-10 | 0.2 | 15.4K | 128
ImgNet10 (160 x 160) | 100 | ResNet18 | 1.0 | 400 | 100
ImgNet10 (160 x 160) | 9894 | ResNet18 | 0.1 | 38.5K | 128

Appendix E Implementation Details

For all the experiments except TRADES and OATTRADES, the initial learning rate is set to 0.1. The learning rate is multiplied by 0.1 at 50% and 75% of the total training steps, and the weight decay factor is set to 2e-4. We use the same adversarial perturbation budget, ϵ=8, as in Madry et al. (2017). For adversarial training, we report the maximum adversarial robustness of the models on the test set after the first learning rate decay; for standard learning, we report the best test accuracy observed during training (an empirical upper bound). The other details are summarized in Tables 4 and 5. For TRADES and OATTRADES, we use a batch size of 64 and train the models with the same configurations as Zhang et al. (2019).
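As a rough illustration, these settings correspond to a PyTorch configuration along the following lines. The momentum value of 0.9 and the [0, 1] input scaling under which ϵ = 8 pixel levels equals 8/255 are our assumptions, and build_optimizer is a placeholder rather than part of our released code.

```python
import torch

# Minimal sketch of the optimization schedule described above (not the exact training code).
# Assumptions: SGD with momentum 0.9; scheduler.step() is called once per training step.
def build_optimizer(model, total_steps, base_lr=0.1, weight_decay=2e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)
    # Multiply the learning rate by 0.1 at 50% and 75% of the total training steps.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[total_steps // 2, (3 * total_steps) // 4], gamma=0.1)
    return optimizer, scheduler

EPSILON = 8.0 / 255.0   # perturbation budget, assuming inputs scaled to [0, 1]
# Example: optimizer, scheduler = build_optimizer(model, total_steps=80_000)  # CIFAR, Table 4
```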

Appendix F Further Discussion about the Effectiveness of OAT

We created a new ImgNet10 dataset with reference to Imagewoof (Howard, ) to learn more about the effects of OAT. Imagewoof is a subset of ImageNet, and all classes are dog breeds. We tested whether OAT is effective for the newly constructed ImgNet10 dataset, and the experimental results of OAT-A and OAT-S are shown in Tables 6 and 7, respectively.

Table 6: Accuracy (%) comparison of the OAT model with Standard and PGD on ImgNet10 (64×\times64) under different threat models. We show the improved results compared to the counterpart of each model in bold.
Model | OOD | Clean | PGD20 | CW20
Standard | - | 75.77 | - | -
PGD | - | 53.64 | 16.18 | 14.81
OATPGD | ImgNet990 | 48.24 | 22.09 | 17.1
OATPGD | Places365 | 50.45 | 22.57 | 17.37
Table 7: Accuracy (%, mean over 5 runs) comparison of the OAT models with the baseline models. We show the improved results compared to the counterpart of each model in bold.
Target | Model | N=100 | N=250 | N=500 | N=1250 | N=2500 | Full
ImgNet10 (64 x 64) | Standard | 19.42 | 25.48 | 31.49 | 47.04 | 57.12 | 75.77
ImgNet10 (64 x 64) | OATImgNet990 | 22.74 | 30.07 | 35.16 | 48.84 | 58.14 | 75.17
ImgNet10 (64 x 64) | OATPlaces365 | 22.36 | 30.15 | 33.57 | 48.16 | 57.15 | 75.63
ImgNet10 (160 x 160) | Standard | 20.03 | 25.36 | 30.69 | 50.69 | 64.82 | 81.22
ImgNet10 (160 x 160) | OATImgNet990 | 23.64 | 30.05 | 37.09 | 60.01 | 70.39 | 83.16
ImgNet10 (160 x 160) | OATPlaces365 | 23.09 | 31.74 | 37.22 | 59.45 | 69.90 | 83.04

Tables 6 and 7 show that OAT is also largely useful for generalization on the new ImgNet10. However, from the results for ImgNet10 (64×64) in Table 7, we can speculate that OAT helps as a regularizer only when an excessive number of features are involved in the classification, because OAT yields no performance improvement on the new ImgNet10 (64×64) dataset when the full training set is used. The new ImgNet10 (64×64) is believed to have lost much of its categorical features, since all of its classes are dog breeds and the images are heavily downscaled. In other words, OAT is beneficial for the overfitting problem, but not for the underfitting problem. We can confirm this by artificially reducing the number of training samples, which increases the number of features that can distinguish the per-class sample distributions, and then observing that applying OAT improves generalization performance.

Appendix G Ablation Study

Table 8: Adversarial robustness (%) as hyperparameter α\alpha changes
α\alpha PGD20 CW20
1.0 57.45 52.65
2.0 58.25 52.16
3.0 58.31 52.02
4.0 59.04 51.71
5.0 59.48 51.24

Our experiments indicate that OAT can enhance adversarial robustness by removing the contributions of non-robust features present in additional data, and that this effect transfers to in-distribution data. However, the increase in robustness against targeted attacks appears smaller than that against untargeted attacks. To gain insight into this phenomenon, we train OAT models for various α values and investigate the difference between untargeted and targeted attacks. Table 8 shows that the greater the influence of the OOD data on training, the higher the robustness against PGD20 and the lower the robustness against CW20. In light of our theoretical analysis, this suggests that OOD data contain many of the features used for untargeted attacks but relatively few of those used for targeted attacks. Because targeted attacks are stronger than untargeted attacks (Carlini & Wagner, 2017), the results in Table 8 provide empirical evidence for a trade-off between the transferability of adversarial perturbations and the strength of adversarial attacks.

A similar trend was reported by Chan et al. (2020), who showed that input gradient adversarial matching (IGAM) can transfer robustness across different tasks. Their results show that IGAM-trained models attain similar or higher robustness than baseline models against weak attacks, such as FGSM or low-step PGD, but remain vulnerable to strong attacks. The fine-tuning-based method of Hendrycks et al. (2019) can also be regarded as an attempt to increase robustness by using other datasets, but the abovementioned phenomenon is not observed there. This is because the method does not transfer robustness learned from other datasets to the target dataset; rather, it reuses, with little modification, a function learned from a dataset that has a large sample complexity and a data distribution similar to that of the target data. This can be confirmed from the small number of training iterations and the small learning rate involved in the fine-tuning process. Moreover, applying the same method to a dataset with a small sample size or one far from the target data distribution has no effect (Chan et al., 2020).

Appendix H When UID Data Are Available

Table 9: Error rate (%) comparison of OAT with Mixup on CIFAR10, CIFAR100, and ImgNet10. We show the best result for each target dataset in bold.
Model | Target | Ratio | Error rate
Standard | CIFAR10 | - | 5.54
OAT80M-TI | CIFAR10 | - | 4.80
Mixup | CIFAR10 | 0.0 | 4.11
OAT80M-TI+Mixup | CIFAR10 | 0.6 | 3.88
Standard | CIFAR100 | - | 25.13
OAT80M-TI | CIFAR100 | - | 24.35
Mixup | CIFAR100 | 0.0 | 22.63
OAT80M-TI+Mixup | CIFAR100 | 0.6 | 21.60
Standard | ImgNet10 (64 x 64) | - | 13.07
OATImgNet990 | ImgNet10 (64 x 64) | - | 12.12
OATPlaces365 | ImgNet10 (64 x 64) | - | 11.63
Mixup | ImgNet10 (64 x 64) | 0.0 | 11.60
OATImgNet990+Mixup | ImgNet10 (64 x 64) | 0.7 | 10.77
OATPlaces365+Mixup | ImgNet10 (64 x 64) | 0.7 | 10.39
Standard | ImgNet10 (160 x 160) | - | 9.09
OATImgNet990 | ImgNet10 (160 x 160) | - | 8.13
OATPlaces365 | ImgNet10 (160 x 160) | - | 8.58
Mixup | ImgNet10 (160 x 160) | 0.0 | 7.04
OATImgNet990+Mixup | ImgNet10 (160 x 160) | 0.3 | 6.46
OATPlaces365+Mixup | ImgNet10 (160 x 160) | 0.3 | 6.90

We train the OAT models in conjunction with UID data as follows:

\begin{gathered}\min_{\theta}\;\alpha_{\textup{in}}\,\mathbb{E}_{(\bm{\mathnormal{x}}_{t},y)\in\mathcal{D}_{t}}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{t}+\bm{\delta},\mathnormal{y};\theta)\right]+\alpha_{\textup{o}}\,\mathbb{E}_{\bm{\mathnormal{x}}_{o}\in\mathcal{D}_{o}}\left[\max_{\bm{\epsilon}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{o}+\bm{\epsilon},\mathnormal{t}_{\textup{unif}};\theta)\right]\\ +\alpha_{\textup{UID}}\,\mathbb{E}_{\bm{\mathnormal{x}}_{u}\in\mathcal{D}_{\textup{UID}}}\left[\max_{\bm{\zeta}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{u}+\bm{\zeta},\mathnormal{y}_{\textup{pseudo}};\theta)\right].\end{gathered} (23)

In the experiment of Figure 1(a), the Pseudo-label models are trained via a PGD-based approach, and every batch consists of 64 CIFAR-10 samples and 64 pseudo-labeled samples. The OAT+Pseudo-label models are also trained via a PGD-based approach, and every batch consists of 64 CIFAR-10 samples, 32 pseudo-labeled samples, and 32 OOD samples (80M-TI). The hyperparameters (αin, αo, αUID) are set to (1/3, 1/3, 1/3). All other conditions are the same as described in Appendix E. In the experiment of Figure 1(b), we train the models with the same configurations as Carmon et al. (2019), but every batch for the OAT+RST model comprises 128 pseudo-labeled samples, 64 CIFAR-10 samples, and 64 OOD samples (the RST model has a batch size of 256). The hyperparameters (αin, αo, αUID) are set to (0.25, 0.25, 0.5).
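For reference, the following PyTorch-style sketch assembles the per-batch objective corresponding to Equation (23). Here, attack() is a placeholder for the inner maximization (e.g., a PGD attack), the soft cross-entropy toward the uniform label is one natural instantiation of the OOD term, and the loss weights are passed in explicitly; this is a minimal sketch rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def oat_uid_loss(model, x_t, y_t, x_o, x_u, y_pseudo, attack,
                 alpha_in=1/3, alpha_o=1/3, alpha_uid=1/3, num_classes=10):
    """One mini-batch objective in the spirit of Equation (23).
    attack(model, x, target) is a placeholder for the inner maximization (e.g., PGD).
    Default weights echo Figure 1(a); Figure 1(b) uses (0.25, 0.25, 0.5)."""
    # Target data: adversarial cross-entropy with the true labels.
    loss_in = F.cross_entropy(model(attack(model, x_t, y_t)), y_t)

    # OOD data: adversarial soft cross-entropy toward the uniform-distribution label t_unif.
    t_unif = torch.full((x_o.size(0), num_classes), 1.0 / num_classes, device=x_o.device)
    logits_o = model(attack(model, x_o, t_unif))
    loss_o = -(t_unif * F.log_softmax(logits_o, dim=1)).sum(dim=1).mean()

    # UID data: adversarial cross-entropy with pseudo-labels.
    loss_uid = F.cross_entropy(model(attack(model, x_u, y_pseudo)), y_pseudo)

    return alpha_in * loss_in + alpha_o * loss_o + alpha_uid * loss_uid
```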

Appendix I Experimental Results for Mixup

Zhang et al. (2017) proposed a data augmentation method named Mixup. Mixup generates training examples as x̃ = αxi + (1−α)xj and ỹ = αyi + (1−α)yj, where (xi, yi) and (xj, yj) are two examples drawn at random from the training data and α ∈ [0,1]. Here, we show that regions outside the convex combinations of the training data can also be effectively regularized using OOD data. Table 9 shows that the models trained with the combination of OAT and Mixup consistently achieve the lowest error rates. The OAT+Mixup models are trained with the following algorithm:

Algorithm 2 OAT+Mixup
0:  Ratio γ\gamma, Target dataset 𝒟t\mathcal{D}_{t}, OOD dataset 𝒟o\mathcal{D}_{o}, uniform distribution label tunift_{\textup{unif}}, batch size nn, training iterations TT, learning rate τ\tau
1:  for t=1t=1 to TT do
2:     (Xt,Yt)=SAMPLE(dataset=𝒟t,size=n)(X_{t},Y_{t})=\textup{SAMPLE}(\textup{dataset}=\mathcal{D}_{t},\textup{size}=n)
3:     Xo=SAMPLE(dataset=𝒟o,size=nγ)X_{o}=\textup{SAMPLE}(\textup{dataset}=\mathcal{D}_{o},\textup{size}=\lfloor n*\gamma\rfloor)
4:     (X¯t,Y¯t)PERMUTE(Xt,Yt)(\bar{X}_{t},\bar{Y}_{t})\leftarrow\textup{PERMUTE}(X_{t},Y_{t})
5:     (X¯t[0:nγ],Y¯t[0:nγ])(Xo,tunif)(\bar{X}_{t}\left[0:\lfloor n*\gamma\rfloor\right],\bar{Y}_{t}\left[0:\lfloor n*\gamma\rfloor\right])\leftarrow(X_{o},t_{\textup{unif}})
6:     (X,Y)MIXUP(Xt,Yt,X¯t,Y¯t)(X,Y)\leftarrow\textup{MIXUP}(X_{t},Y_{t},\bar{X}_{t},\bar{Y}_{t})
7:     model update:
8:     𝜽𝜽τθAVERAGE((X,Y;𝜽))\bm{\theta}\leftarrow\bm{\theta}-\tau\cdot\nabla_{\theta}\textup{AVERAGE}(\mathcal{L}(X,Y;\bm{\theta}))
9:  end for
10:  Output: trained model parameter 𝜽\bm{\theta}
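To make Algorithm 2 concrete, the following PyTorch sketch implements one training iteration. The one-hot label encoding, the Beta(1, 1) sampling of the mixing coefficient, and the soft-label cross-entropy are our assumptions about details the pseudocode leaves open, and all function names are placeholders.

```python
import torch
import torch.nn.functional as F

def oat_mixup_step(model, optimizer, x_t, y_t_onehot, x_o, num_classes):
    """One iteration of Algorithm 2 (OAT+Mixup), written as a sketch.
    x_t, y_t_onehot: target batch of size n with one-hot labels;
    x_o: floor(n * gamma) OOD images sampled for this iteration."""
    n, m = x_t.size(0), x_o.size(0)                  # m = floor(n * gamma)
    perm = torch.randperm(n, device=x_t.device)      # PERMUTE(X_t, Y_t)
    x_bar, y_bar = x_t[perm].clone(), y_t_onehot[perm].clone()

    # Overwrite the first floor(n * gamma) permuted samples with OOD data and uniform labels.
    x_bar[:m] = x_o
    y_bar[:m] = 1.0 / num_classes

    # MIXUP: convex combination of the target batch and the (partially OOD) permuted batch.
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()   # assumed mixing distribution
    x_mix = lam * x_t + (1.0 - lam) * x_bar
    y_mix = lam * y_t_onehot + (1.0 - lam) * y_bar

    # Soft-label cross-entropy, averaged over the batch (AVERAGE in Algorithm 2).
    loss = -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```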