
Removing Undesirable Feature Contributions Using Out-of-Distribution Data

Saehyung Lee, Changhwa Park, Hyungyu Lee, Jihun Yi, Jonghyun Lee, Sungroh Yoon
Electrical and Computer Engineering, AIIS, ASRI, INMC, and Institute of Engineering Research
Seoul National University
Seoul 08826, South Korea
{halo8218,omega6464,rucy74,t080205,leejh9611,sryoon}@snu.ac.kr
Correspondence to: Sungroh Yoon [email protected].
Abstract

Several data augmentation methods deploy unlabeled-in-distribution (UID) data to bridge the gap between the training and inference of neural networks. However, these methods have clear limitations in terms of the availability of UID data and the dependence of the algorithms on pseudo-labels. Herein, we propose a data augmentation method that improves generalization in both adversarial and standard learning by using out-of-distribution (OOD) data, an approach that avoids the abovementioned issues. We theoretically show how OOD data can improve generalization in each learning scenario and complement our analysis with experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet. The results indicate that undesirable features are shared even among image data that seem to have little correlation from a human point of view. We also present the advantages of the proposed method through comparison with other data augmentation methods, which can be used in the absence of UID data. Furthermore, we demonstrate that the proposed method can further improve the existing state-of-the-art adversarial training.

1 Introduction

The power of the enormous amount of data suggested by the empirical risk minimization (ERM) principle (Vapnik & Vapnik, 1998) has allowed deep neural networks (DNNs) to perform outstandingly on many tasks, including computer vision (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). However, most of the practical problems encountered by DNNs have high-dimensional input spaces, and nontrivial generalization errors arise owing to the curse of dimensionality (Bellman, 1961). Moreover, neural networks have been found to be easily deceived by adversarial perturbations with a high degree of confidence (Szegedy et al., 2013). Several studies (Goodfellow et al., 2014; Krizhevsky et al., 2012) have been conducted to address these generalization problems resulting from ERM, and most of them handle the problems by extending the training distribution (Madry et al., 2017; Lee et al., 2020). Nevertheless, it has been demonstrated that more data are needed to achieve better generalization (Schmidt et al., 2018). Recent methods (Carmon et al., 2019; Xie et al., 2019) introduced unlabeled-in-distribution (UID) data to compensate for the lack of training samples. However, these methods have two limitations: first, obtaining suitable UID data for the selected classes is challenging; second, when applying supervised learning methods to pseudo-labeled data, the effect of data augmentation depends heavily on the accuracy of the pseudo-label generator.

In our study, to overcome the limitations outlined above, we propose an approach that promotes robust and standard generalization using out-of-distribution (OOD) data. In particular, motivated by previous studies demonstrating the existence of a common adversarial space among different images and even datasets (Naseer et al., 2019; Poursaeed et al., 2018), we show that OOD data can be leveraged for adversarial learning. Likewise, if the OOD data share the same undesirable features as the in-distribution data in terms of standard generalization, they can be leveraged for standard learning. By definition, in this work, the classes of the OOD data differ from those of the in-distribution data, and our method does not use the label information of the OOD data. Therefore, the proposed method is free from the previously mentioned problems caused by UID data. We present a theoretical model that demonstrates how to improve generalization using OOD data in both adversarial and standard learning. In this model, we separate desirable and undesirable features and show how training on OOD data that share undesirable features with the in-distribution data changes the weight values of the classifier. Based on this analysis, we introduce out-of-distribution data augmented training (OAT), which assigns a uniform distribution label to all the OOD data samples to remove the influence of undesirable features in adversarial and standard learning. In the proposed method, each batch is composed of training data and OOD data, and the OOD data regularize the training so that only features strongly correlated with the class labels are learned. We complement our theoretical findings with experiments on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and a subset of ImageNet (Deng et al., 2009). In addition, we present empirical evidence for the transferability of undesirable features through further studies on various datasets, including Simpson Characters (Attia, 2018), Fashion Product Images (Aggarwal, 2018), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2017), and VisDA-17 (Peng et al., 2017).

Contributions

(i) We propose a simple method, out-of-distribution data augmented training (OAT), to leverage OOD data for adversarial and standard learning, and our theoretical analyses demonstrate how the proposed method can improve robust and standard generalization. (ii) The results of experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet suggest that OAT can help reduce the generalization gap in adversarial and standard learning. (iii) By applying OAT using various OOD datasets, we show that undesirable features are shared among diverse image datasets. We also demonstrate that OAT can effectively extend the training distribution by comparing it with other data augmentation methods that can be employed in the absence of UID data. (iv) The state-of-the-art adversarial training method using UID data is found to improve further when combined with the proposed method of leveraging OOD data.

2 Background

Undesirable features in adversarial learning

Tsipras et al. (2018) demonstrated the existence of a trade-off between standard accuracy and adversarial robustness based on the distinction between robust and non-robust features. They showed that adversarial robustness can be incompatible with standard accuracy by constructing a binary classification task in which the data consist of input-label pairs $(\bm{x}, y) \in \mathbb{R}^{d+1} \times \{\pm 1\}$ sampled from the following distribution:

$$y \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad x_1 = \begin{cases} +y & \text{w.p. } p \\ -y & \text{w.p. } 1-p \end{cases}, \quad x_2,\dots,x_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\epsilon y, 1). \qquad (1)$$

Here, $x_1$ is a robust feature that is strongly correlated with the label, and the other features $x_2,\dots,x_{d+1}$ are non-robust features that are weakly correlated with the label. $\epsilon$ is small but sufficiently large that a simple classifier attains high standard accuracy, and $p \geq 0.5$. To characterize adversarial robustness, the expected standard loss $\beta_s$ and the expected adversarial loss $\beta_a$ for a data distribution $D$ are defined as follows:

$$\beta_s = \mathbb{E}_{(\bm{x},y)\sim D}\left[\mathcal{L}(\bm{x},y;\theta)\right], \quad \beta_a = \mathbb{E}_{(\bm{x},y)\sim D}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{x}+\bm{\delta},y;\theta)\right]. \qquad (2)$$

Here, $\mathcal{L}(\cdot;\theta)$ is the loss function of the model, and $S$ is the set of perturbations that the adversary can apply to deceive the model. For Equation (1), Tsipras et al. (2018) showed that the following classifier yields a small expected standard loss:

$$f_{\text{avg}}(\bm{x}) = \text{sign}(\bm{w}_{\text{unif}}^{\top}\bm{x}), \quad \text{where} \; \bm{w}_{\text{unif}} = \left[0, \frac{1}{d}, \dots, \frac{1}{d}\right]. \qquad (3)$$

They also proved that the classifier is vulnerable to adversarial perturbations, and that adversarial training results in a classifier that assigns zero weight values to non-robust features.
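To make this concrete, here is a small numerical sketch (ours, not from the paper; the sample size, dimension, and $\epsilon = 2/\sqrt{d}$ are illustrative choices) showing that the classifier in Equation (3) attains high standard accuracy on the distribution in Equation (1) yet collapses under an $\ell_\infty$ perturbation of budget $2\epsilon$ applied to the non-robust coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100_000, 100, 0.95
eps = 2.0 / np.sqrt(d)                                 # illustrative choice of epsilon

y = rng.choice([-1.0, 1.0], size=n)                    # labels, u.a.r. in {-1, +1}
x1 = np.where(rng.random(n) < p, y, -y)                # robust feature x_1
x_rest = rng.normal(eps * y[:, None], 1.0, (n, d))     # non-robust features x_2..x_{d+1}
X = np.column_stack([x1, x_rest])

w_unif = np.concatenate([[0.0], np.full(d, 1.0 / d)])  # classifier of Equation (3)
clean_acc = np.mean(np.sign(X @ w_unif) == y)

X_adv = X.copy()
X_adv[:, 1:] -= 2 * eps * y[:, None]                   # worst-case shift within |delta| <= 2*eps
adv_acc = np.mean(np.sign(X_adv @ w_unif) == y)
print(f"standard accuracy ~{clean_acc:.3f}, adversarial accuracy ~{adv_acc:.3f}")
```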

Transferability of adversarial perturbations

Naseer et al. (2019) produced domain-agnostic adversarial perturbations, thereby showing a common adversarial space among different datasets. They showed that an adversarial function trained on Paintings, Cartoons, or Medical data can deceive a classifier on ImageNet data with a high success rate. These findings indicate that even datasets from considerably different domains share non-robust features. Therefore, a method for supplementing the data needed for adversarial training is presented herein.

Undesirable features in standard learning

Wang et al. (2020) noted that convolutional neural networks (CNNs) can capture high-frequency components in images that are almost imperceptible to humans. This ability is thought to be closely related to the generalization behaviors of CNNs, especially their capacity to memorize random labels. Several studies (Geirhos et al., 2018; Bahng et al., 2019) reported that CNNs are biased towards local image features and that generalization performance can be improved by regularizing this bias. In this context, we propose a method of regularizing undesirable feature contributions using OOD data, assuming that undesirable features arise from the bias of CNNs or insufficient training data and are widely distributed in the input space.

3 Methods

3.1 Theoretical motivation

In this section, we theoretically analyze how OOD data can compensate for insufficient training samples in adversarial training, based on the dichotomy between robust and non-robust features. The theoretical motivation for using OOD data to reduce the contribution of undesirable features in standard learning can be found in Appendix B.

Setup and overview

We refer to in-distribution data as target data. Given a target dataset $\{(\tilde{\bm{x}}^i, y^i)\}_{i=1}^n \subset \mathcal{X} \times \{\pm 1\}$ sampled from a data distribution $\tilde{D}$, where $\mathcal{X}$ is the input space, we suppose that a feature extractor $\Phi: \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}^d$ and a linear classification model are trained on $\{(\tilde{\bm{x}}^i, y^i)\}_{i=1}^n$ and the target feature-label pairs $\{(\Phi(\tilde{\bm{x}}^i), y^i)\}_{i=1}^n$, respectively, so that the classification model attains a small expected standard loss. We then define an OOD dataset $\{\hat{\bm{x}}^i\}_{i=1}^m \subset \mathcal{X}$ sampled from a data distribution $\hat{D}$ that has the same distribution of non-robust features as $\tilde{D}$, following the preceding studies (Naseer et al., 2019; Moosavi-Dezfooli et al., 2017). After fixing $\Phi$ to facilitate the theoretical analysis of this framework, we demonstrate how adversarial training on the OOD dataset affects the weight values of our classifier.

Our data model

The feature extractor $\Phi$ can be considered to consist of several feature extractors $\phi: \mathcal{X} \to \mathcal{Z} \subset \mathbb{R}$. Hence, we can set the distributions of the target feature-label pair $(\Phi(\tilde{\bm{x}}), y) = (\tilde{\bm{z}}, y) \in \mathbb{R}^{d}\times\{\pm 1\}$ and the OOD feature vector $\Phi(\hat{\bm{x}}) = \hat{\bm{z}} \in \mathbb{R}^{d}$ as follows:

$$\begin{gathered} y \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad \tilde{z}_1 \sim \mathcal{N}(y, u^2), \quad \tilde{z}_2,\dots,\tilde{z}_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\eta y, 1), \\ q \stackrel{\text{u.a.r.}}{\sim} \{-1,+1\}, \quad \hat{z}_1 \sim \mathcal{N}(0, v^2), \quad \hat{z}_2,\dots,\hat{z}_{d+1} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\eta q, 1), \end{gathered} \qquad (4)$$

where u.a.r. stands for uniformly at random. From here on, we deal only with the OOD data, so the accents (tilde and caret) that distinguish the target data from the OOD data are omitted. In Equation (4), the feature $z_1 = \phi_1(\bm{x})$ is the output of a robust feature extractor $\phi_1$, and the other features $z_2,\dots,z_{d+1}$ are the outputs of non-robust feature extractors $\phi_2,\dots,\phi_{d+1}$. Since the OOD input vectors do not have the same robust features as the target input vectors, $z_1$ has zero mean and a small variance. Furthermore, because the OOD data have the same distribution of non-robust features as the target data, $z_2,\dots,z_{d+1}$ have a non-zero mean and a larger variance than the robust feature. In addition, $q$ represents the unknown label associated with the non-robust features, and $\eta$ is a non-negative constant that represents the degree of correlation between the non-robust features and the unknown label. Note that the input space of our classifier is the output space of $\Phi$ in our data model. Therefore, $\eta$ is not limited to a small value even in the context of an $\ell_p$-bounded adversary. Rather, the high degree of confidence that DNNs show for adversarial examples (Goodfellow et al., 2014) suggests that $\eta$ is large.

Our linear classification model

According to Section 2, we know that our linear classification model (logistic regression), defined as follows, yields a low expected standard loss while demonstrating high adversarial vulnerability.

$$p(y=+1 \mid \bm{z}) = \sigma(\bm{w}^{\top}\bm{z}), \quad p(y=-1 \mid \bm{z}) = 1 - \sigma(\bm{w}^{\top}\bm{z}), \quad \text{where} \; \bm{w} = \left[0, \frac{1}{d}, \dots, \frac{1}{d}\right]. \qquad (5)$$

To observe the effect of adversarial training on the OOD dataset, we train our classifier by applying stochastic gradient descent to the cross-entropy loss function $\mathcal{L}(\cdot;\bm{w})$.
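For reference, a small numpy sketch (ours) of the per-example gradient of this cross-entropy loss, $\partial\mathcal{L}(\bm{z},t;\bm{w})/\partial\bm{w} = \bm{z}\,(\sigma(\bm{w}^{\top}\bm{z}) - t)$, which, with $t = 0.5$, is the form used in the Appendix A proofs:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ce_grad_w(z, t, w):
    """Gradient of the binary cross-entropy loss L(z, t; w) of the logistic model
    in Equation (5) with respect to w: dL/dw = z * (sigma(w^T z) - t)."""
    return z * (sigmoid(w @ z) - t)
```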

First, we construct the adversarial feature vector $\bar{\bm{z}} = \Phi(\bm{x}+\bm{\delta})$, $\bm{\delta}\in S$, against our classifier for adversarial training.

Theorem 1.

Let $t \in [0,1]$ be the given target value of the feature vector $\bm{z}$ in our classification model, and let $\lambda$ be a non-negative constant. Then, when $t = 0.5$, the expectation of the adversarial feature vector is

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1\right], \quad \mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q, \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (6)$$

(All proofs of the theorems in this paper can be found in Appendix A.) Here, we assume an $\ell_\infty$-bounded adversary. Theorem 1 shows that the adversary pushes the non-robust features farther in the direction of the unknown label $q$, which coincides with our intuition. When the given target value $t$ is $0.5$, the adversary drives our classification model output towards zero or one to yield a large loss.

Our classification model is trained on the adversarial features shown in Theorem 1.

Theorem 2.

When $t = 0.5$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx 0, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (7)$$

Thus, adversarial training with $t = 0.5$ on the OOD dataset drives the weight values corresponding to the non-robust features toward zero while leaving $w_1$ unaffected by the gradient update. This shows that we can reduce the impact of non-robust features using the OOD dataset. However, we should not only reduce the influence of the non-robust features but also improve classification accuracy using the robust feature, which can be achieved through adversarial training on the target dataset. Accordingly, we show the effect of adversarial training on the OOD dataset when $w_1 > 0$ in our example.

Theorem 3.

When $t = 0.5$ and $w_1 > 0$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx \frac{1}{2}\lambda, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (8)$$

Theorem 3 shows that when $w_1 > 0$, adversarial training with $t = 0.5$ on the OOD dataset reduces the influence of all the features in $\mathcal{Z}$. However, the expected gradients for the weights associated with the non-robust features are always greater than the expected gradient for the weight associated with the robust feature. In addition, the greater the value of $\eta$, the faster the associated weight value converges to zero, which means that the contribution of highly influential non-robust features decreases rapidly. In the case of multiclass classification, $t = 0.5$ corresponds to the uniform distribution label $t_{\text{unif}} = [\frac{1}{c},\dots,\frac{1}{c}]$, where $c$ is the number of classes. Intuitively, the pair $(\bm{x}, t_{\text{unif}})$ means that the input $\bm{x}$ lies on the decision boundary. To reduce the training loss for $(\bm{x}, t_{\text{unif}})$, the classifier will learn that the features of $\bm{x}$ do not contribute to a specific class, which can be understood as removing the contributions of the features of $\bm{x}$.

3.2 Out-of-distribution data augmented training

Based on our theoretical analysis, we introduce Out-of-distribution data Augmented Training (OAT). OAT is training on the union of the target dataset $\mathcal{D}_t$ and the OOD dataset $\mathcal{D}_o$. When applying the proposed method, two points must be considered: 1) temporary labels associated with the OOD data samples are required for supervised learning, and 2) the loss functions corresponding to $\mathcal{D}_t$ and $\mathcal{D}_o$ should be properly combined.

First, we assign the uniform distribution label $t_{\text{unif}}$ to all the OOD data samples, as justified by our theoretical analysis. This labeling method enables us to leverage OOD data for supervised learning at no extra cost. Moreover, it means that our method is completely free from the limitations of the methods using UID data (see Section 1).

Second, although OOD data can be used to improve the standard and robust generalization of neural networks, training on target data is essential to enhance the classification accuracy of neural networks. In addition, according to Theorem 3, adversarial training on pairs of OOD data samples and $t_{\text{unif}}$ affects the weight for robust features as well as that for non-robust features. Hence, the balance between the losses from $\mathcal{D}_t$ and $\mathcal{D}_o$ is important in OAT. For this reason, we introduce a hyperparameter $\alpha \in \mathbb{R}^{+}$ into our proposed method and train neural networks as follows:

$$\begin{gathered} \text{OAT-A:} \quad \min_{\theta} \mathbb{E}_{(\bm{x}_t, y) \in \mathcal{D}_t}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{x}_t+\bm{\delta}, y; \theta)\right] + \alpha \, \mathbb{E}_{\bm{x}_o \in \mathcal{D}_o}\left[\max_{\bm{\epsilon}\in S}\mathcal{L}(\bm{x}_o+\bm{\epsilon}, t_{\text{unif}}; \theta)\right], \\ \text{OAT-S:} \quad \min_{\theta} \mathbb{E}_{(\bm{x}_t, y) \in \mathcal{D}_t}\left[\mathcal{L}(\bm{x}_t, y; \theta)\right] + \alpha \, \mathbb{E}_{\bm{x}_o \in \mathcal{D}_o}\left[\mathcal{L}(\bm{x}_o, t_{\text{unif}}; \theta)\right]. \end{gathered} \qquad (9)$$

Here, OAT-A and OAT-S represent OOD data augmented adversarial and standard learning, respectively. The pseudo-code for the overall procedure of our method is presented in Algorithm 1.

Algorithm 1 Out-of-distribution data Augmented Training (OAT)
0:  Target dataset $\mathcal{D}_t$, OOD dataset $\mathcal{D}_o$, uniform distribution label $t_{\text{unif}}$, batch size $n$, training iterations $T$, learning rate $\tau$, hyperparameter $\alpha$, adversarial attack function $\mathcal{G}$
1:  for $t=1$ to $T$ do
2:     $(X_t, Y) = \text{SAMPLE}(\text{dataset}=\mathcal{D}_t, \text{size}=\frac{n}{2})$
3:     $X_o = \text{SAMPLE}(\text{dataset}=\mathcal{D}_o, \text{size}=\frac{n}{2})$
4:     if Adversarial Learning then
5:        $[\bar{X}_t, \bar{X}_o] \leftarrow \mathcal{G}([X_t, X_o], [Y, t_{\text{unif}}]; \bm{\theta})$
6:     else if Standard Learning then
7:        $[\bar{X}_t, \bar{X}_o] \leftarrow [X_t, X_o]$
8:     end if
9:     model update:
10:     $\bm{\theta} \leftarrow \bm{\theta} - \tau \cdot \nabla_{\theta}\,\text{AVERAGE}\!\left(\frac{1}{2}\mathcal{L}(\bar{X}_t, Y; \bm{\theta}) + \frac{\alpha}{2}\mathcal{L}(\bar{X}_o, t_{\text{unif}}; \bm{\theta})\right)$
11:  end for
12:  Output: trained model parameter $\bm{\theta}$
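As a concrete illustration, the following PyTorch-style sketch implements one update of Algorithm 1. This is our illustrative code, not the authors' released implementation; `attack_fn` stands in for the attack function $\mathcal{G}$ (e.g., a PGD attack whose loss accepts probabilistic targets), and with `attack_fn=None` the step reduces to OAT-S.

```python
import torch
import torch.nn.functional as F

def oat_training_step(model, optimizer, x_t, y_t, x_o, alpha,
                      attack_fn=None, num_classes=10):
    """One OAT update following Algorithm 1 (illustrative sketch)."""
    y_t_soft = F.one_hot(y_t, num_classes).float()
    t_unif = torch.full((x_o.size(0), num_classes), 1.0 / num_classes,
                        device=x_o.device)
    if attack_fn is not None:                        # OAT-A branch (lines 4-5)
        x_t = attack_fn(model, x_t, y_t_soft)
        x_o = attack_fn(model, x_o, t_unif)

    # soft-label cross-entropy: -sum_c target_c * log p_c
    loss_t = -(y_t_soft * F.log_softmax(model(x_t), dim=1)).sum(dim=1).mean()
    loss_o = -(t_unif * F.log_softmax(model(x_o), dim=1)).sum(dim=1).mean()

    loss = 0.5 * loss_t + 0.5 * alpha * loss_o       # line 10 of Algorithm 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```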

4 Related studies

Adversarial examples

Many adversarial attack methods have been proposed, including the projected gradient descent (PGD) and Carlini & Wagner (CW) attacks (Madry et al., 2017; Carlini & Wagner, 2017). The PGD attack iterates the fast gradient sign method (FGSM) (Goodfellow et al., 2014) to find worst-case examples that maximize the training loss. The CW attack finds adversarial examples using the CW loss instead of the cross-entropy loss. Recently, Croce & Hein (2020) proposed AutoAttack (AA), a powerful ensemble attack comprising two extensions of the PGD attack and two existing attacks (Croce & Hein, 2019; Andriushchenko et al., 2019). To defend against these adversarial attacks, various adversarial defense methods have been developed (Goodfellow et al., 2014; Kannan et al., 2018). Madry et al. (2017) introduced adversarial training, which uses adversarial examples as training data, and Zhang et al. (2019) proposed TRADES, which optimizes a surrogate loss that is a sum of the natural error and the boundary error.
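For illustration, a minimal $\ell_\infty$ PGD sketch (our code; the $8/255$ budget, step size, and iteration count are common settings rather than the exact configurations of the cited papers):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=20):
    """Iterated FGSM steps with projection onto the l_inf ball of radius eps
    around x (inputs assumed to lie in [0, 1])."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv.detach()
```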

OOD detection

Lee et al. (2017) and Hendrycks et al. (2018) dealt with the overconfidence problem of confidence score-based OOD detectors. They used uniform distribution labels, as in our method, to resolve the overconfidence issue. In particular, Augustin et al. (2020) addressed the overconfidence problem in adversarial settings and proposed a method that is practically identical to OAT-A by combining adversarial training with ACET (Hein et al., 2019). They did not, however, address the generalization problems of neural networks. Specifically, our theoretical results explain the classification performance improvements that were considered only secondary effects in the abovementioned studies. Further related work can be found in Appendix C.

5 Experimental results and discussion

5.1 Experimental setup

OOD datasets

We created OOD datasets from the 80 Million Tiny Images dataset (Torralba et al., 2008) (80M-TI), following the work of Carmon et al. (2019), for CIFAR-10 and CIFAR-100, respectively. In addition, we resized ImageNet (using bilinear interpolation) to dimensions of $64\times 64$ and $160\times 160$ and divided it into datasets containing 10 and 990 classes, called ImgNet10 and ImgNet990, respectively. Furthermore, we resized Places365 and VisDA-17 for the experiments on ImgNet10 and cropped the Simpson Characters (Simpson) and Fashion Product (Fashion) datasets to dimensions of $32\times 32$ for the experiments on CIFAR-10 and CIFAR-100. Details on sourcing the OOD datasets can be found in Appendix D.
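As an example, preprocessing along these lines can be expressed with torchvision transforms (a sketch assuming a recent torchvision; the exact pipelines are specified in Appendix D):

```python
from torchvision import transforms

# Downsample ImageNet images with bilinear interpolation for the ImgNet10 /
# ImgNet990 splits, and resize-then-crop auxiliary OOD images to 32x32 for the
# CIFAR experiments (illustrative settings, not the exact Appendix D pipeline).
imgnet64 = transforms.Compose([
    transforms.Resize((64, 64),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])
ood_for_cifar = transforms.Compose([
    transforms.Resize(32),
    transforms.CenterCrop(32),
    transforms.ToTensor(),
])
```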

Implementation details

Implementation details, including the hyperparameter $\alpha$, architectures, batch sizes, and training iterations, are summarized in Appendix E. All models compared on the same target dataset are trained with the same training batch size and number of iterations to ensure a fair comparison. In other words, OAT uses a batch of $\frac{n}{2}$ target samples and $\frac{n}{2}$ OOD samples, which is compared with a model normally trained with a batch size of $n$. To evaluate the adversarial robustness of the models in our experiments, we apply several adversarial attacks including PGD, CW, and AA. Note that we denote PGD and CW attacks with $T$ iterative steps as PGD$T$ and CW$T$, respectively, and the original test set as Clean. We compare the following models in our experiments (our code is available at https://github.com/Saehyung-Lee/OAT):

  1. Standard: The model normally trained on the target dataset.

  2. PGD: The model trained using PGD-based adversarial training on the target dataset.

  3. TRADES: The model trained using TRADES on the target dataset.

  4. OAT$_{\text{PGD}}$: The model adversarially trained with OAT based on PGD.

  5. OAT$_{\text{TRADES}}$: The model adversarially trained with OAT based on TRADES.

  6. OAT$_{\mathcal{D}_o}$: The model normally trained with OAT using the OOD dataset $\mathcal{D}_o$.
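As noted in the implementation details above, each OAT training batch holds $\frac{n}{2}$ target samples and $\frac{n}{2}$ OOD samples, while baselines use full batches of size $n$. A minimal sketch of this batch composition (ours, with random tensors standing in for the real datasets):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

n = 128  # total batch size; the baseline is trained with full batches of size n
target_set = TensorDataset(torch.rand(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
ood_set = TensorDataset(torch.rand(1024, 3, 32, 32))

target_loader = DataLoader(target_set, batch_size=n // 2, shuffle=True, drop_last=True)
ood_loader = DataLoader(ood_set, batch_size=n // 2, shuffle=True, drop_last=True)

for (x_t, y_t), (x_o,) in zip(target_loader, ood_loader):
    pass  # one OAT update per mixed batch, as in Algorithm 1
```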

5.2 Study on the effectiveness of OAT in adversarial learning

Table 1: Accuracy (%) comparison of the OAT model with Standard, PGD, and TRADES on CIFAR10, CIFAR100, and ImgNet10 (64×64) under different threat models. We show the improved results compared to the counterpart of each model in bold.

Model       | Target           | OOD       | Clean | PGD100 | CW100 | AA
Standard    | CIFAR10          | -         | 95.48 |  0.00  |  0.00 |  0.00
PGD         | CIFAR10          | -         | 87.48 | 49.92  | 50.80 | 48.29
PGD+CutMix  | CIFAR10          | -         | 89.35 | 53.39  | 52.35 | 49.05
TRADES      | CIFAR10          | -         | 85.24 | 55.69  | 54.04 | 52.83
OAT_PGD     | CIFAR10          | 80M-TI    | 86.63 | 56.77  | 52.38 | 49.98
OAT_TRADES  | CIFAR10          | 80M-TI    | 86.76 | 59.66  | 55.71 | 54.63
Standard    | CIFAR100         | -         | 78.57 |  0.02  |  0.00 |  0.00
PGD         | CIFAR100         | -         | 61.37 | 24.66  | 24.68 | 22.76
TRADES      | CIFAR100         | -         | 58.84 | 30.24  | 27.97 | 26.91
OAT_PGD     | CIFAR100         | 80M-TI    | 61.54 | 30.02  | 27.85 | 25.36
OAT_TRADES  | CIFAR100         | 80M-TI    | 63.07 | 34.23  | 29.02 | 27.83
Standard    | ImgNet10 (64×64) | -         | 86.03 |  0.11  |  0.06 |  0.00
PGD         | ImgNet10 (64×64) | -         | 82.80 | 48.77  | 48.86 | 48.34
OAT_PGD     | ImgNet10 (64×64) | ImgNet990 | 81.91 | 59.03  | 54.69 | 53.83
Table 2: Comparison of OAT using various OOD datasets for improving robust generalization on CIFAR10 and ImgNet10 (64×64). None represents the baseline model (PGD).

CIFAR10:
OOD    | None  | SVHN  | Simpson | Fashion
Clean  | 87.48 | 86.16 | 86.79   | 85.84
PGD20  | 50.41 | 53.70 | 53.88   | 53.27
CW20   | 51.11 | 52.21 | 52.15   | 51.70

ImgNet10 (64×64):
OOD    | None  | Places365 | VisDA17
Clean  | 82.80 | 82.37     | 82.46
PGD20  | 49.00 | 59.86     | 55.34
CW20   | 48.91 | 56.23     | 53.80

Evaluating adversarial robustness

The improvements in the robust generalization of the PGD and TRADES models through the application of OAT are evaluated against PGD100, CW100, and AA with $\ell_\infty$-bounds of $\frac{8}{255}$ and $0.031$. The results are summarized in Table 1 and indicate that OAT improves the robust generalization of all adversarial training methods tested, regardless of the target dataset. In particular, the results against AA show that the effectiveness of OAT does not rely on obfuscated gradients (Athalye et al., 2018), because AA removes the possibility of gradient masking through a combination of strong adaptive attacks (Croce & Hein, 2019; 2020) and a black-box attack (Andriushchenko et al., 2019).

However, while OAT brings a significant improvement in robustness against PGD attacks, it is relatively less effective against CW attacks. According to our theoretical analysis, these results imply that the OOD data contain relatively few of the non-robust features used by CW attacks. In other words, powerful targeted attacks such as CW attacks exploit gradient information more selectively than untargeted attacks such as PGD attacks. To gain insight into this phenomenon, we train models with various values of the hyperparameter $\alpha$. The results suggest that the greater the influence of OOD data on the training process, the higher the robustness against PGD attacks and the lower the robustness against CW attacks (see Appendix G for more details).

In the absence of UID data, various data augmentation methods other than OAT can be employed. CutMix (Yun et al., 2019), a method that amplifies training data by cutting and pasting patches between training images, was recently proposed and exhibited excellent performance in classification and transfer learning tasks. By applying CutMix to adversarial training, the advantages of OAT over other data augmentation methods are assessed. As deduced from Table 1, OAT brings a higher level of robustness than CutMix. Because adversarial examples are closely related to the high-frequency components of images (Wang et al., 2020) and adversarial training reduces the model's sensitivity to these components, CutMix is a relatively ineffective augmentation for adversarial training: although it extends the global feature distribution, it makes no significant difference to the local feature distribution. In contrast, OAT can effectively regularize the classifier by exposing it to the various undesirable features held by the additional data during the learning process.

OAT with diverse OOD datasets

Non-robust features are widely shared among different datasets, as demonstrated by applying OAT using various OOD datasets. Table 2 indicates that OAT improves robust generalization for all OOD datasets, including those that have little correlation with the target dataset from a human perspective. In addition, given that relatively simple datasets with no background (Fashion and VisDA-17) are less effective than the others, we conjecture that non-robust features arise from the high complexity (Snodgrass & Vanderwart, 1980) of images in natural image datasets such as CIFAR and ImageNet.

When UID data are available

In adversarial training, in-distribution data reduce the sensitivity of neural networks to non-robust features and provide robustly generalizable features. On the other hand, the proposed theory shows that OAT obtains the data amplification effect only for non-robust features using OOD data. Therefore, the effectiveness of OAT is expected to decrease as more in-distribution data are augmented. To observe this trend empirically, the effect of OAT is investigated as a function of the additional in-distribution data size using previously published pseudo-labeled data (Carmon et al., 2019).

Figure 1: (a) The effectiveness of OAT with increasing pseudo-labeled data size under PGD100 and (b) the results of the state-of-the-art model improved by OAT under PG$_{\text{TRADES}}$, PG$_{\text{Madry}}$, PG$_{\text{RST}}$ (Carmon et al., 2019), and AA on CIFAR-10.

Surprisingly, Figure 1(a) shows that OAT still improves robust generalization even when many pseudo-labeled data are used. OAT is also combined with RST (Carmon et al., 2019), which recently achieved state-of-the-art adversarial robustness using UID data; Figure 1(b) demonstrates that OAT can further improve this state-of-the-art adversarial training method. This is presumably because the additional data include noisily labeled samples, which induce memorization in the learning process and thus impair generalization performance (Zhang et al., 2016). OAT appears to achieve a higher level of robust generalization by effectively suppressing the effects of such noisy data. Additional details on the experiments illustrated in Figure 1 are given in Appendix H.

5.3 Study on the effectiveness of OAT in standard learning

Randomization test

The effect of OAT is analyzed based on the randomization test (Zhang et al., 2016). The randomization test is an experiment aiming to observe the effective capacity of neural networks and the effect of regularization by training the model on a copy of the data where the true labels are replaced by random labels.
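A small sketch of the label-randomization step (ours; Zhang et al. (2016) describe the full protocol):

```python
import numpy as np

def randomize_labels(labels, num_classes=10, seed=0):
    """Replace the true labels with labels drawn uniformly at random; a model
    that still drives the training error to zero on such data is memorizing
    rather than learning label-correlated features."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=len(labels))
```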

Figure 2: (a) The results of the randomization test and (b) the test error curves for the Standard and OAT models along their optimization trajectories on CIFAR-10.

In Figure 2(a), it can be seen that the Standard model memorizes all training samples to obtain a low training error, whereas the OAT model maintains a high training error. These results show that OAT effectively regularizes neural networks so that they learn only features strongly correlated with the class labels. Figure 2(b) indicates that the OAT model learns more slowly than the Standard model owing to the influence of a strong regularizer in the early stages of training; however, it achieves better generalization performance at the end of training.

Evaluating classification performance

The effect of OAT on classification accuracy is shown in Table 3. When the number of training samples ($N$) is small, the influence of undesirable features is expected to be large. Therefore, the experiments are stratified by the number of training samples ($N$).

Table 3: Accuracy (%, mean over 5 runs) comparison of the OAT model using various datasets with the baseline models. We show the improved results compared to the counterpart of each model in bold. Pseudo-label is the model trained using pseudo-labeled data (from 80M-TI) in conjunction with the target dataset. Fusion is the model that applies OAT$_{\text{80M-TI}}$ to the Pseudo-label model. A detailed description can be found in Appendix D.
Dataset       | CIFAR10        | CIFAR100
N             | 2,500 / Full   | 2,500 / Full
Standard      | 65.44 / 94.46  | 24.41 / 74.87
OAT_SVHN      | 68.56 / 94.45  | 24.82 / 75.65
OAT_Simpson   | 70.08 / 94.43  | 27.04 / 76.03
OAT_80M-TI    | 72.49 / 95.20  | 26.13 / 76.30
Pseudo-label  |   -   / 95.28  |   -   / 77.24
Fusion        |   -   / 95.53  |   -   / 77.36

Dataset        | ImgNet10 (64×64) | ImgNet10 (160×160)
N              | 100 / Full       | 100 / Full
Standard       | 37.90 / 86.93    | 33.36 / 90.91
OAT_VisDA17    | 36.21 / 86.71    | 35.93 / 91.23
OAT_Places365  | 41.84 / 88.37    | 40.11 / 91.42
OAT_ImgNet990  | 42.18 / 87.88    | 40.41 / 91.87

Table 3 indicates that the effect of OAT is large when the amount of training data is small, as predicted. Additionally, OAT enhances the generalization performance more when using 80M-TI, ImgNet990, and Places365, whose input distributions are similar to those of the target datasets, than when using other OOD datasets. This empirically supports our theoretical analysis: OOD data that follow the same undesirable feature distribution as the target data can improve generalization through OAT. In addition, the results of the Pseudo-label and Fusion models show that even when pseudo-labeled data are available, OOD data can be leveraged to further improve standard generalization performance. Moreover, the proposed method, which requires no complex operations and is very simple to implement, can lead to higher performance when combined with existing data augmentation methods; as an example, the effectiveness of Mixup (Zhang et al., 2017) is enhanced by applying OAT (the experimental results are provided in Appendix I).

Finally, OAT in a standard learning scheme using the entire target dataset is generally less effective than OAT in an adversarial training scheme (see Appendix F for more details). Therefore, it can be inferred that the transferability of undesirable features is greater in adversarial settings than in standard settings. In other words, these results experimentally show that the number of training samples required for robust generalization is large compared to that required for standard generalization.

6 Conclusions and future directions

In this study, a method is proposed to compensate for the insufficient training data by using OOD data, which are less restrictive than UID data. It is theoretically demonstrated that training with OOD data can remove undesirable feature contributions in a simple Gaussian model. Experiments are performed on various OOD datasets, which surprisingly demonstrate that even OOD datasets that apparently have little correlation with the target dataset from the human perspective can help standard and robust generalization through the proposed method. These results imply that a common undesirable feature space exists among diverse datasets. In addition, the effectiveness of the proposed method is evaluated when extra UID data are available, and the results indicate that OAT can improve the generalization performance even when substantial pseudo-labeled data are used.

Nevertheless, some limitations need to be acknowledged. First, it is challenging to predict the effectiveness of the proposed method before applying it to a specific target-OOD dataset pair. Second, our method is less effective against strong targeted adversarial attacks, such as CW attacks, because it is difficult to generate deliberate adversarial attacks on in-distribution data in the process of OAT. Therefore, as a future research direction, we aim to quantify the degree to which undesirable features are shared between the target and OOD datasets and to construct strong adversarial attacks using OOD data.

Acknowledgements:

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2018R1A2B3001628], the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2020, and AIR Lab (AI Research Lab) in Hyundai & Kia Motor Company through HKMC-SNU AI Consortium Fund.

References

  • Aggarwal (2018) Param Aggarwal. Fashion product images (small), 2018. data retrieved from Kaggle, https://www.kaggle.com/paramaggarwal/fashion-product-images-small.
  • Andriushchenko et al. (2019) Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049, 2019.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
  • Attia (2018) Alexandre Attia. The simpsons characters data, 2018. data retrieved from Kaggle, https://www.kaggle.com/alexattia/the-simpsons-characters-dataset.
  • Augustin et al. (2020) Maximilian Augustin, Alexander Meinke, and Matthias Hein. Adversarial robustness on in-and out-distribution improves explainability. In European Conference on Computer Vision, pp.  228–245. Springer, 2020.
  • Bahng et al. (2019) Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. arXiv preprint arXiv:1910.02806, 2019.
  • Bellman (1961) Robert Bellman. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 3:2, 1961.
  • Ben-Tal et al. (2013) Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
  • Carmon et al. (2019) Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pp. 11190–11201, 2019.
  • Chan et al. (2020) Alvin Chan, Yi Tay, and Yew-Soon Ong. What it thinks is important is important: Robustness transfers through input gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Croce & Hein (2019) Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. arXiv preprint arXiv:1907.02044, 2019.
  • Croce & Hein (2020) Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690, 2020.
  • Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in neural information processing systems, pp. 6510–6520, 2017.
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Geirhos et al. (2018) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pp.  630–645. Springer, 2016.
  • Hein et al. (2019) Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  41–50, 2019.
  • Hendrycks et al. (2018) Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
  • Hendrycks et al. (2019) Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. arXiv preprint arXiv:1901.09960, 2019.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
  • Howard. Jeremy Howard. Imagenette. URL https://github.com/fastai/imagenette/.
  • Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Lee et al. (2017) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
  • Lee et al. (2020) Saehyung Lee, Hyungyu Lee, and Sungroh Yoon. Adversarial vertex mixup: Toward better adversarially robust generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Moosavi-Dezfooli et al. (2017) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1765–1773, 2017.
  • Najafi et al. (2019) Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems, pp. 5542–5552, 2019.
  • Naseer et al. (2019) Muhammad Muzammal Naseer, Salman H Khan, Muhammad Haris Khan, Fahad Shahbaz Khan, and Fatih Porikli. Cross-domain transferability of adversarial perturbations. In Advances in Neural Information Processing Systems, pp. 12885–12895, 2019.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Peng et al. (2017) Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge, 2017.
  • Poursaeed et al. (2018) Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4422–4431, 2018.
  • Schmidt et al. (2018) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026, 2018.
  • Snodgrass & Vanderwart (1980) Joan G Snodgrass and Mary Vanderwart. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. Journal of experimental psychology: Human learning and memory, 6(2):174, 1980.
  • Stanforth et al. (2019) Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli, et al. Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725, 2019.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Torralba et al. (2008) Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.
  • Tsipras et al. (2018) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
  • Vapnik & Vapnik (1998) Vladimir Vapnik and Vlamimir Vapnik. Statistical learning theory wiley. New York, 1, 1998.
  • Wang et al. (2020) Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P. Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Xie et al. (2019) Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252, 2019.
  • Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp.  6023–6032, 2019.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7472–7482, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/zhang19p.html. https://github.com/yaodongyu/TRADES.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhou et al. (2017) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Appendix A Proofs

Theorem 1.

Let $t \in [0,1]$ be the given target value of the feature vector $\bm{z}$ in our classification model, and let $\lambda$ be a non-negative constant. Then, when $t = 0.5$, the expectation of the adversarial feature vector is

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1\right], \quad \mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q, \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (10)$$
Proof.

Let $\bm{\delta}_{\bm{x}}$ be the adversarial perturbation in the input space. Then,

$$\begin{gathered} \bm{\delta}_{\bm{x}} = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\Phi(\bm{x}+\bm{\delta}), t; \bm{w}) \approx \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\Phi(\bm{x}) + \bm{\delta}^{\top}\nabla_{\bm{x}}\Phi, t; \bm{w}) \\ = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\bm{z} + \bm{\delta}^{\top}\nabla_{\bm{x}}\bm{z}, t; \bm{w}) = \operatorname*{arg\,max}_{\bm{\delta}} \mathcal{L}(\bm{z} + \bm{\delta}_{\bm{z}}, t; \bm{w}), \quad \text{where} \; \bm{\delta}_{\bm{z}} = \bm{\delta}^{\top}\nabla_{\bm{x}}\bm{z}. \end{gathered} \qquad (11)$$

Equation (11) suggests that constructing the adversarial perturbation $\bm{\delta}_{\bm{x}}$ in the input space can be approximated by finding the adversarial perturbation $\bm{\delta}_{\bm{z}}$ in the feature space; by linear approximation, the two are related by $\bm{\delta}_{\bm{z}} = \bm{\delta}_{\bm{x}}^{\top}\nabla_{\bm{x}}\bm{z}$. Hence, without loss of generality, the perturbed feature vector $\bar{\bm{z}}$ can be approximated by $\bm{z} + \lambda \cdot \text{sign}(\nabla_{\bm{z}}\mathcal{L}(\bm{z}, t; \bm{w}))$ with a $\lambda > 0$. Then,

$$\begin{gathered} \mathbb{E}_{\bm{z}}\left[\bar{z}_1\right] = \mathbb{E}_{\bm{z}}\left[z_1 + \lambda \cdot \text{sign}(\nabla_{z_1}\mathcal{L}(\bm{z}, t; \bm{w}))\right] = \mathbb{E}_{\bm{z}}\left[z_1 + \lambda \cdot \text{sign}\!\left(w_1\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right] \\ = \mathbb{E}_{\bm{z}}\left[z_1\right] + \lambda\,\mathbb{E}_{\bm{z}}\left[\text{sign}(0)\right] = \mathbb{E}_{\bm{z}}\left[z_1\right]. \end{gathered} \qquad (12)$$

Because our classification model was trained to minimize the expected standard loss, $\mathbb{E}_{\bm{z}}\left[\text{sign}\!\left(w_k\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right]$ can be approximated by $q$. Then,

$$\mathbb{E}_{\bm{z}}\left[\bar{z}_k\right] = \mathbb{E}_{\bm{z}}\left[z_k + \lambda \cdot \text{sign}(\nabla_{z_k}\mathcal{L}(\bm{z}, t; \bm{w}))\right] = \mathbb{E}_{\bm{z}}\left[z_k + \lambda \cdot \text{sign}\!\left(w_k\left(\sigma(\bm{w}^{\top}\bm{z}) - \tfrac{1}{2}\right)\right)\right] \approx \mathbb{E}_{\bm{z}}\left[z_k\right] + \lambda \cdot q. \qquad (13)$$
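A Monte Carlo sanity check of Theorem 1 under the OOD feature model of Equation (4) (our sketch; the dimension and the values of $\eta$, $\lambda$, and $v$ are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, lam, v, n = 100, 1.0, 0.2, 0.1, 200_000   # illustrative parameter values

q = rng.choice([-1.0, 1.0], size=n)               # unknown label of the OOD samples
z1 = rng.normal(0.0, v, size=n)                   # robust feature, Equation (4)
zk = rng.normal(eta * q[:, None], 1.0, (n, d))    # non-robust features z_2..z_{d+1}

w = np.full(d, 1.0 / d)                           # weights on z_2..z_{d+1}; w_1 = 0
s = np.sign(zk @ w)                               # sign of w_k*(sigma(w^T z) - 1/2), k >= 2

z1_bar = z1                                       # w_1 = 0: gradient w.r.t. z_1 vanishes
zk_bar = zk + lam * s[:, None]                    # FGSM-style step of size lam

print(np.mean(z1_bar - z1))                       # ~ 0, matching E[z1_bar] = E[z1]
print(np.mean(zk_bar[q > 0] - zk[q > 0]),
      np.mean(zk_bar[q < 0] - zk[q < 0]))         # ~ +lam and -lam, i.e. E[zk_bar] ~ E[zk] + lam*q
```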

Theorem 2.

When $t = 0.5$, the expected gradient of the loss function $\mathcal{L}(\bar{\bm{z}}, t; \bm{w})$ with respect to the weight vector $\bm{w}$ of our classification model is

$$\mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] \approx 0, \quad \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] \approx \frac{1}{2}(\eta+\lambda), \quad \text{where} \; k \in \{2,\dots,d+1\}. \qquad (14)$$
Proof.

Based on the adversarial vulnerability of our classifier, $\sigma(\bm{w}^{\top}\bar{\bm{z}})$ can be approximated by $\frac{1}{2}(1+q)$. Therefore,

$$\begin{gathered} \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_1}\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\left(\sigma(\bm{w}^{\top}\bar{\bm{z}}) - \tfrac{1}{2}\right)\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1 \cdot \sigma(\bm{w}^{\top}\bar{\bm{z}})\right] - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] \approx \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right](1+q) - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] = \tfrac{q}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_1\right] = 0, \\ \mathbb{E}_{\bar{\bm{z}}}\left[\frac{\partial\mathcal{L}}{\partial w_k}\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\left(\sigma(\bm{w}^{\top}\bar{\bm{z}}) - \tfrac{1}{2}\right)\right] = \mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k \cdot \sigma(\bm{w}^{\top}\bar{\bm{z}})\right] - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] \approx \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right](1+q) - \tfrac{1}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] = \tfrac{q}{2}\mathbb{E}_{\bar{\bm{z}}}\left[\bar{z}_k\right] = \tfrac{1}{2}(\eta+\lambda). \end{gathered} \qquad (15)$$

Theorem 3.

When t=0.5\mathnormal{t}=0.5 and w1>0\mathnormal{w}_{1}>0, the expected gradient of the loss function (𝐳¯,t;𝐰)\mathcal{L}(\bar{\bm{\mathnormal{z}}},\mathnormal{t};\bm{\mathnormal{w}}) with respect to the weight vector 𝐰\bm{\mathnormal{w}} of our classification model is

𝔼𝒛¯[w1]12λ,𝔼𝒛¯[wk]12(η+λ),wherek{2,,d+1}.\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]\approx\frac{1}{2}\lambda,\quad\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{k}}\right]\approx\frac{1}{2}(\eta+\lambda),\quad\textup{where}\;k\in\{2,\dots,d+1\}. (16)
Proof.
𝔼𝒛[z¯1]=𝔼𝒛[z1+λsign(z1(𝒛,t;𝒘))]=𝔼𝒛[z1+λsign(w1(σ(𝒘𝒛)12))]𝔼𝒛[z1]+λq,𝔼𝒛¯[w1]=𝔼𝒛¯[z¯1(σ(𝒘𝒛¯)12)]=𝔼𝒛¯[z¯1σ(𝒘𝒛¯)]12𝔼𝒛¯[z¯1]12𝔼𝒛¯[z¯1](1+q)12𝔼𝒛¯[z¯1]=q2𝔼𝒛¯[z¯1]=12λ.\begin{gathered}\mathbb{E}_{\bm{\mathnormal{z}}}\left[\bar{z}_{1}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}+\lambda\cdot\textup{sign}(\nabla_{\mathnormal{z}_{1}}\mathcal{L}(\bm{\mathnormal{z}},\mathnormal{t};\bm{\mathnormal{w}}))\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}+\lambda\cdot\textup{sign}(\mathnormal{w}_{1}(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\frac{1}{2}))\right]\\ \approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[z_{1}\right]+\lambda\cdot q,\\ \mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]=\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}(\sigma(\bm{\mathnormal{w}}^{\top}\bar{\bm{\mathnormal{z}}})-\frac{1}{2})\right]=\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\cdot\sigma(\bm{\mathnormal{w}}^{\top}\bar{\bm{\mathnormal{z}}})\right]-\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]\\ \approx\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right](1+q)-\frac{1}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]=\frac{q}{2}\mathbb{E}_{\bar{\bm{\mathnormal{z}}}}\left[\bar{\mathnormal{z}}_{1}\right]=\frac{1}{2}\lambda.\end{gathered} (17)

Appendix B The Theoretical Motivation of OAT in Standard Learning

Based on the same setup described in Section 3 of the main manuscript, we construct the data model for OAT in the following manner:

qu.a.r{1,+1},z1𝒩(0,σ12),z2𝒩(κq,σ22),whereκ+,σ12σ22.\mathnormal{q}\stackrel{{\scriptstyle u.a.r}}{{\sim}}\{-1,+1\},\quad\mathnormal{z}_{1}\stackrel{{\scriptstyle}}{{\sim}}\mathcal{N}(0,\sigma_{1}^{2}),\quad\mathnormal{z}_{2}\stackrel{{\scriptstyle}}{{\sim}}\mathcal{N}(\kappa q,\sigma_{2}^{2}),\quad\textup{where}\;\kappa\in\mathbb{R}^{+},\;\sigma_{1}^{2}\ll\sigma_{2}^{2}. (18)

For simplicity, we consider only one desirable feature extractor ϕ1 and one undesirable feature extractor ϕ2. In Equation (18), the features z1 = ϕ1(𝒙) and z2 = ϕ2(𝒙) denote the outputs of the feature extractors ϕ1 and ϕ2, respectively. As the OOD input vectors do not have the same desirable features as the target input vectors, the mean of z1 is zero. On the other hand, because the OOD data share the same distribution of the undesirable feature as the target data, the mean of z2 is non-zero. In addition, considering that the feature extractors are trained to respond sensitively to the target input vectors, the output variance of the undesirable feature extractor is much larger than that of the desirable feature extractor for OOD data, especially in the high-dimensional case. q represents the unknown label associated with the undesirable feature, and κ is a positive constant that represents the degree of correlation between the undesirable feature and the unknown label.

We can prove the following theorems for a logistic regression model that is not strongly dependent on the desirable feature.

Theorem 4.

The OOD data barely affect the gradient update of the weight associated with the desirable feature.

Proof.

Because we assumed a linear classifier that is not strongly dependent on the desirable feature, owing to the bias of CNNs or insufficient training data, 𝒘⊤𝒛 = w1z1 + w2z2 can be approximated by w2z2. Then, for any target value t, the expected gradient of the loss with respect to w1 is

𝔼𝒛[w1]=𝔼𝒛[(σ(𝒘𝒛)t)z1]𝔼𝒛[(σ(w2z2)t)z1]=𝔼𝒛[σ(w2z2)t]𝔼𝒛[z1]=0.\begin{gathered}\mathbb{E}_{\bm{\mathnormal{z}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{1}}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\mathnormal{t})\mathnormal{z}_{1}\right]\\ \approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\mathnormal{t})\mathnormal{z}_{1}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\mathnormal{t}\right]\mathbb{E}_{\bm{\mathnormal{z}}}\left[\mathnormal{z}_{1}\right]=0.\end{gathered} (19)

Note that in Equation (19), the gradient of the loss function with respect to the weight value w1\mathnormal{w}_{1} is zero regardless of t\mathnormal{t}. The optimal tt is then a value that makes w2\mathnormal{w}_{2} converge to zero, thereby removing the influence of the undesirable feature.

Theorem 5.

When t=0.5, standard learning on the OOD data causes the weight value corresponding to the undesirable feature to converge to 0.

Proof.
𝔼𝒛[w2]=𝔼𝒛[(σ(𝒘𝒛)12)z2]𝔼𝒛[(σ(w2z2)12)z2].\mathbb{E}_{\bm{\mathnormal{z}}}\left[\frac{\partial\mathcal{L}}{\partial\mathnormal{w}_{2}}\right]=\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\bm{\mathnormal{w}}^{\top}\bm{\mathnormal{z}})-\frac{1}{2})\mathnormal{z}_{2}\right]\approx\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]. (20)

When w2>0\mathnormal{w}_{2}>0, z2σ(w2z2)>12z2\mathnormal{z}_{2}\cdot\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})>\frac{1}{2}\mathnormal{z}_{2} with high probability. Hence,

𝔼𝒛[(σ(w2z2)12)z2]>0,wherew2>0.\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]>0,\quad\textup{where}\;\mathnormal{w}_{2}>0. (21)

Similarly,

𝔼𝒛[(σ(w2z2)12)z2]<0,wherew2<0.\mathbb{E}_{\bm{\mathnormal{z}}}\left[(\sigma(\mathnormal{w}_{2}\mathnormal{z}_{2})-\frac{1}{2})\mathnormal{z}_{2}\right]<0,\quad\textup{where}\;\mathnormal{w}_{2}<0. (22)

In both cases, the expected gradient has the same sign as w2; because gradient descent updates w2 in the opposite direction of the gradient, w2 converges to 0.
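To illustrate Theorems 4 and 5 numerically, the following NumPy sketch samples from the data model in Equation (18) and runs gradient descent on the logistic loss with the uniform target t = 0.5. The constants below (κ, σ1, σ2, the step size, and the initial weights) are illustrative assumptions rather than values used in our experiments.

```python
import numpy as np

# Minimal simulation of the Appendix B data model (Equation 18).
# All constants (kappa, sigma1, sigma2, step size, initial weights) are
# illustrative assumptions chosen only to visualize the trend.
rng = np.random.default_rng(0)
n, kappa, sigma1, sigma2 = 100_000, 1.0, 0.1, 3.0

q = rng.choice([-1.0, 1.0], size=n)          # unknown label of the undesirable feature
z1 = rng.normal(0.0, sigma1, size=n)         # desirable feature: zero mean on OOD data
z2 = rng.normal(kappa * q, sigma2, size=n)   # undesirable feature: shared with target data

w = np.array([0.05, 1.0])                    # classifier weakly tied to z1, strongly to z2
t, lr = 0.5, 0.1                             # uniform-distribution target and step size

for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(w[0] * z1 + w[1] * z2)))   # sigmoid(w^T z)
    grad = np.array([np.mean((p - t) * z1),               # dL/dw1 (Theorem 4: ~0)
                     np.mean((p - t) * z2)])              # dL/dw2 (Theorem 5: same sign as w2)
    w -= lr * grad

print(w)   # w[1] shrinks toward 0 while w[0] is left essentially unchanged
```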

Appendix C Further Related Works

Using Unlabeled Data to Improve Adversarial Robustness

Stanforth et al. (2019) and Carmon et al. (2019) analyzed the larger sample complexity required for adversarially robust generalization (Schmidt et al., 2018). Based on a model described in that prior work, they theoretically proved that unlabeled data can alleviate the need for the large sample complexity of robust generalization. Accordingly, they proposed semi-supervised learning techniques that augment the training dataset with extra unlabeled data, using pseudo-labels obtained from a model trained on the existing training dataset. They experimented on CIFAR-10 with the 80 Million Tiny Images dataset (Torralba et al., 2008) (80M-TI) as an extra training dataset, improving the adversarial robustness of the model. Najafi et al. (2019) also used unlabeled data for adversarial robustness by extending distributionally robust learning (Ben-Tal et al., 2013) to semi-supervised scenarios. Instead of pseudo-labels, they used soft labels, which are chosen from the label set and softened according to the loss values of the data. Their results include experiments on the MNIST, CIFAR-10, and SVHN datasets.

Augustin et al. (2020)

Comparison with Bad GAN

Dai et al. (2017) theoretically showed that a perfect generator cannot enhance generalization performance and that good semi-supervised learning actually requires a bad generator. Based on this analysis, they proposed an empirical formulation to generate low-density samples in the input space. There are two main differences between Bad GAN and OAT. First, Dai et al. (2017) considered only semi-supervised learning settings, whereas OAT can also be applied to adversarial settings. Second, data augmentation using a GAN has clear limitations: Dai et al. (2017) penalized high-density samples to generate low-density samples in the input space, but this approach is inefficient in a high-dimensional input space. In contrast, OAT directly uses diverse real OOD data to regularize the model, and a similar shift from synthetic to real auxiliary data can also be found in OOD detection (Hendrycks et al., 2018).

Appendix D Sourcing OOD Datasets

We created OOD datasets from 80M-TI for CIFAR-10 and CIFAR-100 by following the data-sourcing procedure of Carmon et al. (2019). In other words, we trained an 11-way classifier (one extra class for OOD) on a training set consisting of the CIFAR-10 dataset and 1M images randomly sampled from 80M-TI with keywords that do not appear in CIFAR-10. We then applied the classifier to 80M-TI and sorted the images by their confidence in the OOD class. The 1M and 5M images with the highest confidence were used for OAT-A and OAT-S, respectively. In Table 3, Fusion has a batch size of target n/2 + pseudo-labeled n/4 + OOD n/4, which is compared with the Pseudo-label model trained with a batch size of target n/2 + pseudo-labeled n/2. In addition, we resized ImageNet (using bilinear interpolation) to dimensions of 64×64 and 160×160 and divided it into datasets containing 10 and 990 classes, called ImgNet10 (train set size = 9894, test set size = 3500) and ImgNet990, respectively. The classes were divided based on the Imagenette dataset (Howard, ), and the experimental results for a differently divided dataset (Imagewoof) are given in Appendix F. We increased the number of ImgNet990 images by a factor of 10 through random cropping and created the OOD datasets through the same process as for 80M-TI. Furthermore, we resized (bilinear interpolation) Places365 (Zhou et al., 2017) and VisDA17 (Peng et al., 2017) for the experiments on ImgNet10 and cropped Simpson Characters (Simpson) (Attia, 2018) and Fashion Product Images (Fashion) (Aggarwal, 2018) to dimensions of 32×32 for the experiments on CIFAR.
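As an illustration of the sorting step above, the following PyTorch sketch ranks a pool of candidate images by the softmax confidence assigned to the extra OOD class and keeps the top-k. The function and loader names are placeholders; this is a minimal sketch rather than the pipeline used to build our datasets.

```python
import torch

@torch.no_grad()
def select_ood_images(ood_classifier, candidate_loader, ood_class_idx, k, device="cuda"):
    """Rank candidate images by the softmax confidence of the extra OOD class and
    return the dataset indices of the top-k most OOD-like images (a rough sketch;
    candidate_loader is assumed to yield (images, dataset_indices) batches)."""
    ood_classifier.eval().to(device)
    confidences, indices = [], []
    for images, idx in candidate_loader:        # candidate pool, e.g. images from 80M-TI
        probs = torch.softmax(ood_classifier(images.to(device)), dim=1)
        confidences.append(probs[:, ood_class_idx].cpu())
        indices.append(idx)
    confidences, indices = torch.cat(confidences), torch.cat(indices)
    order = torch.argsort(confidences, descending=True)   # most confidently OOD first
    return indices[order[:k]]                              # e.g. k = 1M (OAT-A) or 5M (OAT-S)
```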

Table 4: Implementation details for the experiments of OAT-A
Target | Architecture | α | Training steps | Batch size
CIFAR | WRN-34-10 (Zagoruyko & Komodakis, 2016) | 1.0 | 80K | 128
ImgNet10 | ResNet18 (He et al., 2016) | 1.0 | 15.4K | 128
Table 5: Implementation details for the experiments of OAT-S
Target | N | Architecture | α | Training steps | Batch size
CIFAR | 2500 | ResNet18 | 1.0 | 4000 | 128
CIFAR | 50K | ResNet18 | 1.0 | 78K | 128
ImgNet10 (64 x 64) | 100 | WRN-22-10 | 1.0 | 200 | 100
ImgNet10 (64 x 64) | 9894 | WRN-22-10 | 0.2 | 15.4K | 128
ImgNet10 (160 x 160) | 100 | ResNet18 | 1.0 | 400 | 100
ImgNet10 (160 x 160) | 9894 | ResNet18 | 0.1 | 38.5K | 128

Appendix E Implementation Details

For all the experiments except TRADES and OATTRADES, the initial learning rate is set to 0.1. The learning rate is multiplied by 0.1 at 50% and 75% of the total training steps, and the weight decay factor is set to 2e-4. We use the same adversarial perturbation budget, ϵ=8, as in Madry et al. (2017). For adversarial training, we report the maximum adversarial robustness of the models on the test set after the first learning rate decay; for standard learning, we report the best test accuracy observed during training (an empirical upper bound). The other details are summarized in Tables 4 and 5. For TRADES and OATTRADES, we use a batch size of 64 and train the models with the same configurations as Zhang et al. (2019).
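As a rough illustration, these settings correspond to a PyTorch configuration along the following lines. The momentum value of 0.9 and the [0, 1] input scaling under which ϵ = 8 pixel levels equals 8/255 are our assumptions, and build_optimizer is a placeholder rather than part of our released code.

```python
import torch

# Minimal sketch of the optimization schedule described above (not the exact training code).
# Assumptions: SGD with momentum 0.9; scheduler.step() is called once per training step.
def build_optimizer(model, total_steps, base_lr=0.1, weight_decay=2e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)
    # Multiply the learning rate by 0.1 at 50% and 75% of the total training steps.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[total_steps // 2, (3 * total_steps) // 4], gamma=0.1)
    return optimizer, scheduler

EPSILON = 8.0 / 255.0   # perturbation budget, assuming inputs scaled to [0, 1]
# Example: optimizer, scheduler = build_optimizer(model, total_steps=80_000)  # CIFAR, Table 4
```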

Appendix F Further Discussion about the Effectiveness of OAT

We created a new ImgNet10 dataset with reference to Imagewoof (Howard, ) to learn more about the effects of OAT. Imagewoof is a subset of ImageNet, and all classes are dog breeds. We tested whether OAT is effective for the newly constructed ImgNet10 dataset, and the experimental results of OAT-A and OAT-S are shown in Tables 6 and 7, respectively.

Table 6: Accuracy (%) comparison of the OAT model with Standard and PGD on ImgNet10 (64×\times64) under different threat models. We show the improved results compared to the counterpart of each model in bold.
Model | OOD | Clean | PGD20 | CW20
Standard | - | 75.77 | - | -
PGD | - | 53.64 | 16.18 | 14.81
OATPGD | ImgNet990 | 48.24 | 22.09 | 17.1
OATPGD | Places365 | 50.45 | 22.57 | 17.37
Table 7: Accuracy (%, mean over 5 runs) comparison of the OAT models with the baseline models. We show the improved results compared to the counterpart of each model in bold.
Target | Model | N=100 | N=250 | N=500 | N=1250 | N=2500 | Full
ImgNet10 (64 x 64) | Standard | 19.42 | 25.48 | 31.49 | 47.04 | 57.12 | 75.77
ImgNet10 (64 x 64) | OATImgNet990 | 22.74 | 30.07 | 35.16 | 48.84 | 58.14 | 75.17
ImgNet10 (64 x 64) | OATPlaces365 | 22.36 | 30.15 | 33.57 | 48.16 | 57.15 | 75.63
ImgNet10 (160 x 160) | Standard | 20.03 | 25.36 | 30.69 | 50.69 | 64.82 | 81.22
ImgNet10 (160 x 160) | OATImgNet990 | 23.64 | 30.05 | 37.09 | 60.01 | 70.39 | 83.16
ImgNet10 (160 x 160) | OATPlaces365 | 23.09 | 31.74 | 37.22 | 59.45 | 69.90 | 83.04

Tables 6 and 7 show that OAT is also largely useful for generalization on the new ImgNet10. However, from the results for ImgNet10 (64×64) in Table 7, we can speculate that OAT helps as a regularizer only when an excessive number of features are involved in the classification, because OAT yields no performance improvement on the new ImgNet10 (64×64) dataset when the full training set is used. The new ImgNet10 (64×64) is believed to have lost much of its categorical features, since all of its classes are dog breeds and the images are heavily downscaled. In other words, OAT is beneficial for the overfitting problem, but not for the underfitting problem. We can confirm this by artificially reducing the number of training samples, which increases the number of features that can distinguish the per-class sample distributions, and then observing that applying OAT improves generalization performance.

Appendix G Ablation Study

Table 8: Adversarial robustness (%) as hyperparameter α\alpha changes
α\alpha PGD20 CW20
1.0 57.45 52.65
2.0 58.25 52.16
3.0 58.31 52.02
4.0 59.04 51.71
5.0 59.48 51.24

Our experiments indicate that OAT can enhance adversarial robustness by removing the contributions of non-robust features present in additional data, and that this effect transfers to in-distribution data. However, the increase in robustness against targeted attacks appears smaller than that against untargeted attacks. To gain insight into this phenomenon, we train OAT models for various α values and investigate the difference between untargeted and targeted attacks. Table 8 shows that the greater the influence of the OOD data on training, the higher the robustness against PGD20 and the lower the robustness against CW20. In light of our theoretical analysis, this suggests that OOD data contain many of the features used for untargeted attacks but relatively few of those used for targeted attacks. Because targeted attacks are stronger than untargeted attacks (Carlini & Wagner, 2017), the results in Table 8 provide empirical evidence for a trade-off between the transferability of adversarial perturbations and the strength of adversarial attacks.

A similar trend was reported by Chan et al. (2020), who showed that input gradient adversarial matching (IGAM) can transfer robustness across different tasks. Their results show that IGAM-trained models attain similar or higher robustness than baseline models against weak attacks, such as FGSM or low-step PGD, but remain vulnerable to strong attacks. The fine-tuning-based method of Hendrycks et al. (2019) can also be regarded as an attempt to increase robustness by using other datasets, but the abovementioned phenomenon is not observed there. This is because the method does not transfer robustness learned from other datasets to the target dataset; rather, it reuses, with little modification, a function learned from a dataset that has a large sample complexity and a data distribution similar to that of the target data. This can be confirmed from the small number of training iterations and the small learning rate involved in the fine-tuning process. Moreover, applying the same method to a dataset with a small sample size or one far from the target data distribution has no effect (Chan et al., 2020).

Appendix H When UID Data Are Available

Table 9: Error rate (%) comparison of OAT with Mixup on CIFAR10, CIFAR100, and ImgNet10. We show the best result for each target dataset in bold.
Model | Target | Ratio | Error rate
Standard | CIFAR10 | - | 5.54
OAT80M-TI | CIFAR10 | - | 4.80
Mixup | CIFAR10 | 0.0 | 4.11
OAT80M-TI+Mixup | CIFAR10 | 0.6 | 3.88
Standard | CIFAR100 | - | 25.13
OAT80M-TI | CIFAR100 | - | 24.35
Mixup | CIFAR100 | 0.0 | 22.63
OAT80M-TI+Mixup | CIFAR100 | 0.6 | 21.60
Standard | ImgNet10 (64 x 64) | - | 13.07
OATImgNet990 | ImgNet10 (64 x 64) | - | 12.12
OATPlaces365 | ImgNet10 (64 x 64) | - | 11.63
Mixup | ImgNet10 (64 x 64) | 0.0 | 11.60
OATImgNet990+Mixup | ImgNet10 (64 x 64) | 0.7 | 10.77
OATPlaces365+Mixup | ImgNet10 (64 x 64) | 0.7 | 10.39
Standard | ImgNet10 (160 x 160) | - | 9.09
OATImgNet990 | ImgNet10 (160 x 160) | - | 8.13
OATPlaces365 | ImgNet10 (160 x 160) | - | 8.58
Mixup | ImgNet10 (160 x 160) | 0.0 | 7.04
OATImgNet990+Mixup | ImgNet10 (160 x 160) | 0.3 | 6.46
OATPlaces365+Mixup | ImgNet10 (160 x 160) | 0.3 | 6.90

We train the OAT models in conjunction with UID data as follows:

\begin{gathered}\min_{\theta}\;\alpha_{\textup{in}}\,\mathbb{E}_{(\bm{\mathnormal{x}}_{t},y)\in\mathcal{D}_{t}}\left[\max_{\bm{\delta}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{t}+\bm{\delta},\mathnormal{y};\theta)\right]+\alpha_{\textup{o}}\,\mathbb{E}_{\bm{\mathnormal{x}}_{o}\in\mathcal{D}_{o}}\left[\max_{\bm{\epsilon}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{o}+\bm{\epsilon},\mathnormal{t}_{\textup{unif}};\theta)\right]\\ +\alpha_{\textup{UID}}\,\mathbb{E}_{\bm{\mathnormal{x}}_{u}\in\mathcal{D}_{\textup{UID}}}\left[\max_{\bm{\zeta}\in S}\mathcal{L}(\bm{\mathnormal{x}}_{u}+\bm{\zeta},\mathnormal{y}_{\textup{pseudo}};\theta)\right].\end{gathered} (23)

In the experiment of Figure 1(a), the Pseudo-label models are trained via a PGD-based approach, and every batch consists of 64 CIFAR-10 samples and 64 pseudo-labeled samples. The OAT+Pseudo-label models are also trained via a PGD-based approach, and every batch consists of 64 CIFAR-10 samples, 32 pseudo-labeled samples, and 32 OOD samples (80M-TI). The hyperparameters (αin, αo, αUID) are set to (1/3, 1/3, 1/3). All other conditions are the same as described in Appendix E. In the experiment of Figure 1(b), we train the models with the same configurations as Carmon et al. (2019), but every batch for the OAT+RST model comprises 128 pseudo-labeled samples, 64 CIFAR-10 samples, and 64 OOD samples (the RST model has a batch size of 256). The hyperparameters (αin, αo, αUID) are set to (0.25, 0.25, 0.5).
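For reference, the following PyTorch-style sketch assembles the per-batch objective corresponding to Equation (23). Here, attack() is a placeholder for the inner maximization (e.g., a PGD attack), the soft cross-entropy toward the uniform label is one natural instantiation of the OOD term, and the loss weights are passed in explicitly; this is a minimal sketch rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def oat_uid_loss(model, x_t, y_t, x_o, x_u, y_pseudo, attack,
                 alpha_in=1/3, alpha_o=1/3, alpha_uid=1/3, num_classes=10):
    """One mini-batch objective in the spirit of Equation (23).
    attack(model, x, target) is a placeholder for the inner maximization (e.g., PGD).
    Default weights echo Figure 1(a); Figure 1(b) uses (0.25, 0.25, 0.5)."""
    # Target data: adversarial cross-entropy with the true labels.
    loss_in = F.cross_entropy(model(attack(model, x_t, y_t)), y_t)

    # OOD data: adversarial soft cross-entropy toward the uniform-distribution label t_unif.
    t_unif = torch.full((x_o.size(0), num_classes), 1.0 / num_classes, device=x_o.device)
    logits_o = model(attack(model, x_o, t_unif))
    loss_o = -(t_unif * F.log_softmax(logits_o, dim=1)).sum(dim=1).mean()

    # UID data: adversarial cross-entropy with pseudo-labels.
    loss_uid = F.cross_entropy(model(attack(model, x_u, y_pseudo)), y_pseudo)

    return alpha_in * loss_in + alpha_o * loss_o + alpha_uid * loss_uid
```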

Appendix I Experimental Results for Mixup

Zhang et al. (2017) proposed a data augmentation method named Mixup. Mixup generates training examples as x̃ = αxi + (1−α)xj and ỹ = αyi + (1−α)yj, where (xi, yi) and (xj, yj) are two examples drawn at random from the training data and α ∈ [0,1]. Here, we show that regions outside the convex combinations of the training data can also be effectively regularized using OOD data. Table 9 shows that the models trained with the combination of OAT and Mixup consistently achieve the lowest error rates. The OAT+Mixup models are trained with the following algorithm:

Algorithm 2 OAT+Mixup
0:  Ratio γ\gamma, Target dataset 𝒟t\mathcal{D}_{t}, OOD dataset 𝒟o\mathcal{D}_{o}, uniform distribution label tunift_{\textup{unif}}, batch size nn, training iterations TT, learning rate τ\tau
1:  for t=1t=1 to TT do
2:     (Xt,Yt)=SAMPLE(dataset=𝒟t,size=n)(X_{t},Y_{t})=\textup{SAMPLE}(\textup{dataset}=\mathcal{D}_{t},\textup{size}=n)
3:     Xo=SAMPLE(dataset=𝒟o,size=nγ)X_{o}=\textup{SAMPLE}(\textup{dataset}=\mathcal{D}_{o},\textup{size}=\lfloor n*\gamma\rfloor)
4:     (X¯t,Y¯t)PERMUTE(Xt,Yt)(\bar{X}_{t},\bar{Y}_{t})\leftarrow\textup{PERMUTE}(X_{t},Y_{t})
5:     (X¯t[0:nγ],Y¯t[0:nγ])(Xo,tunif)(\bar{X}_{t}\left[0:\lfloor n*\gamma\rfloor\right],\bar{Y}_{t}\left[0:\lfloor n*\gamma\rfloor\right])\leftarrow(X_{o},t_{\textup{unif}})
6:     (X,Y)MIXUP(Xt,Yt,X¯t,Y¯t)(X,Y)\leftarrow\textup{MIXUP}(X_{t},Y_{t},\bar{X}_{t},\bar{Y}_{t})
7:     model update:
8:     𝜽𝜽τθAVERAGE((X,Y;𝜽))\bm{\theta}\leftarrow\bm{\theta}-\tau\cdot\nabla_{\theta}\textup{AVERAGE}(\mathcal{L}(X,Y;\bm{\theta}))
9:  end for
10:  Output: trained model parameter 𝜽\bm{\theta}
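To make Algorithm 2 concrete, the following PyTorch sketch implements one training iteration. The one-hot label encoding, the Beta(1, 1) sampling of the mixing coefficient, and the soft-label cross-entropy are our assumptions about details the pseudocode leaves open, and all function names are placeholders.

```python
import torch
import torch.nn.functional as F

def oat_mixup_step(model, optimizer, x_t, y_t_onehot, x_o, num_classes):
    """One iteration of Algorithm 2 (OAT+Mixup), written as a sketch.
    x_t, y_t_onehot: target batch of size n with one-hot labels;
    x_o: floor(n * gamma) OOD images sampled for this iteration."""
    n, m = x_t.size(0), x_o.size(0)                  # m = floor(n * gamma)
    perm = torch.randperm(n, device=x_t.device)      # PERMUTE(X_t, Y_t)
    x_bar, y_bar = x_t[perm].clone(), y_t_onehot[perm].clone()

    # Overwrite the first floor(n * gamma) permuted samples with OOD data and uniform labels.
    x_bar[:m] = x_o
    y_bar[:m] = 1.0 / num_classes

    # MIXUP: convex combination of the target batch and the (partially OOD) permuted batch.
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()   # assumed mixing distribution
    x_mix = lam * x_t + (1.0 - lam) * x_bar
    y_mix = lam * y_t_onehot + (1.0 - lam) * y_bar

    # Soft-label cross-entropy, averaged over the batch (AVERAGE in Algorithm 2).
    loss = -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```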