Semantic Perturbations with Normalizing Flows for Improved Generalization

Oğuz Kaan Yüksel^† Sebastian U. Stich^† Martin Jaggi^† Tatjana Chavdarova^†,‡
^† Machine Learning and Optimization Lab, EPFL
^‡ Department of Electrical Engineering and Computer Sciences, UC Berkeley

Abstract

Data augmentation is a widely adopted technique for avoiding overfitting when training deep neural networks. However, this approach requires domain-specisfic knowledge and is often limited to a fixed set of hard-coded transformations. Recently, several works proposed to use generative models for generating semantically meaningful perturbations to train a classifier. However, because accurate encoding and decoding are critical, these methods, which use architectures that approximate the latent-variable inference, remained limited to pilot studies on small datasets.

Exploiting the exactly reversible encoder-decoder structure of normalizing flows, we perform on-manifold perturbations in the latent space to define fully unsupervised data augmentations. We demonstrate that such perturbations match the performance of advanced data augmentation techniques—reaching $96.6\%$ test accuracy for CIFAR-10 using ResNet-18 and outperform existing methods, particularly in low data regimes—yielding $10$ – $25\%$ relative improvement of test accuracy from classical training. We find that our latent adversarial perturbations adaptive to the classifier throughout its training are most effective, yielding the first test accuracy improvement results on real-world datasets—CIFAR-10/100—via latent-space perturbations.

^†^†footnotetext: Correspondence to [email protected].

1 Introduction

Deep Neural Networks (DNNs) have shown impressive results across several machine learning tasks [17, 40], and—due to their automatic feature learning—have revolutionized the field of computer vision. However, their success depends on the availability of large annotated datasets for the task at hand. Thus, among other overfitting techniques—such as L1/L2 regularization, dropout [52], early stopping, among others—data augmentation remains a mandatory component that is frequently used in practice.

Traditional data augmentation (DA) techniques apply a predefined set of transformations to the training samples that do not change the corresponding class label, to increase the number of training samples. As this approach is limited to making the classifier robust only to the fixed set of hard-coded transformations, advanced methods incorporate more loosely defined transformations in the data space. For example, mixup [66] uses convex combinations of pairs of examples and their labels, and cutout [9] randomly masks square regions of the input sample. Albeit implicitly, these methods still require domain-specific knowledge that, for example, such masking will not change the label.

Surprisingly, in the context of computer vision, it has been shown that small perturbations in image space that are not visible to the human eye can fool a well-performing classifier into making wrong predictions. This observation motivated an active line of research on adversarial training [see 4, and references therein]—namely, training with such adversarial samples to obtain robust classifiers. However, further empirical studies showed that such training reduces the training accuracy, indicating the two objectives are competing [58, 54].

Stutz et al. [53] postulate that this robustness-generalization trade-off appears due to using off-manifold adversarial attacks that leave the data-manifold and that on-manifold adversarial attacks can improve generalization. For verifying this hypothesis, the authors proposed to use perturbations in the latent space of a generative model. Their proposed method employs (class-specific) models named VAE-GANs [33, 48]—which are based on Generative Adversarial Networks [16] and, to tackle their non-invertibility, further combine GANs with Variational Autoencoders [28]. However, the VAE-GAN model introduces hard-to-tune hyperparameters, and notably, it optimizes a lower bound on the log-likelihood of the data. Moreover, improved test accuracy was only shown on toy datasets [53, Fig. 5], and yet in some cases, the test accuracy did not improve relative to classical training. We observe that on real-world datasets, such training can decrease the test accuracy, see §5.

In this work, we focus on the possibility of employing advanced normalizing flows such as Glow [27], to define entirely unsupervised augmentations—contrasting with pre-defined fixed transformations—with the same goal of improving the generalization of deep classifiers. Although normalizing flows have gained little attention in our community relative to GANs and Autoregressive models, they offer appealing advantages over these models, namely: (i) exact latent-variable inference and log-likelihood evaluation, and (ii) efficient inference and synthesis that can be parallelized [27], respectively. We exploit the exactly reversible encoder-decoder structure of normalizing flows to perform efficient and controllable augmentations in the learned manifold space.

Contributions.

Our contributions can be summarized as:

•

Firstly, we demonstrate through numerical experiments that the previously proposed methods to generate on-manifold perturbations fail to improve the generalization of a trained classifier on real-world datasets. In particular, the test accuracy decreases with such training on CIFRAR-10/100. In this work, we postulate that this occurs due to approximate encoder-decoder mappings.
•

Motivated by this observation, we propose a data augmentation method based on exactly reversible normalizing flows. Namely, it first trains the generative model and then uses simplistic random or adversarial domain-agnostic semantic perturbations to train the classifier, defined in §4.
•

We demonstrate that our adversarial data augmentation method generates on-manifold and semantically meaningful data perturbations. Hence, we argue that our technique is a novel approach for generating perceptually meaningful (natural adversarial examples), different from previous proposals.
•

Finally, we empirically demonstrate that our on-manifold perturbations consistently outperform the standard training on CIFAR-10/100 using ResNet-18. Moreover, in a low-data regime, such training yields up to $25\%$ relative improvement from classical training, of which—as most effective—we find the adversarial perturbations that are adaptive to the classifier, see §5.

2 Related Work

Data augmentation techniques are routinely used to improve the generalization of classifiers [50, 31]. While most classic techniques require a priori expert knowledge of invariances in the dataset to generate virtual examples in the vicinity around each sample in the training data, many automated techniques have been proposed recently, such as linearly interpolating between images and their labels [66], replacing a part of the image with either a black-colored patch [9] or a part of another image [64].

In contrast to these data-agnostic procedures, a few recent works proposed to learn useful data augmentations policies, for instance by optimization [14, 45], reinforcement learning techniques [7, 8, 69], specifically trained augmentation networks [43, 56] or assisted by generative adversarial networks [44, 1, 68, 57], such as also in [39] that proposes neural style transfer for augmenting datasets.

Perturbations in Latent Space allow natural data augmentation with GANs. For instance, Antoniou et al. [1], Zhao et al. [70] propose to apply random perturbations in the latent space, and recently Manjunath et al. [38] used StyleGAN2 [24] to generate novel views of the image through latent space manipulation. However, a critical weakness in these techniques is that the mapping from the latent space to the training data space is typically not invertible, i.e., finding the representation of a data sample in the latent space (to start the search procedure) is a non-trivial task. For instance, Zhao et al. [70] propose to separately train an inverter for the inverse-mapping to the latent space. This critical bottleneck is omitted in our approach since we rely on an invertible architecture which renders the learning of an inverter superfluous.

Latent attacks, i.e., searching in the latent space to find virtual data samples that are misclassified, were proposed in [3, 51, 62, 67]. Volpi et al. [60] proposed an adaptive data augmentation method that appends adversarial examples at each iteration and note that generalization is improved across a range of a priori unknown target domains. Complementary, the connection of adversarial learning and generalization has also been studied in [55, 49, 22, 59, 15, 70]. Stutz et al. [53] clarify the relationship between robustness and generalization by showing in particular that regular adversarial examples leave the data manifold and that on-manifold adversarial training boosts generalization. These important insights endorse previous findings that data augmentation assisted by generative models—as we suggest here—can improve generalization [60].

Perceptual (or Natural) Adversarial examples are getting increasing interest in the community recently, as alternative to—from human perceptive—often hard to interpret standard adversarial threat models [70, 47, 61, 36, 32, 29, 13]. We argue that on-manifold perturbations, as obtained with our method or similar generative techniques, can implicitly learn such natural transformations and could be used as an alternative method to define and generate perceptually and semantically meaningful data augmentation. In contrast to Wong and Kolter [61] who propose to learn perturbation sets via a latent space of a conditional variational autoencoder using a set of predefined image-space transformations, in our approach, we are not restricted to a fixed transformation set as we utilize implicit transformations learned by the invertible mapping provided by normalizing flows.

3 Normalizing Flows and their Advantages for Semantic Perturbations

In this section, we first describe the fundamental concepts of normalizing flows. We then discuss how their ability to perform exact inference helps to apply perturbations in latent space.

3.1 Background: Normalizing Flows

Assume observations ${\bm{x}}\in\mathbb{R}^{d}$ sampled from an unknown data distribution $p_{\mathcal{X}}$ over $\mathcal{X}\subset\mathbb{R}^{d}$ , and a tractable prior probability distribution $p_{\mathcal{Z}}$ over $\mathcal{Z}\subset\mathbb{R}^{k}$ according to which we sample a latent variable ${\bm{z}}$ . Flow-based generative models seek to find an invertible, also called bijective function $\mathcal{F}:\mathcal{X}\rightarrow\mathcal{Z}$ such that:

\mathcal{F}({\bm{x}})={\bm{z}}\quad\text{and}\quad\mathcal{F}^{-1}({\bm{z}})={\bm{x}}\,,

(NF)

with ${\bm{z}}\in\mathcal{Z}$ and ${\bm{x}}\in\mathcal{X}$ . That is, $\mathcal{F}$ maps observations ${\bm{x}}$ to latent codes ${\bm{z}}$ , and $\mathcal{F}^{-1}$ maps latent codes ${\bm{z}}$ back to original observations ${\bm{x}}$ .

The key idea behind normalizing flows is to use change of variables, i.e., by using invertible transformation we keep track of the change in distribution. Thus, $p_{\mathcal{X}}$ induces $p_{\mathcal{Z}}$ through $\mathcal{F}$ and the opposite holds through $\mathcal{F}^{-1}$ . We have:

\displaystyle p_{\mathcal{X}}({\bm{x}})

\displaystyle=p_{\mathcal{Z}}(\mathcal{F}({\bm{x}}))\cdot\Big{|}\det\Big{(}\frac{\partial\mathcal{F}({\bm{x}})}{\partial{\bm{x}}^{\top}}\Big{)}\Big{|}\,,

where the determinant of the Jacobian $\frac{\partial\mathcal{F}({\bm{x}})}{\partial{\bm{x}}^{\top}}$ is used as volume correction. In practice, $\mathcal{F}$ is also differentiable and is parameterized with parameters ${\bm{\omega}}$ , we have finite samples ${\bm{x}}_{i}\sim p_{\mathcal{D}},1\leq i\leq N$ and training is done via maximum log-likelihood:

$\displaystyle{\bm{\omega}}^{\star}{=}\operatorname*{arg\,max}_{{\bm{\omega}}}\sum_{i=1}^{N}\log p_{\mathcal{Z}}(\mathcal{F}({\bm{x}}_{i}|{\bm{\omega}}))+\log\Big{|}\det\Big{(}\frac{\partial\mathcal{F}({\bm{x}}_{i}|{\bm{\omega}})}{\partial{\bm{x}}_{i}^{\top}}\Big{)}\Big{|}\,.$

Because computing the inverse and the determinant is computationally expensive for high-dimensional spaces, $\mathcal{F}$ is constrained to linear transformations that have some structure—often chosen to be triangular Jacobian matrices, which provide efficient computations in both directions.

To build an expressive but tractable $\mathcal{F}$ , we rely on the fact that differentiable functions are closed under composition, thus $\mathcal{F}=f_{\ell}\circ f_{\ell-1}\circ\cdots\circ f_{1}$ , $\ell{>}1$ , is also invertible. In the context of deep learning, this implies that we can stack $\ell$ layers of simple invertible mappings. However, as this still yields a single linear transformation, coupling layers [11] $f(\mathbf{x})={\bm{y}}$ , with $f:\mathbb{R}^{C}\rightarrow\mathbb{R}^{C}$ are inserted, which can be defined in several ways [10, 21]. In this work we use affine coupling transforms, which are empirically shown to perform particularly well for images, and which are used in the Glow model [27]:

$\displaystyle{\bm{y}}_{1:c}{=}{\bm{x}}_{1:c}\quad\text{and}\quad{\bm{y}}_{c+1:C}{=}{\bm{x}}_{c+1:C}\odot\exp(s({\bm{x}}_{1:c}))+t({\bm{x}}_{1:c})\,,$

where $\odot$ is the Hadamard product and, $s$ and $t$ are scaling and translations functions from $\mathbb{R}^{c}\rightarrow\mathbb{R}^{C-c}$ . Moreover, the Jacobian does not require any derivative over $s$ and $t$ , meaning that we can model these functions with arbitrary deep neural networks. To allow that each component can change, usually, $\mathcal{F}$ is composed so that coupling layers are placed in the middle of permutation layers that work in alternating patterns. The Glow model [27] uses an invertible 1 $\times$ 1 convolution layer that generalizes this permutation operation, see Appendix A.1.

3.2 Advantages of Normalizing Flows

Figure 1: Exactness of NF encoding-decoding. Here

\mathcal{F}

denotes the bijective NF, and

\mathcal{G}/\mathcal{G}^{-1}

encoder/decoder pair of inexact methods such as VAE or VAE-GAN which, due to inherent decoder noise, is only approximately bijective.

Figure 2: Data augmentation via perturbation in the latent space. Given a data sample

{\bm{x}}

, natural on-manifold data augmentations are generated by perturbing the encoded

{\bm{z}}=\mathcal{F}({\bm{x}})

in latent space, and decoding the perturbed

{\bm{z}}+\Delta_{\bm{z}}

. Adversarial perturbations require access to the loss function

\mathcal{L}

to either find samples that are misclassified, or most difficult for the current model parameters.

Most popular generative models for computer vision tasks are Variational Autoencoders [VAEs, 28] or Generative Adversarial Networks [GANs, 16].

GANs are widely used in deep learning mainly due to their impressive sample quality, as well as efficient sampling. Nonetheless, by construction, these methods do not provide an invertible mapping from an image ${\bm{x}}$ to its latent representation ${\bm{z}}$ , nor estimating its likelihood under the implicitly learned data distribution $p_{\mathcal{X}}({\bm{x}})$ , except with significant additional compromises [25]. Moreover, despite the notable progress, designing a stable two-player optimization method remains an active research area [6].

VAEs, on the other hand, seemingly resolve these two problems as this class of algorithms is both approximately invertible and notably easier to train. However, VAEs are trained via maximizing a bound on the marginal likelihood and provide only approximate evaluation of $p_{\mathcal{X}}({\bm{x}})$ . Moreover, due to their worse sample quality relative to GANs, researchers propose combining the two [33, 48]—making their performance highly sensitive to their hyperparameter tuning.

In contrast, normalizing flows: (i) perform exact encoding and decoding—due to their construction (see above, and also the illustration in Figure 1), (ii) are highly expressive, (iii) are efficient to sample from, as well as to evaluate $p_{\mathcal{X}}({\bm{x}})$ , (iv) are straightforward to train, and (v) they have useful latent representation—due to their immediate mapping from image to latent representation.

In summary, apart from the obvious benefit of fast encoding and decoding when performing latent-space perturbations, to guarantee that small latent-space perturbations will not modify the sample’s label, the most prominent characteristic of normalizing flows is their exact latent variable inference. After presenting our method for perturbations in latent space and our experimental results, we further discuss the advantages of normalizing flows in §6.

4 Perturbations in Latent Space

Refer to caption — Figure 3: Illustrative results of our latent-space perturbations. The models are trained on CIFAR-10. The first column depicts randomly selected samples from the test set. We depict the outputs obtained with Eq. R–LA and Eq. A–LA as well as their differences with the test samples. By observing the differences, we see that the added perturbations depend on the semantic content of the input image. See §5.4 for further discussion.

The invertibility of normalizing flows enables bidirectional transitions between image and latent spaces, see above §3. This, in turn, allows for applying perturbations directly in the latent space rather than image space. We recall that we denote by $\mathcal{F}:\mathcal{X}\rightarrow\mathcal{Z}$ a trained normalizing flow, mapping from data manifold $\mathcal{X}$ to latent space $\mathcal{Z}$ . Given a perturbation function $\mathcal{P}:\mathcal{Z}\rightarrow\mathcal{Z}$ , defined over the latent space, we define its counterpart in image space as $\mathcal{F}^{-1}(\mathcal{P}(\mathcal{F}({\bm{x}})))$ .

Our goal is to define latent perturbation function $\mathcal{P}(\cdot)$ such that we obtain identity-preserving semantic modifications over the original image ${\bm{x}}$ in the image domain. To this end, we limit the structure of possible $\mathcal{P}$ in two ways. Firstly, we directly consider incremental perturbations of the form ${\bm{z}}+\mathcal{P}({\bm{z}})$ . Secondly, we use an extra $\epsilon$ parameter to control the size of perturbation allowed (see illustration in Figure 2). More precisely, we have:

\mathcal{F}^{-1}\big{(}\mathcal{F}({\bm{x}})+\mathcal{P}(\mathcal{F}({\bm{x}}),\epsilon)\big{)}\,.

For brevity, we refer to $\mathcal{P}$ as latent attacks (LA), and we consider two variants described below.

4.1 Randomized Latent Attacks

At training time, given a datapoint ${\bm{x}}_{i}$ , with $1\leq i\leq N$ , using the trained normalizing flow we obtain its corresponding latent code ${\bm{z}}_{i}=\mathcal{F}({\bm{x}}_{i})$ .

Primarily, as perturbation function we consider a simplistic Gaussian noise in the latent space:

\displaystyle\mathcal{P}_{rand}(\cdot,\epsilon)=\epsilon\cdot\mathcal{N}(0,\mathbf{I})\,,

(R–LA)

which is independent from ${\bm{z}}_{i}$ . Any such distribution around the original ${\bm{z}}_{i}$ is equivalent to sampling from the learned manifold. In this case, the normalizing flow pushes forward this simple Gaussian distribution centered around ${\bm{z}}_{i}$ to a distribution on the image space around ${\bm{x}}_{i}=\mathcal{F}^{-1}({\bm{z}}_{i})$ . Thus, sampling from the simple prior distribution $\mathcal{N}(0,\mathbf{I})$ is equivalent to sampling from a complex conditional distribution around the original image over the data manifold.

We also define norm truncated versions as follows:

\mathcal{P}_{rand}^{\ell_{p}}(\cdot,\epsilon)=\Pi(\epsilon\cdot\mathcal{N}(0,\mathbf{I}))\,,

where $\ell_{p}$ denotes the selected norm, e.g., $\ell_{2}$ or $\ell_{\infty}$ . For $\ell_{2}$ norm, $\Pi$ is defined as $\ell_{2}$ norm scaling, and for $\ell_{\infty}$ , $\Pi$ is the component-wise clipping operation defined below:

(\Pi({\bm{x}}))_{i}:=\max(-\epsilon,\min(+\epsilon,{\bm{x}}_{i}))\,.

4.2 Adversarial Latent Attacks

Analogous to the above randomized latent attacks, at train time, given a datapoint ${\bm{x}}_{i}$ and it’s associated label $l_{i}$ , with $1\leq i\leq N$ , using the trained normalizing flow we obtain its corresponding latent code ${\bm{z}}_{i}=\mathcal{F}({\bm{x}}_{i})$ .

We search for $\Delta_{{\bm{z}}_{i}}\in\mathcal{Z}$ such that the loss obtained of the generated image $\tilde{\bm{x}}_{i}=\mathcal{F}^{-1}({\bm{z}}_{i}+\Delta_{{\bm{z}}_{i}})$ is maximal:

	$\displaystyle\Delta_{{\bm{z}}_{i}}^{\star}$	$\displaystyle=\operatorname*{arg\,max}_{\\|\Delta_{{\bm{z}}_{i}}\\|_{l_{p}}\leq\epsilon}\mathcal{L}_{\theta}(\mathcal{F}^{-1}({\bm{z}}_{i}+\Delta_{{\bm{z}}_{i}}),l_{i})\,,$
	$\displaystyle\mathcal{P}_{adv}^{\ell_{p}}({\bm{z}}_{i},\epsilon)$	$\displaystyle=\Delta_{{\bm{z}}_{i}}^{\star}\,,$		(A–LA)

where $\mathcal{L}_{\theta}$ is the loss function of the classifier, and $\ell_{p}$ denotes the selected norm, e.g., $\ell_{2}$ or $\ell_{\infty}$ .

In practice, we define the number of steps $k$ to optimize for $\Delta_{{\bm{z}}_{i}}^{\star}\in\mathcal{Z}$ , as well as the step size $\alpha$ [similar to 53, 61], and we have the following procedure:

•

Initialize a random $\Delta_{{\bm{z}}_{i}}^{0}$ with $\|\Delta_{{\bm{z}}_{i}}^{0}\|_{\ell_{p}}\leq\epsilon$ .

•

Iteratively update $\Delta_{{\bm{z}}_{i}}^{j}$ for $j=1,\dots,k$ number of steps with step size $\alpha$ as follows:

\Delta_{{\bm{z}}_{i}}^{j}=\Pi\Big{(}\Delta_{{\bm{z}}_{i}}^{j-1}+\alpha\cdot\frac{\nabla\mathcal{L}_{\theta}(\mathcal{F}^{-1}({\bm{z}}_{i}+\Delta_{{\bm{z}}_{i}}^{j-1}),l_{i})}{\|\nabla\mathcal{L}_{\theta}(\mathcal{F}^{-1}({\bm{z}}_{i}+\Delta_{{\bm{z}}_{i}}^{j-1}),l_{i})\|_{\ell_{p}}}\Big{)}

where $\Pi$ is the projection operator that ensures condition $\|\Delta_{{\bm{z}}_{i}}^{j}\|_{\ell_{p}}\leq\epsilon$ and gradient is with respect to $\Delta_{{\bm{z}}_{i}}^{j-1}$ .

•

Output $\mathcal{P}_{adv}({\bm{z}}_{i},\epsilon)=\Delta_{{\bm{z}}_{i}}^{k}$

For the case of $\ell_{\infty}$ , we replace normalization of gradient with $sign(\cdot)$ operator, i.e.:

\Delta_{{\bm{z}}_{i}}^{j}=\Pi\Big{(}\Delta_{{\bm{z}}_{i}}^{j-1}+\alpha\cdot sign\big{(}\nabla\mathcal{L}_{\theta}(\mathcal{F}^{-1}({\bm{z}}_{i}+\Delta_{{\bm{z}}_{i}}^{j-1}),l_{i})\big{)}\Big{)}

and use component-wise clipping for projection, which is equivalent to the standard $\ell_{\infty}$ -PGD adversarial attack of Madry et al. [37].

Similarly, as the normalizing flow directly models the underlying data manifold, this perturbation is equivalent to a search over the on-manifold adversarial samples [53].

5 Experiments

Datasets.

We evaluate our proposed semantic perturbations on the FashionMNIST, SVHN, CIFAR-10, and CIFAR-100 datasets. See Appendix B for additional results on MNIST. For experiments on restricted datasets, e.g., 5% of CIFAR-10, we always use the same sample set for a fair comparison.

Models.

For FashionMNIST, we use a conditional $12$ -step normalizing flow based on Glow coupling blocks and a convolutional network of approximately $100K$ parameters, as in [53]. For experiments on SVHN and CIFAR-10/100, we use Glow [27] and ResNet-18 [17], respectively. See Appendix A for further details on the implementation.

Metrics.

To evaluate the classifier’s generalization, we use standard test accuracy. Adopting from the literature on GANs, we use Fréchet Inception Distance [FID, lower is better, see Appendix A.3, 20] to measure the similarity between the CIFAR-10 training data and samples produced by our latent perturbations.

Methods.

We compare the following methods: (i) standard–classical training with no attacks, (ii) Image-space PGD: Projected Gradient Descent as an image-space, adversarial perturbation baseline [37], (iii) VAE-GAN[53]–on-manifold perturbation method that uses VAE-GANs, (iv) Cutout[9]–input masking, (v) Mixup[66]–data-agnostic data augmentation routine, (vi) Randomized-LA (ours)–randomized latent attacks using normalizing flow, as well as (vii) Adversarial-LA (ours)–adversarial latent attacks using normalizing flow, where ours are described in §4 and the rest of the methods in §2. For brevity, PGD, Randomized-LA and Adversarial-LA are sometimes denoted with $\mathcal{P}_{pgd}$ , $\mathcal{P}_{rand}$ and $\mathcal{P}_{adv}$ , respectively.

5.1 Generalization on CIFAR-10

Method	Low-data	Full-set
Standard (no DA)	$49.8$	$89.7$
Standard $+$ common DA	$64.1$	$95.2$
VAE-GAN [53]	$58.9$	$94.2$
Cutout [9]	$66.8$	$96.0$
Mixup [66]	$73.4$	$95.9$
Randomized-LA	$70.1$	$96.3$
Adversarial-LA	$\mathbf{80.4}$	$\mathbf{96.6}$

Table 1: Test accuracy (

\%

) on CIFAR-10, in the low-data regime compared to the full train set. For the former, we use

5\%

and

100\%

of the training and test set, respectively. In addition to standard training, we consider standard training with commonly used data augmentations (DA) in the image space, which includes rotation and horizontal flips [65], as well as more recent Cutout [9] and Mixup [66] methods. See §5.1 for a discussion.

We are primarily interested in the performance of our perturbations in the low-data regime when using only a small subset of CIFAR-10 as the training set. We train ResNet-18 classifiers on only $5\%$ of the full training set and evaluate models on the full test set. We compare our methods with some of the most commonly used data augmentations methods such as Cutout [9] and Mixup [66], as well as with the VAE-GAN based approach [53]. For [53], we use the authors’ implementation, and their default parameters for CelebA dataset, see Appendix A for details. For [9], we report the best test accuracy observed among a grid search on the learning rate $\eta\in\{0.1,0.01\}$ . Similarly, for [66], we report the best accuracy among grid search on learning rate $\eta\in\{0.1,0.01\}$ and mixup coefficient $\lambda\in\{.1,.2,.3,.4,1.0\}$ . For Randomized-LA, we use $\ell=\ell_{\infty},\epsilon=0.25$ , and for Adversarial-LA, we use $\ell=\ell_{2},\epsilon=1.0,\alpha=0.5,k=3$ .

Table 1 summarizes our generalization experiments in the low data regime—using only $5\%$ of CIFAR-10 for training, compared to the full CIFAR-10 training set. Figure 4 depicts the train and test accuracy throughout the training. Both Randomized-LA and Adversarial-LA notably outperform the standard training baseline. In particular, we observe that (i) our simplistic Randomized-LA method already outperforms some recent strong data augmentation methods, and (ii) Adversarial-LA achieves best test accuracy for both low-data and full-set regimes. See §5.3 below for additional benchmarks with VAE-GAN [53].

5.2 Transfer Learning Experiments

To further analyze potential applications of our normalizing flow based latent attacks to real-world use cases, we study if a normalizing flow pre-trained on a large dataset can be used for training classifiers on a different, smaller dataset. In particular, we use CIFAR-10 to train the normalizing flows and then our latent attacks to train a classifier on $10\%$ and $5\%$ of the CIFAR-100 and SVHN training datasets, respectively.

Table 2 shows our results for CIFAR-100 using a selection of latent attacks. Randomized-LA and Adversarial-LA achieve $16\%$ and $24\%$ improvements over the standard baseline. The results indicate that normalizing flows are capable of transferring useful augmentations learned from CIFAR-10 to CIFAR-100.

Table 3 shows our results for SVHN. To provide a baseline on the effect of using different datasets for normalizing flows and classifiers, we also provide results with pre-training on SVHN. Latent attacks transferred from CIFAR-10 achieve superior performance to direct pre-training on SVHN, indicating that transferring augmentations across datasets is indeed a promising direction.

Perturbation	Accuracy
Standard	$36.4$
Randomized-LA, $\ell{=}\ell_{\infty},\epsilon{=}.2$	$39.7$
Randomized-LA, $\ell{=}\ell_{\infty},\epsilon=.3$	$41.0$
Randomized-LA, $\ell{=}\ell_{2},\epsilon=10$	$40.4$
Randomized-LA, $\ell{=}\ell_{2},\epsilon=20$	$\mathbf{42.3}$
Adversarial-LA, $\ell{=}\ell_{2},\alpha{=}.5,k{=}3$	$\mathbf{45.0}$

Table 2: Test accuracy (

\%

) on CIFAR-100, in the low-data regime, where we use

10\%

of the training set and the full test set. The normalizing flow used is trained on CIFAR-10.

NF	Perturbation	Accuracy
–	Standard	$81.2$
SVHN	$\mathcal{P}_{rand}^{\ell_{2}}$ , $\epsilon{=}15.$	$84.9$
SVHN	$\mathcal{P}_{adv}^{\ell_{2}}$ , $\epsilon{=}.5,\alpha{=}.25,k{=}2$	$86.9$
CIFAR-10	$\mathcal{P}_{rand}^{\ell_{2}}$ , $\epsilon{=}15.$	$\mathbf{90.0}$
CIFAR-10	$\mathcal{P}_{adv}^{\ell_{2}}$ , $\epsilon{=}.3,\alpha{=}.15,k{=}2$	$\mathbf{90.5}$

Table 3: Test accuracy (

\%

) on SVHN, in the low-data regime, where we use

5\%

of the training set and the full test set. Comparison of normalizing flows trained on CIFAR-10, versus SVHN.

5.3 Additional Comparison with VAE-GAN

Following Stutz et al. [53], we study the performance of our latent perturbation-based training strategies in varying settings, starting from low-data regime to full-set. For the VAE-GAN results, we use the source code provided by the authors, while using their default hyperparameters for the same dataset. For our methods, we reproduce the same classifier and hyperparameter setup. For Randomized-LA, we use $\ell=\ell_{\infty},\epsilon=0.15$ , and for Adversarial-LA, we use $\ell=\ell_{\infty},\epsilon=0.05,\alpha=0.01,k=10$ .

Figure 5 shows our average results for $3$ runs with training sizes in $\{600,2400,6000,50000\}$ . We observe that Randomized-LA performs comparatively to the standard training baseline, whereas Adversarial-LA outperforms the standard baseline across all train set sizes. Note that the difference to the standard baselines shrinks as we increase the number of samples available to the classifiers.

In line with our results, Stutz et al. [53] report diminishing performance gains for increasingly challenging datasets such as FashionMNIST to CelebA, when using therein VAE-GAN based approach. One potential cause could be the approximate encoding and decoding mappings or sensitivity to hyperparameter tuning. Relative to VAE-GAN, normalizing flows have significantly fewer hyperparameters, see Appendix A.2. Indeed, our results support the numerous appealing advantages of normalizing flows for latent-space perturbations and indicate that they have a better capacity to produce useful augmented training samples.

5.4 Analysis of Generated Images

Model or Perturbation	FID
Baseline: GANs
DCGAN [20]	36.9
WGAN-GP [20]	24.8
BigGAN [5]	14.73
StyleGAN [23]	2.92
\hdashline
Baseline: image-space
PGD [37], $\ell{=}\ell_{\infty},\epsilon{=}.03,\alpha{=}.008,k{=}10$	23.61
Ours: latent-space
Randomized-LA, $\ell{=}\ell_{\infty},\epsilon{=}.25$	3.71
Adversarial-LA, $\ell{=}\ell_{2},\epsilon{=}1.,\alpha{=}.5,k{=}3$	3.65

Table 4: FID scores (lower is better) of generated samples of GANs, image-space PGD perturbations, and our Randomized-LA and Adversarial-LA methods. For PGD and Adversarial-LA perturbations which use a classifier, we use the same standardly trained ResNet-18. See §5.4.

Figure 3 depicts samples of our Randomized-LA and Adversarial-LA methods. Primarily, in contrast to random image-space perturbations, we observe that both Randomized-LA and Adversarial-LA yield perturbations dependent on the semantic content of the input image. Interestingly, one could argue that Adversarial-LA further masks potential shortcuts that the classifier may learn, for e.g., by masking the windows, it forces the classifier to, in fact, learn the shape of a car. Moreover, we observe that relative to image-space perturbations, latent attacks produce samples that are semantically closer to the CIFAR-10 training set—see Table 4 for FID scores, and at the same time, more distinct in image space—see Table 5.

Perturbation	$\ell_{2}$ in $\mathcal{X}$	$\ell_{\infty}$ in $\mathcal{X}$
Baseline: image-space
$\mathcal{P}_{pgd}^{\ell_{\infty}}$ , $\epsilon=.03,\alpha=.008,k=10$	$1.13$	$0.03$
$\mathcal{P}_{pgd}^{\ell_{2}}$ , $\epsilon=2.,\alpha=.5,k=10$	$1.98$	$0.15$
Ours: latent-space
$\mathcal{P}_{rand}^{\ell_{\infty}}$ , $\epsilon=.25$	4.18	0.41
$\mathcal{P}_{adv}^{\ell_{2}}$ , $\epsilon=1.,\alpha=.5,k=3$	$4.61$	$0.44$

Table 5: Average

\ell_{2}

and

\ell_{\infty}

size of perturbations computed in image space

\mathcal{X}

using CIFAR-10 test samples. For PGD and Adversarial-LA perturbations which use a classifier, we use the same standardly trained ResNet-18.

5.5 Robustness against Latent Attacks

Attack	Trained Perturbation	Acc.	Drop
$\mathcal{P}_{rand}^{\ell_{\infty}}$	Standard	90.5	4.8
	$\mathcal{P}_{pgd}^{\ell_{\infty}}$ , $\epsilon=.03,\alpha=.008,k=10$	76.9	9.4
	$\mathcal{P}_{rand}^{\ell_{\infty}}$ , $\epsilon=.25$	94.1	2.1
	$\mathcal{P}_{adv}^{\ell_{2}}$ , $\epsilon=1.,\alpha=.5,k=3$	94.6	2.0
$\mathcal{P}_{adv}^{\ell_{2}}$	Standard	58.8	38.2
	$\mathcal{P}_{pgd}^{\ell_{\infty}}$ , $\epsilon=.03,\alpha=.008,k=10$	36.2	57.3
	$\mathcal{P}_{rand}^{\ell_{\infty}}$ , $\epsilon=.25$	71.2	25.9
	$\mathcal{P}_{adv}^{\ell_{2}}$ , $\epsilon=1.,\alpha=.5,k=3$	76.4	20.8

Table 6: Robustness against our perturbations

{\mathcal{P}_{rand}^{\ell_{\infty}},\epsilon=.25}

(top) and

{P_{adv}^{\ell_{2}},\epsilon=1.,\alpha=.5,k=3}

(bottom) on the CIFAR-10 dataset. Trained Perturbation: the training-time perturbation used to train the model; Drop the drop in test accuracy with latent perturbations relative to the accuracy on CIFAR-10 test samples.

In Table 6, we evaluate the robustness of classifiers against our latent attacks and observe that both standard and image-space adversarial training suffer from a significant loss of performance against Adversarial-LA. Combined with observations from §5.4, this indicates that our adversarial latent attack is a novel approach to generate realistic adversarial samples. Interestingly, classifiers trained with image-space adversarial perturbations are more prone to large accuracy drops than standardly trained classifiers. Additionally, although the classifiers trained with our perturbations are robust to Randomized-LA, they are not fully robust to Adversarial-LA, suggesting the possibility of further improving generalization using latent attacks.

6 Discussion

Exact Coding.

As formalized in §3, normalizing flows can perform exact encoding and decoding by their construction. That is, the decoding operation is exactly the reverse of the encoding operation. Any continuous encoder maps a neighborhood of a sample to some neighborhood of its latent representation. However, the invertibility of normalizing flows also maps any neighborhood of latent code to a neighborhood of the original sample. In principle, this property also holds for off-manifold samples and may explain the effectiveness of our methods in transferring augmentations.

Increasing Dataset Size.

The primary advantage of exact coding is that the generated samples via latent perturbations improve the generalization of classifiers, as shown in §5.1. To understand why this occurs, consider the limit case $\epsilon\to 0$ for a latent perturbation. Assuming a numerically stable normalizing flow, we recover the original data samples, hence the training distribution. As we increase $\epsilon$ , this distribution grows around each data point. Thus, by increasing $\epsilon$ , we add further plausible data points to our training set, as long the learned latent representation is a good approximation of the underlying data manifold. This does not necessarily hold for approximate methods due to inherent decoder noise.

Controllability.

In §4, we introduced two variants of latent perturbations that define different procedures around the latent code of the original sample. Each variant employs a normalizing flow to efficiently map a complex on-manifold objective to a local objective in the latent space. The randomized latent attack defines a sampling operation on the data manifold, and the adversarial latent attack, a stochastic search procedure to find on-manifold samples attaining high classifier losses. In principle, any other on-manifold objectives may also utilize such mappings to the latent space and, potentially, use the density provided by the normalizing flow to enforce strict checks for on-manifold data points. Moreover, conditional normalizing flows may achieve more expressive, class-specific augmentations and control mechanisms.

Compatibility with Data Augmentations.

It is important to note that our method is orthogonal to image-space data augmentation methods. In other words, we can train normalizing flows with commonly used data augmentations. As observed in Figure 3, trained models can apply some of the training-time augmentations to CIFAR-10 test samples. This allows us to encode and decode augmented samples as well as original samples of CIFAR-10. Additionally, we can use DeVries and Taylor [9], Zhang et al. [66] concurrently with our latent perturbations to train classifiers.

7 Conclusion

Motivated by the numerous advantages of normalizing flows, we propose flow-based latent perturbation methods to augment the training datasets to train classifiers. Our extensive empirical results on several real-world datasets demonstrate the efficacy of these perturbations for improving generalization both in full and low-data regimes. In particular, these perturbations can increase sample efficiency in low-data regimes and, in practice, reduce labeling efforts.

Further directions include (i) decoupling the effects of exact coding from any modeling gains through ablation studies, as well as (ii) combining image and latent-space augmentations.

Acknowledgments

TC was funded in part by the grant P2ELP2_199740 from the Swiss National Science Foundation. The authors would like to thank Maksym Andriushchenko and Suzan Üsküdarlı for insightful feedback and discussions.

References

Antoniou et al. [2017] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
Ardizzone et al. [2019] Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392, 2019.
Baluja and Fischer [2017] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
Biggio and Roli [2018] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
Chavdarova et al. [2021] Tatjana Chavdarova, Matteo Pagliardini, Sebastian U Stich, François Fleuret, and Martin Jaggi. Taming GANs with Lookahead-Minmax. In International Conference on Learning Representations, 2021.
Cubuk et al. [2019] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 113–123, 2019.
Cubuk et al. [2020] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
DeVries and Taylor [2017] Terrance DeVries and Graham W. Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552, 2017. arXiv: 1708.04552.
Dinh et al. [2015] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations, ICLR, 2015.
Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, ICLR, 2017.
Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
Dolatabadi et al. [2020] Hadi M Dolatabadi, Sarah Erfani, and Christopher Leckie. Advflow: Inconspicuous black-box adversarial attacks using normalizing flows. arXiv preprint arXiv:2007.07435, 2020.
Fawzi et al. [2016] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard. Adaptive data augmentation for image classification. In IEEE International Conference on Image Processing (ICIP), 2016.
Gilmer et al. [2018] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), volume 27, pages 2672–2680, 2014.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pages 6626–6637, 2017.
Ho et al. [2019] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In Proceedings of the 36th International Conference on Machine Learning, pages 2722–2730, 2019.
Jalal et al. [2017] Ajil Jalal, Andrew Ilyas, Constantinos Daskalakis, and Alexandros G. Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.
Karras et al. [2020] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8110–8119, 2020.
Kilcher et al. [2017] Yannic Kilcher, Aurélien Lucchi, and Thomas Hofmann. Generator reversal, 2017.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pages 10215–10224, 2018.
Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
Kireev et al. [2021] Klim Kireev, Maksym Andriushchenko, and Nicolas Flammarion. On the effectiveness of adversarial training against common corruptions. arXiv preprint arXiv:2103.02325, 2021.
Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. page 60, 2009.
Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
Laidlaw et al. [2021] Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. Perceptual adversarial robustness: Defense against unseen threat models. In International Conference on Learning Representations (ICLR), 2021.
Larsen et al. [2016] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998. Conference Name: Proceedings of the IEEE.
Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
Luo et al. [2020] Calvin Luo, Hossein Mobahi, and Samy Bengio. Data augmentation via structured adversarial perturbations. arXiv preprint arXiv:2011.03010, 2020.
Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
Manjunath et al. [2020] Shashank Manjunath, Aitzaz Nathaniel, Jeff Druce, and Stan German. Improving the performance of fine-grain image classifiers via generative data augmentation. arXiv preprint arXiv:2008.05381, 2020.
Mikołajczyk and Grochowski [2019] Agnieszka Mikołajczyk and Michał Grochowski. Style transfer-based image synthesis as an efficient regularization technique in deep learning. In 2019 24th International Conference on Methods and Models in Automation and Robotics (MMAR), pages 42–47. IEEE, 2019.
Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
Nesterov [1983] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Peng et al. [2018] Xi Peng, Zhiqiang Tang, Fei Yang, Rogerio S. Feris, and Dimitris Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Perez and Wang [2017] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
Ratner et al. [2017] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), volume 30. Curran Associates, Inc., 2017.
Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
Robey et al. [2020] Alexander Robey, Hamed Hassani, and George J. Pappas. Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247, 2020.
Rosca et al. [2017] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks, 2017.
Rozsa et al. [2016] Andras Rozsa, Manuel Günther, and Terrance E. Boult. Are accuracy and robustness correlated. In 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 227–232. IEEE Computer Society, 2016.
Simard et al. [1998] Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. Transformation Invariance in Pattern Recognition — Tangent Distance and Tangent Propagation, pages 239–274. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
Song et al. [2018] Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems (NeurIPS), volume 31. Curran Associates, Inc., 2018.
Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research (JMLR), 15(1):1929–1958, 2014.
Stutz et al. [2019] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6976–6987, 2019.
Su et al. [2018] Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? - A comprehensive study on the robustness of 18 deep image classification models. In ECCV, 2018.
Tanay and Griffin [2016] Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
Tang et al. [2020] Zhiqiang Tang, Yunhe Gao, Leonid Karlinsky, Prasanna Sattigeri, Rogerio Feris, and Dimitris Metaxas. Onlineaugment: Online data augmentation with less domain knowledge. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 313–329. Springer, 2020.
Tran et al. [2020] Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. On data augmentation for GAN training. arXiv preprint arXiv:2006.05338, 2020.
Tsipras et al. [2019] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2019.
Tsipras et al. [2019] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), 2019.
Volpi et al. [2018] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C. Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in neural information processing systems (NeurIPS), pages 5334–5344, 2018.
Wong and Kolter [2021] Eric Wong and J Zico Kolter. Learning perturbation sets for robust machine learning. In International Conference on Learning Representations (ICLR), 2021.
Xiao et al. [2018] Chaowei Xiao, Bo Li, Jun yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pages 3905–3911, 7 2018.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
Yun et al. [2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.
Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Procedings of the British Machine Vision Conference 2016, pages 87.1–87.12, York, UK, 2016. British Machine Vision Association.
Zhang et al. [2018] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations (ICLR), 2018.
Zhang et al. [2020] Linfeng Zhang, Muzhou Yu, Tong Chen, Zuoqiang Shi, Chenglong Bao, and Kaisheng Ma. Auxiliary training: Towards accurate and robust models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Zhang et al. [2018] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 31. Curran Associates, Inc., 2018.
Zhang et al. [2020] Xinyu Zhang, Qiang Wang, Jian Zhang, and Zhao Zhong. Adversarial autoaugment. In International Conference on Learning Representations (ICLR), 2020.
Zhao et al. [2018] Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations (ICLR), 2018.

Appendix A Details on the implementation

In this section, we list all the details of the implementation.

Source Code.

Our source code is provided in this repository: https://github.com/okyksl/flow-lp.

A.1 Architectures

Generative model (NF) architecture.

We use Glow [27] for the normalizing flow architecture. For the MNIST [34] and FashionMNIST [63] experiments, we use a conditional, 12-step, Glow-coupling-based architecture similar to [2]. See Table 7 for the details. For the CIFAR-10/100 [30] and SVHN [42] experiments, we use the original Glow architecture described in [27], i.e., 3 scales of 32 steps each containing activation normalization, affine coupling and invertible 1 $\times$ 1 convolution. We adapt an existing PyTorch implementation in ¹¹1https://github.com/chrischute/glow to better match the original Tensorflow implementation in ²²2https://github.com/openai/glow. For more details on multi-scale architecture in normalizing flows, see [12].

Generative Model

Input:

{\bm{x}}\in\mathbb{R}^{784},{\bm{y}}\in\mathbb{R}^{10}

\hdashlineGLOWCouplingBlock

PermuteRandom

\hdashline

\cdots

\times 10

\cdots

\hdashlineGLOWCouplingBlock

PermuteRandom

GLOWCouplingBlock

Input:

{\bm{x}}\in\mathbb{R}^{784},{\bm{y}}\in\mathbb{R}^{10}

\hdashlinesplit

{\bm{x}}\rightarrow{\bm{x}}_{1},{\bm{x}}_{2}

(

784\rightarrow{392},{392}

)

\hdashlinesubnet

{\bm{x}}_{2}\oplus{\bm{y}}\rightarrow\mathbf{s}_{1},\mathbf{t}_{1}

(

402\rightarrow 392,392

)

affine coupling

{\bm{x}}_{1},\mathbf{s}_{1},\mathbf{t}_{1}\rightarrow{\bm{z}}_{1}

(

3{\times}392\rightarrow 392

)

\hdashlinesubnet

{\bm{z}}_{1}\oplus{\bm{y}}\rightarrow\mathbf{s}_{2},\mathbf{t}_{2}

(

402\rightarrow 392,392

)

affine coupling

{\bm{x}}_{2},\mathbf{s}_{2},\mathbf{t}_{2}\rightarrow{\bm{x}}_{2}^{{}^{\prime}}

(

3{\times}392\rightarrow 392

)

\hdashlineconcat.

{\bm{z}}_{1}\oplus{\bm{z}}_{2}

(

392,392\rightarrow 784

)

Subnets
Input: ${\bm{x}}\in\mathbb{R}^{402}$
\hdashlinelinear ( $402\rightarrow 512$ )
ReLU
linear ( $512\rightarrow 784$ )
split ( $784\rightarrow 392,392$ )

Table 7: Normalizing flow architectures used for our experiments on MNIST and FashionMNIST. With

c_{in}\rightarrow y_{out}

, we denote the number of channels of the input and output of the layer. With

\oplus

, we denote concatenation operation. We use the implementation provided in https://github.com/VLL-HD/FrEIA. For more details on affine coupling layers, see §3.

Classifier architecture.

For our experiments on MNIST, we use LeNet-5 [34] with replaced nonlinearity–instead of $\tanh$ we use ReLU, and we initialize the network parameters with truncated normal distribution $\sigma=0.1$ . For the FashionMNIST experiments, we use the same classifier as used in [53]. See Table 8 for more details. For CIFAR-10/100 and SVHN, we use the ResNet-18 [19] architecture as implemented in [9, 66]. This ResNet-18 includes slight modifications over the standard ResNet-18 architecture in order to achieve better performance on CIFAR-10/100. See ³³3https://github.com/facebookresearch/mixup-cifar10 and ⁴⁴4https://github.com/uoguelph-mlrg/Cutout for implementation. In particular, the first layer is changed to a $3\times 3$ convolution with stride ${1}$ and padding ${1}$ , from the original $7\times 7$ convolution with stride $2$ and padding $3$ . Additionally, the following max-pooling layer is removed. For CIFAR-10, we also use a similarly modified ResNet-20 [18].

LeNet-5
Input: ${\bm{x}}\rightarrow\mathbb{R}^{1{\times}28{\times}28}$
\hdashlineconvolution (ker: $5{\times}5$ , $1\rightarrow 6$ ; stride: $1$ ; pad: $2$ )
ReLU
AvgPool2d (ker: $2{\times}2$ )
convolution (ker: $5{\times}5$ , $6\rightarrow 16$ ; stride: $1$ ; pad: $0$ )
ReLU
AvgPool2d (ker: $2{\times}2$ )
Flatten ( $16{\times}5{\times}5\rightarrow 400$ )
linear ( $400\to 120$ )
ReLU
linear ( $120\to 84$ )
ReLU
linear ( $120\to 10$ )
ReLU

CNN from [53]
Input: ${\bm{x}}\in\mathbb{R}^{1{\times}28{\times}28}$
\hdashlineconvolution (ker: $4{\times}4$ , $1\rightarrow 16$ ; stride: $2$ ; pad: $1$ )
Batch Normalization
ReLU
convolution (ker: $4{\times}4$ , $16\rightarrow 32$ ; stride: $2$ ; pad: $1$ )
Batch Normalization
ReLU
convolution (ker: $4{\times}4$ , $32\rightarrow 64$ ; stride: $2$ ; pad: $1$ )
Batch Normalization
ReLU
Flatten ( $64{\times}3{\times}3\rightarrow 576$ )
linear ( $576\to 100$ )
linear ( $100\to 10$ )

Table 8: Convolutional Neural Network (CNN) architectures used for our experiments on MNIST and FashionMNIST. We use ker and pad to denote kernel and padding for the convolution layers, respectively. With

h{\times}w

, we denote the kernel size. With

c_{in}\rightarrow y_{out}

, we denote the number of channels of the input and output of the layer.

A.2 Hyperparameters

Generative Models.

For MNIST and FashionMNIST, we use the Adam [26] optimizer with a batch size of $100$ and learning rate of $10^{-6}$ for $100$ epochs to train normalizing flows. For CIFAR-10 and SVHN, we use the Adamax [26] optimizer with a learning rate of $0.0005$ and weight decay of $0.00005$ . We use a warmup learning rate schedule for the first $500.000$ steps of the training. That is, the learning rate is linearly increased from $0$ to the base learning rate $0.0005$ in $500.000$ steps.

For VAE-GAN training, we run the implementation provided by authors⁵⁵5https://github.com/davidstutz/disentangling-robustness-generalization with the default architectures and parameters. That is, for FashionMNIST, we use $\beta=2.75$ , $\gamma=1$ , $\eta=0$ and latent space size of $10$ . We use the Adam optimizer with a batch size of $100$ , learning rate of $0.005$ , weight decay of $0.0001$ and train VAE-GANs for $60$ epochs with an exponential decay scheduling of $0.9$ for the learning rate. For CIFAR-10, we use the CelebA [35] setup provided (the only 3-channel color dataset provided) and thus use $\beta=3.0$ , latent space size of $25$ and 30 epochs instead. Note that we report On-Learned-Manifold Adversarial Training from [53] which uses class-specific VAE-GANs. That is, 10 VAE-GAN architectures are trained for both FashionMNIST and CIFAR-10 datasets.

Discussion on Hyperparameters of Generative Models.

As normalizing flows directly optimize the log-likelihood of the data, there are no hyperparameters in their loss function. Additionally, the normalizing flows that we use have a fixed latent dimension equal to the input dimension due to their architectural design. As noted in §5.3, this is in contrast to VAE-GAN used in [53] where the training involves optimizing separate losses for three networks (namely, encoder, decoder, and discriminator) concurrently. Coefficients called $\beta$ , $\gamma$ , and $\eta$ are used to scale reconstruction, decoder, and discriminator loss, respectively. Additionally, the latent size for VAE-GAN is hand-picked for each dataset.

Classifiers.

For MNIST, we use the Adam optimizer with a learning rate of $0.001$ and weight decay of $0.001$ . We train LeNet-5 classifiers for $20$ epochs with exponential learning decay of rate $0.1$ for $10.000$ steps. For FashionMNIST, we use the training setup used in [53]. That is, we use the Adam optimizer with a learning rate of $0.01$ and weight decay of $0.0001$ . We train classifiers for $20$ epochs with exponential learning decay of rate $0.9$ for $500$ steps. For CIFAR-10/100, we use the training setup used in [9, 66]. More precisely, we use Stochastic Gradient Descent (SGD) [46] with a batch size of $128$ , learning rate of $0.1$ , weight decay of $0.0005$ , and Nesterov momentum [41] of $0.9$ . We train ResNet-18 and ResNet-20 classifiers for $200$ epochs and multiply the learning rate by $0.2$ at epochs $\{60,120,160\}$ . For SVHN, we use the same optimizer with a weight decay of $0.0001$ . We train ResNet-18 classifiers for $120$ epochs and multiply the learning rate by $0.1$ at epochs $\{30,60,90\}$ .

Data Augmentation.

For CIFAR-10/100, we use standard data augmentation akin to [65]. That is, we zero-pad images with $4$ pixels on each side, take a random crop of size $32\times 32$ , and then mirror the resulting image horizontally with ${50}\%$ probability. We use such data augmentation for both training the generative and the classifier models. Hence, our normalizing flows are capable of encoding-decoding operations on augmented samples as well. Advanced data augmentation baselines we use in Table 1 [9, 66], also include the same standard data augmentations. However, the VAE-GAN based approach [53] does not use data augmentation in their generative model. To provide a more direct comparison between the performance of two generative models, in §B.2 we conduct an additional study without any data augmentations.

A.3 Metrics

Fréchet Inception Distance.

FID [20] aims at comparing the synthetic samples $x\sim p_{g}$ —where $p_{g}$ denotes the distribution of the samples of the given generative model, with those of the training data of $x\sim p_{d}$ in a feature space. The samples are embedded using the first several layers of the Inception network. Assuming $p_{g}$ and $p_{d}$ are multivariate normal distributions, it then estimates the means ${\bm{m}}_{g}$ and ${\bm{m}}_{d}$ and covariances $C_{g}$ and $C_{d}$ , respectively for $p_{g}$ and $p_{d}$ in that feature space. Finally, FID is computed as:

\displaystyle\mathbb{D}_{\text{FID}}(p_{d},p_{g})

\displaystyle\approx d^{2}(({\bm{m}}_{d},C_{d}),({\bm{m}}_{g},C_{g}))=\|{\bm{m}}_{d}-{\bm{m}}_{g}\|_{2}^{2}+Tr(C_{d}+C_{g}-2(C_{d}C_{g})^{\frac{1}{2}}),

(FID)

where $d^{2}$ denotes the Fréchet Distance. Note that as this metric is a distance, the lower it is, the better the performance. We used the implementation of FID⁶⁶6https://github.com/mseitzer/pytorch-fid in PyTorch.

Appendix B Additional Results

B.1 Results on MNIST

Table 9 summarizes our results on MNIST in full data regime. Although the baseline has a very good performance on this dataset, we observe improved generalization.

Perturbation	Train Accuracy	Train Loss	Test Accuracy	Test Loss
Standard	$99.80$	$0.0069$	$99.24$	$0.0288$
Randomized-LA, $\ell{=}{\ell_{\infty}},\epsilon{=}0.15$	$99.78$	$0.0076$	$99.28$	$0.0262$
Adversarial-LA, $\ell{=}{\ell_{\infty}},\epsilon{=}0.05,\alpha{=}0.01,k{=}10$	$99.26$	$0.0230$	$99.43$	$0.0216$

Table 9: Train and test accuracy (

\%

) as well as loss on MNIST. Comparison with standard training, versus our latent-space perturbations.

B.2 Additional Results on CIFAR-10

Results without Data Augmentation.

To provide a direct comparison between two generative models and eliminate the effect of data augmentation, we run additional experiments. Table 10 shows results for our latent perturbations without any data augmentation to train the normalizing flow and the classifier. In line with our FashionMNIST results in §5.3, we observe that both randomized and adversarial latent attacks overperform the standard baseline and the VAE-GAN based approach.

Method	Accuracy
Standard	$49.8$
VAE-GAN	$49.4$
Randomized-LA	$54.9$
Adversarial-LA	$58.2$

Table 10: Test accuracy (

\%

) on CIFAR-10, in the low-data regime (

5\%

of training samples) without any data augmentation.

Results with ResNet-20.

Table 11 summarizes our results using the ResNet-20, on CIFAR-10. Inline with our ResNet-18 results in §5.1, we observe that both randomized and adversarial latent attacks overperform the standard baseline.

Method	Accuracy
Standard	$65.6$	–
Randomized-LA, $\ell{=}\ell_{2},\epsilon{=}25.$	$72.7$
Adversarial-LA, $\ell{=}\ell_{2},\epsilon=.5$	$77.1$

Table 11: Test accuracy (

\%

) on CIFAR-10 using ResNet-20, in the low-data regime (

5\%

of the training set).

Results with Different Attack Parameters.

In Table 12, we provide results with varying hyperparameters for the different attacks. Observe that for Adversarial-LA, in the high perturbation setting—where $\epsilon=2.0$ , the classifier still didn’t fully fit to the training set, but performance in the test set is above the standard baseline.

Multi-step Training.

We run additional experiments where we sequentially apply different attack hyperparameters in multi-step training with weaker perturbations to increase the performance on the test set. The results are listed in Table 12, denoted with $+$ .

Perturbation	Train Accuracy	Train Loss	Test Accuracy	Test Loss
Baselines:
Standard	$100.0$	$0.002$	$95.2$	$0.194$
PGD, $\ell{=}{\ell_{2}},\epsilon{=}2.0,\alpha{=}0.5,k{=}10$	$61.13$	$0.895$	$75.7$	$0.731$
PGD, $\ell{=}{\ell_{\infty}},\epsilon{=}0.03,\alpha{=}0.008,k{=}10$	$77.3$	$0.521$	$86.3$	$0.442$
Ours:
Randomized-LA, $\ell{=}{\ell_{2}},\epsilon{=}10.0$	$99.8$	$0.007$	$95.8$	$0.161$
Randomized-LA, $\ell{=}{\ell_{\infty}},\epsilon{=}0.25$	$99.5$	$0.015$	$\mathbf{96.3}$	$0.142$
$+$ Randomized-LA, $\ell{=}\ell_{\infty},\epsilon{=}0.15$	$100.0$	$0.002$	$\mathbf{96.4}$	$0.133$
\hdashline
Adversarial-LA, $\ell{=}{\ell_{2}},\epsilon{=}1.0,\alpha{=}0.5,k{=}3$	$99.9$	$0.005$	$\mathbf{96.6}$	$0.126$
Adversarial-LA, $\ell{=}{\ell_{2}},\epsilon{=}2.0,\alpha{=}1.5,k{=}2$	$89.1$	$0.214$	$95.8$	$0.134$
$+$ Adversarial-LA, $\ell{=}{\ell_{2}},\epsilon{=}1.0,\alpha{=}0.75,k{=}2$	$99.2$	$0.030$	$96.5$	$0.114$
$+$ Adversarial-LA, $\ell{=}{\ell_{2}},\epsilon{=}0.75,\alpha{=}0.5,k{=}2$	$99.7$	$0.011$	$\mathbf{96.7}$	$0.115$
$+$ Randomized-LA, $\ell{=}\ell_{\infty},\epsilon{=}0.25$	$100.0$	$0.002$	$96.5$	$0.132$
$+$ Randomized-LA, $\ell{=}{\ell_{2}},\epsilon{=}10.0$	$100.0$	$0.002$	$96.6$	$0.131$

Table 12: Train and test accuracy (

\%

) as well as loss on CIFAR-10 using ResNet-18. All of the models are trained with the same hyperparameters listed in §A.2. Perturbations listed with the

+

sign indicates a multi-step training. For example, last row lists the result of the model trained with

P_{adv}^{\ell_{2}},\epsilon=2.0,\alpha=1.5,k=2

for 130 epochs,

P_{rand}^{\ell_{\infty}},\epsilon=0.25

for 40 epochs and

P_{rand}^{\ell_{2}},\epsilon=10.0

for 30 epochs. Note that, regardless of multi-step training, the hyperparameters, including the total number of training epochs (

=200

), remain fixed across the experiments.

Test sample	Randomized-LA	Difference	Adversarial-LA	Difference
${\bm{x}}$	$\tilde{\bm{x}}_{rand}$	$\tilde{\bm{x}}_{rand}-{\bm{x}}$	$\tilde{\bm{x}}_{adv}$	$\tilde{\bm{x}}_{adv}-{\bm{x}}$