StyleAugment: Learning Texture De-biased Representations
by Style Augmentation without Pre-defined Textures

Sanghyuk Chun
NAVER AI Lab
Song Park
Yonsei University

Abstract

Recent powerful vision classifiers are biased towards textures, while shape information is overlooked by the models. A simple attempt by augmenting training images using the artistic style transfer method, called Stylized ImageNet, can reduce the texture bias. However, Stylized ImageNet approach has two drawbacks in fidelity and diversity. First, the generated images show low image quality due to the significant semantic gap betweeen natural images and artistic paintings. Also, Stylized ImageNet training samples are pre-computed before training, resulting in showing the lack of diversity for each sample. We propose a StyleAugment by augmenting styles from the mini-batch. StyleAugment does not rely on the pre-defined style references, but generates augmented images on-the-fly by natural images in the mini-batch for the references. Hence, StyleAugment let the model observe abundant confounding cues for each image by on-the-fly the augmentation strategy, while the augmented images are more realistic than artistic style transferred images. We validate the effectiveness of StyleAugment in the ImageNet dataset with robustness benchmarks, such as texture de-biased accuracy, corruption robustness, natural adversarial samples, and occlusion robustness. StyleAugment shows better generalization performances than previous unsupervised de-biasing methods and state-of-the-art data augmentation methods in our experiments.

1 Introduction

While deep neural networks have shown the remarkable success comparing humans in complex vision recognition systems [13, 31], deep neural networks have shown disappointed generalization performances against unfamiliar corruptions, such as noises, blurs, small perturbations, visual filters, or occlusions [14, 12, 5]. This fundamental limitation often weaken practical usages of deep models in real-world deployment scenarios, such as self-driving cars [28]. As a naive approach for improving robustness against input distribution shifts, one can propose a data augmentation solution, i.e., augmenting corruptions during the training. However, the data augmentation approach is still not a perfect solution; a model trained with specific corruptions is only generalized to the seen corruptions, while the unseen corruption generalization is still not achievable [12, 5].

Recent breakthroughs improving robustness have appeared at the different side of researches; a number of studies have shown that improving clean input performances also can help the robust representation against input corruptions [40, 37, 23, 41]. Especially, the most powerful models utilize a large amount of diverse extra data points [35] by semi-supervised learning with a very strong teacher [38, 41] or learning with extra knowledge such as language supervision [30]. However, learning with hundreds of millions of data points is not a always accessible solution for various visual recognition tasks, while learning without extra data still far from the performances of the clean images e.g., a model with 78.9% clean accuracy only showing 28.1% corrupted input accuracy [41]. This implies that we need more high-level understanding of why the deep vision models are not generalizable to unseen distribution shifts.

Recently, Geirhos et al. [11] have shown that strong vision classifiers, e.g., ResNet [13], focus on the spurious texture cues, while shape information is overlooked by the network. To mitigate the texture bias, Geirhos et al. [11] generated abundant texture-ized images by historical artistic paintings ¹¹1The authors used $\sim$ 80K artistic images from Kaggle’s painter by numbers dataset (https://www.kaggle.com/c/painter-by-numbers)., named Stylized ImageNet, based on artistic style transfer methods [10, 21]. By reducing the texture bias of neural networks using Stylized ImageNet, the shape-biased models show robust prediction on distribution shifts and better downstream transfer learning performances on object detection [31]. Bahng et al. [1] have shown that the existing deep models only focusing on small discriminative regions, resulting in being biased towards local cues, such as color and texture. By expanding the effective receptive field of the model, Bahng et al. [1] showed that the texture unbiased performances and performances under distribution shifts are improved. From these observations, we presume that the texture bias is the source of unexpected behavior of deep neural networks.

Previous attempts to mitigate texture bias is usually focusing on utilizing strong human inductive bias. For example, Stylized ImageNet [11] needs a pre-defined specification of texture images to augmentation. Since it is unable to pre-define all possible textures in real-world, Geirhos et al. utilized a large historical artistic painting images for texture images. However, as shown in Figure 1, because the significant gap between artistic images and natural images, the stylized images from artistic images often show low fidelity. As a result, solely training with Stylized ImageNet shows worse performances than training with clean ImageNet (76.13% $\rightarrow$ 60.18% [11]). It shows that careful choice of texture images should be required for better performances. Furthermore, due to the computation and the resource limitation, each image in Stylized ImageNet is only transferred by a random artistic image, i.e., not all $\sim$ 80K images in Kaggle’s Painter by Numbers dataset are used for each image, but only one painting is chosen. This strategy can drastically reduce the diversity of the generated images. To sum up, the pre-computed stylized dataset has low fidelity due to the significant gap between natural images and artistic paintings and low diversity due to the dataset building strategy.

Similarly, Bahng et al. [1] also heavily relyed on the human inductive bias; the method should define an additional “biased” architecture by its design. For example, to reduce the texture bias of ResNet [13], Bahng et al. [1] utilizes BagNet [2] with restricted receptive field by changing the 3 $\times$ 3 convolution filters of ResNet to the 1 $\times$ 1 filters. Although previous attempts with heavy human inductive bias have shown impressive improvements in many robustness benchmarks (e.g., ImageNet-A [16], ImageNet-C [14], unbiased accuracies), it limits the practical usage in the tasks requiring a different inductive bias. Furthermore, theses methods can suffer from the carefully choice of pre-defined configurations, such as texture images [11] or architectural choice [1], while a mis-specification can seriously drops overall performances.

In this paper, we propose a new data augmentation method, StyleAugment, for de-biasing texture biases of deep neural networks by augmenting styles from natural images. Unlike previous style augmentation method [11], our method does not require any pre-defined texture images to avoid heavy dependency on the quality of the pre-defined texture images and sensitivity to mis-specifications. Rather than augmenting styles from pre-defined images, our method extracts styles from the natural images from the training mini-batch. Compared to the pre-stylization strategy by Geirhos et al., our online stylization from the mini-batch strategy has two advantages; the style images are from the natural images as the content images which reduces the semantic gap between the transferred images and the original images; a model can observe diverse styles for each image while the pre-stylization strategy only allows to observe a specific pre-defined texture for each image. StyleAugment also can be viewed as a Mixup variant [42, 40], but StyleAugment does not mix labels to let the model only focus on objectness, not confounding cues, such as backgrounds, textures, or color.

As a model trained with StyleAugment learns a texture de-biased representations, the model outperforms previous de-biasing methods (e.g., ReBias [1], LfF [29]) and in-distribution data augmentation methods (e.g., CutMix [40], Mixup [42]) in ImageNet-9 [1] robustness benchmarks, such as shape-texture de-biased accuracy [1], corruptions (ImageNet-C [14]), natural adversarial samples crawled from the web (ImageNet-A [16]), and occluded samples. Our experimental results on ImageNet-1k [32] and CIFAR-10 [25] show a similar conclusion; StyleAugment improves the robustness against the input distribution shifts, e.g., ImageNet-C or CIFAR-10-C benchmarks [14]. We also validate the design choice of StyleAugment in ImageNet-9. The results show that other design choices, such as changing the stylization method from AdaIN [21] to WCT² [39] and mixing labels as Mixup variants [42, 40] improve the in-distribution generalization, but show performance drops in robustness benchmarks.

2 Related Works

2.1 Unsupervised de-biasing

While previous de-biasing method, such as fairness, assumes the existence of bias labels (e.g., assuming that there exists the protected attribute labels during the training), it is often unrealistic to assume the all bias labels are accessible by the model; labeling requires a huge human annotator costs, and the bias labels are often ill-defined, e.g., it is difficult to categorize natural textures in pre-defined texture sets. To avoid the direct use of bias labels, de-biasing without explicit bias annotation has been actively studied in recent years. Especially, utilizing two networks, one for capturing bias and the other for de-biasing from the biased network, have been one of the major branches in this field. For example, in VQA tasks, the networks are known to be easily biased towards much easier text representations, while ignoring visual representations. To mitigate the issue, RUBi [3] and LearnedMixin [6] adjusted the final logit before the cross-entropy loss of the multi-modal model by using predictions from the uni-modal model where RUBi uses multiplication and LearnedMixinH uses summation. In the uni-modal debiasing tasks, HEX [36] employs a texture extractor model and lets features extracted by the target model be orthogonal to the texture extractor outputs. ReBias [1] makes two models have different capacity so a model is biased towards specific cues, e.g., color or texture. Learning from Failure (LfF) [29] uses the re-weighting cross-entropy where the biased network is trained by the knowledge that networks in the early stage of training rely more on the spurious correlation than networks in the late stage of training.

While previous unsupervised de-biasing methods utilize additional biased model, our method does not need any additional biased model, but only requires a pre-trained style transfer algorithm, such as AdaIN [21]. In the experiments, our method outperforms previous de-biasing methods in ImageNet clean accuracy as well as the ImageNet robustness benchmarks, such as texture de-biased accuarcy [1], ImageNet-C [14] and ImageNet-A [16].

2.2 Data augmentation methods

Since deep models are data hungry [27], synthesizing abundant training images by random crop, random flip, and color jittering have become a rule-of-thumb for enhancing the generalizability of deep models [34, 20]. Recent data augmentation methods have attended on either synthesizing mixed samples [42, 40], or applying a series of very strong image filters, such as equalize, solarize, Cutout [9], to make difficult samples [7, 15, 8]. Our method can be viewed as a variant of Mixup approaches, such as Mixup [42], CutMix [40], while our method does not mix labels of samples to prevent a model attending on spurious correlations, (i.e., textures) rather than the objectness (i.e., shape). In this study, we did not compare our method with augmentation methods with visual filters, such as AugMix [15] or RandAugment [8], because these algorithms need to pre-define augmentation types by a strong human prior knowledge, while our goal is to make universally applicable data augmentation algorithm without a strong inductive bias.

2.3 Stylization-based augmentation methods

Since Geirhos et al. [11] have shown that deep convolutional neural networks are biased towards textures, and a simple stylization-based augmentation (StyleImageNet) can mitigate the texture biases, stylization-based augmentation methods have been studied in a robustness view point. While the previously proposed data augmentation methods showed that the performance improvements in existing robustness benchmarks [42, 40, 15], the data augmentation methods for in-distribution generalization is known to be not helpful when the semantic gap between the training images and the test images is significantly large, such as domain generalization benchmarks [4]. From this motivation, Zhou et al. [43] proposed a Mixup-like stylization augmentation method named MixStyle for domain generalization benchmark. MixStyle mixes two images from different domains to let the model be generalized to diverse domains. Our method focuses on a conventional image recognition task (i.e., the ImageNet classification task) in the de-biasing aspects (i.e., additional ImageNet robustness benchmarks). The stylization-based augmentation strategy also studied by Hong et al. [19], a contemporary work of our study. Hong et al. proposed to utilize both the content loss and the style loss by mixing labels as Mixup [42] or CutMix [40]. Our method does not mix the labels to avoid confounding factors, i.e., textures.

Stylized ImageNet [11] is proposed to mitigate the texture bias of deep models, not only learning shape-biased features but also showing performance improvements in downstream tasks, such as ImageNet classification, object detection, and the ImageNet-C corruption robustness benchmark. Stylized ImageNet, however, has fundamental limitations on its image fidelity (as the significant gap between the artistic paintings and the target natural images) and lack of the style diversity for each image (as the data generation process generating all training images before training). Our method mitigates these two issues by utilizing the natural images from the mini-batch as the style reference, showing better fidelity and diversity as shown in Figure 1.

3 Style Augmentation Without the Pre-defined Styles or Textures

3.1 Preliminary: Arbitrary Style Transfer by Adaptive Instance Normalization (AdaIN)

Arbitrary style transfer tasks [10, 21, 26] aim to generate an image with the given two images, i.e., content and style images. The underlying assumption of style transfer methods since Gatys et al. [10] is that the feature statistics, including mean and standard deviation, represent the texture of images. Primal studies [10] directly optimize the input image to match the feature statistics of the content (mean) and the style (covariance, or Gram matrix) in an iterative manner. Real-time style transfer methods, such as whitening-coloring-transform (WCT) [26] or adaptive instance normalization (AdaIN) [21], approximate the optimization process by replacing feature statistics of content and style images. We use AdaIN style transfer method for real-time data augmentation.

Let $Enc$ and $Dec$ be an encoder and a decoder. AdaIN style transfer first extracts a feature $f$ from the input image $x$ by $f=Enc(x)$ . For the content and style images $c,s$ and their corresponding features $f_{c},f_{s}$ , the AdaIN operation is defined as follows:

\text{AdaIN}(f_{c},f_{s})=\sigma(f_{s})\left(\frac{f_{c}-\mu(f_{c})}{\sigma(f_{c})}+\mu(f_{c})\right),

(1)

where $\mu(\cdot),\sigma(\cdot)$ denotes the feature statistics from the instance normalization. The transferred feature is decoded by the decoder as $\tilde{x}=Dec(AdaIN(f_{c},f_{s}))$ . We use the ImageNet-trained VGG-16 network [33] as the encoder $Enc$ , and the official AdaIN decoder as the decoder $Dec$ following the official implmentation²²2https://github.com/xunhuang1995/AdaIN-style.

⬇

1def train_iteration(inputs, targets):

2 # inputs: a standard Torch image array

3 # targets: a standard Torch label array

4 rand_index = torch.randperm(inputs.size()[0])

6 # styletransfer: an arbitrary function for style transfer. The former argument is the content and the latter is the style.

7 transferred = styletransfer(inputs, inputs[rand_index])

8 inputs = torch.cat([inputs, transferred], dim=0)

9 targets = torch.cat([targets, targets], dim=0)

11 # model is a regular CNN and criterion is the cross entropy function

12 outputs = model(inputs)

13 loss = criterion(outputs, targets)

15 # optim: a standard optimizer such as Adam or AdamP

16 optim.zero_grad()

17 loss.backward()

18 optim.step()

Figure 2: PyTorch pseudo code for StyleAugment

3.2 StyleAugment: a stylization-based augmentation without pre-defined textures

For the given mini-batch $\mathcal{B}=(x_{1},x_{2},\ldots,x_{n})$ with batch size $n$ , StyleAugment generates a new augmented mini-batch $\mathcal{B}^{\prime}$ by augmenting stylized images where the style references are from the other samples in the mini-batch. In each training iteration, StyleAugment randomly combines styles from the mini-batch to make diverse image samples. In practice, we train the model with the original mini-batch and the augmented mini-batch, i.e., $\mathcal{B}+\mathcal{B}^{\prime}$ . We illustrate the PyTorch pseudo code for StyleAugment in Figure 2. StyleAugment does not require any prior knowledge on the pre-defined textures (e.g., Stylized ImageNet [11]) or strong image visual filters (e.g., AutoAugment [7], AugMix [15], RandAugment [8]). StyleAugment training strategy enables the mini-batch-level knowledge interaction as Mixup variants [42, 40], but StyleAugmentation does not mix labels which can make the model rely on spurious correlations, such as textures or backgrounds. We will discuss the details in §4.3.

3.3 Discussions

Compared to the artistic stylized images, natural stylized images show benefits on the image fidelity. Figure 1 shows the example stylized images by artistic style transfer (Figure 1(b)) and by the proposed StyleAugment (Figure 1(c)). Interestingly, the images generated by StyleAugment show diverse distribution shifts, rarely observed in real-world. For example, in the second row of the figure, the background of the globefish seems as a grassfield. Similarly, in the fourth row, the dog shows the “eye” texture from the butterfly image, and the background changes from snow to the grass-like texture. In other words, StyleAugment generates an image with uncommon correlations, such as “a globefish on the grassfield” or “a dog with butterfly patterns”. By augmenting rare combinations of the true objectness and the other confounding factors, StyleAugment makes a model learn de-biased representations to the spurious correlations, such as background, textures, or color.

However, StyleAugment still has a limitation on its image quality. For example, as shown in the first row of Figure 1, StyleAugment tends to preserve the original content information compared to artistic style transfer, but there exists the damage of the content information. This may hurt the final performance of the model as shown in [11]. To understand the trade-off between shape-preserving and stylization, we evaluate our method with a photorealistic style transfer method [39], focusing on the edge preserving, but only transferring color information. Our experimental results show that StyleAugment with AdaIN shows worse results in in-distribution generalization performances, but better results in distribution shift generalization. We will discuss the details in §4.3.

4 Experiments

In this section, we demonstrate the effectiveness of StyleAugment in ImageNet classification tasks. We also compare our design choice with other possible variants of StyleAugment.

4.1 Experiments Settings

Model	Clean	Unbiased Acc [1]	ImageNet-C [14]	ImageNet-A [16]	Occlusion
Vanilla (ResNet-18 [13])^†	90.8	88.8	54.2	24.9	71.3
Biased (BagNet-18 [2])^†	67.7	65.9	31.7	18.8	59.7
LearnedMixin + H [6]^†	64.1	62.7	27.5	15.0	33.5
RUBi [3]^†	90.5	88.6	53.7	27.7	71.3
ReBias [1]^†	91.9	90.5	57.5	29.6	73.4
LfF [29]	93.2	92.0	57.8	28.1	77.0
CutMix [40]	93.8	91.8	54.6	27.1	83.1
Mixup [42]	93.2	91.4	61.5	33.4	77.9
Stylized ImageNet [11]^†	88.4	86.6	61.1	24.6	64.4
StyleAugment	93.8	92.6	65.3	29.6	73.0
StyleAugment + AdamP [17]	95.9	94.8	72.5	32.1	75.8

Table 1: Comparison of state-of-the-art de-biasing and augmentation methods on the ImageNet-9 validation dataset. We measure the ImageNet-9 top-1 validation accuracy (Clean), the unbiase accuracy using texture clustering (Unbiased Acc) following Bahng et al. [1], ImageNet-C top-1 accuracy, ImageNet-A top-1 accuracy, and the top-1 accuracy on occluded samples. The first and the second best methods are denoted in bold numbers and underlined numbers. The rows with ^† denotes the same weights from Bahng et al. [1].

	Clean	ImageNet-A [16]	ImageNet-C [14]	Noise	Blur	Weather	Digital
ResNet-18	69.8	1.1	30.1	30.8	18.9	12.3	9.4
+ StyleAugment	68.3 (-1.5)	2.1 (+1.0)	35.8 (+5.6)	39.0 (8.1)	25.1 (6.2)	19.9 (7.6)	17.2 (7.8)
ResNet-50	76.1	0.8	36.2	41.0	26.1	19.3	16.5
+ StyleAugment	73.8 (-2.3)	3.5 (2.7)	43.6 (7.4)	53.2 (12.1)	38.7 (12.6)	34.5 (15.1)	32.8 (16.3)

Table 2: Impact of StyleAugment on the ImageNet-1k validation dataset. Clean accuracy, ImageNet-A top-1 accuracy, ImageNet-C top-1 accuracy, and the average performances on the ImageNet-C subsets are shown. We use the official ResNet models provided by the PyTorch vision library. Note that rows with “+StyleAugment” are trained without conventional image distortions such as color jittering and lightening, while the baseline methods are trained with the image distortions.

Dataset.

We use the ImageNet [32] classification benchmarks for measuring the effectiveness of our method. In the main experiments, we use the subset of ImageNet with 9 super-class (ImageNet-9) as Bahng et al. [1]. ImageNet-9 consists 54,600 training images and 2,100 test images. We also measure the generalization performance of the models using distribution shifted ImageNet images. First, we measure the unbiased accuracies of ImageNet-9 as Bahng et al. The unbiased accuracy measures the average of combination-wise accuracies where the combination is found by the texture clustering algorithm. The unbiased accuracy is computed by taking an average over all possible combinations of texture clusters and image labels. Showing better unbiased accuracy means the model is less biased towards spurious texture information. ImageNet-C [14] contains 20 corruptions³³3The original ImageNet-C paper suggests to use 15 corruptions for evaluation, while 5 corruptions are remained as “test” set. We use all 20 corruptions to compute ImageNet-C performances of the models., such as Gaussian noise, motion blur, weather changes. We measure the average performances over 20 corruptions and 5 severties. An improved performance on ImageNet-C implies that ImageNet-A [16] is a collection of the failure cases of the ImageNet-trained ResNet-50 [13] model (called “natural adversarial examples”). As previous studies [35, 38, 41, 30] have observed, achieving high ImageNet-A performances without extra dataset is a challenging task, without considering architectural changes [18]. Therefore, better ImageNet-A accuracy (without extra dataset or architecture changes) indirectly shows that the model less relies on shortcuts in the datasets for their predictions. Finally, we report the performances of the center occluded images for measuring occlusion robustness. Following [40], we zero-ed out the center pixels with the 112 $\times$ 112 patch size.

Implementation details:

For fair comparisons, we follow the setting of Bahng et al. [1] for ImageNet-9 experiments; We use the ResNet-18 [13] backbone with the batch size as 128, the initial learning rate as 0.001, and the cosine learning rate scheduling. The models are trained with 120 epochs. We exclude all image distortion-based augmentations, such as color jittering and lightening for precisely understanding the effectiveness of each method. For ImageNet-1k experiments, we use the same setting as ImageNet-9, and the models are trained with 90 training epochs. We trained ResNet-18 and ResNet-50 with StyleAugment for ImageNet-1k.

Finally, we additionally report CIFAR-10 [25] results, where the batch size is set to 128, the number of training epochs is set to 100, and the initial learning rate is set to 0.1 decayed by the cosine annealing scheduling.

Note that StyleAugment doubles the number of training samples as shown in Figure 2. We set the number of clean samples as the half of the batch size in all experiments.

All experiments except ImageNet-9 adopts AdamP optimizer [17] (ImageNet-9 experiments use Adam [24]). We use 2 V100 GPUs for ImageNet experiments, and 1 V100 GPU for CIFAR-10 experiments. NAVER Smart Machine Learning (NSML) [22] is used for all experiments.

Comparison methods.

We compare our StyleAugment with two major directions of researches: unsupervised de-biasing methods and data augmentation without human prior knowledge. We compare StyleAugment with four unsupervised de-biasing methods, including LearnedMixin + H [6], RUBi [3], ReBias [1], and Learning from Failure (LfF) [29]. These methods are designed to overcome the shortcut learning (i.e., bias problems) of the models. We also evaluate three data augmentation methods in the same benchmark: CutMix [40], Mixup [42] and Stylized ImageNet [11]. Other augmentation methods requiring extra knowledge on image distortions (e.g., solorize, equalize) are not compared in this study to avoid unexpected effects by the additional image distortions.

	CIFAR-10 test	CIFAR-10-C
ResNet-18	92.3	66.1
+StyleAugment	92.3 (+0.0)	67.6 (+1.5)

Table 3: Impact of StyleAgument on CIFAR-10 dataset. CIFAR-10 test accuracy and CIFAR-10 corrupted (CIFAR-10-C) performances are shown.

4.2 Main results

ImageNet-9.

Table 1 shows the comparison of de-biasing methods, data augmentation methods, and our StyleAugment. In the table, we first observe that data augmentation methods show remarkable improvements in in-distribution accuracies (93.8% by CutMix and 93.2% by Mixup) comparing to de-biasing methods, such as ReBias and LfF, but their unbiased accuracies are worse than LfF (92.0% by LfF, 91.8% by CutMix and 91.4% by Mixup).

In ImageNet-C benchmarks, de-biasing methods and CutMix shows marginal improvements against the baseline (baseline: 54.2%, ReBias: 57.5%, LfF: 57.8%, CutMix: 54.6%), while Mixup and Stylized ImageNet show significant ImageNet-C performance improvements (Mixup: 61.5%, Stylized ImageNet 61.1%). In particular, Stylized ImageNet shows worse performances in clean, unbiased accuracy, ImageNet-A and occlusion benchmarks, but shows remarkable performance improvement in ImageNet-C. This implies that stylization helps the robustness against common corruptions, but its low visual quality hurts the in-distribution generalization performance as well as other unbiased performances.

We also observe that ReBias (24.9% $\rightarrow$ 29.6%) and LfF (24.9% $\rightarrow$ 28.1%) show remarkable improvements in ImageNet-A performances, while CutMix (27.1%) and Stylized ImageNet (24.6%) show marginal improvements compared to ReBias and LfF. This result implies that de-biasing methods are helpful for improving robustness against spurious correlations.

Finally, we observe that our StyleAugment shows the best performances in in-distribution generalization (same as CutMix – 93.8%), unbiased accuracy (92.6%, while 91.8% CutMix is the second best one), and ImageNet-C benchmark (with a large margin to 61.5% by Mixup, StyleAugment shows 65.3%). Although StyleAugment shows worse ImageNet-A performances, StyleAugment shows the same ImageNet-A performances with ReBias. Since StyleAugment focuses on learning feature distribution shifts, StyleAugment only shows marginal improvements in occlusion benchmark (71.3% $\rightarrow$ 73.0%).

We also report the results trained by StyleAugment and AdamP optimizer [17] showing strong performance improvements in various tasks, including computer vision, robustness, and audio tasks. By using AdamP, we improve the performances of StyleAugment by significant gaps for clean accuracy (93.8% $\rightarrow$ 95.9%), unbiased accuray (92.6% $\rightarrow$ 94.8%), ImageNet-C performance (65.3% $\rightarrow$ 72.5%), ImageNet-A performance (29.6% $\rightarrow$ 32.1%), and occlusion benchmark (73.0% $\rightarrow$ 75.8%).

Model	Clean	Unbiased Acc [1]	ImageNet-C [14]	ImageNet-A [16]	Occlusion
StyleAugment (proposed)	93.8	92.6	65.3	29.6	73.0
StyleAugment + WCT²[39]	94.1 (+0.3)	92.2 (-0.4)	57.0 (-8.3)	31.7 (+2.1)	77.6 (+4.6)
StyleAugment + Label mixing	94.2 (+0.4)	92.7 (+0.1)	62.0 (-3.3)	29.3 (-0.3)	76.0 (3.0)

Table 4: Ablation study on design choices of StyleAugment. We compare the peformances of StyleAugment variants on ImageNet-9 benchmark as Table 1.

ImageNet-1k and CIFAR-10

Table 2 shows the impact of StyleAugment in the ImageNet-1k benchmark. Note that our StyleAugment results did not use the standard augmentations, such as color jittering and lightening. In the table, we observe that the StyleAugmented models show slightly worse performances in the in-distribution generalization performances (due to the lack of the standard augmentations). However, in other robustness benchmarks including ImageNet-A and ImageNet-C, the StyleAugmented models show better performances than the vanilla counterparts even with worse clean accuracies. We observe similar results in Table 3 for CIFAR-10 experiments. The StyleAugmented model shows comparable clean accuracy, but better corruption robustness compared to the vanilla ResNet-18.

Implication and limitation.

As we observed in ImageNet-9 and CIFAR-10-C experiments, our StyleAugment improves the overall performances of the deep models when the number of training data points is small (54K for ImageNet-9, 50K for CIFAR-10). On the other hand, ImageNet-1k results show that StyleAugment improves the robustness of the model, while the model shows performance drops in clean accuracy. We presume that it is because our StyleAugment implementation does not contain the standard color jittering and lightening augmentations. Also, we presume that it is due to the low image fidelity of AdaIN transferred images. As shown in Geirhos et al. [11], the performance can be improved by applying additional “fine-tuning” process on the clean training images. We did not test fine-tuning strategy in this study, but we assume the image quality affects a lot to the in-distribution performances. As a primal study, in the following section, we investigate the effect of the style transfer method to in-distribution generalization and out-of-distribution generalization.

4.3 Ablation studies

We conduct ablation studies of design choices of StyleAugment in ImageNet-9. We tested two variants of StyleAugment in the image quality and the target label. First, as we discussed in the previous sections, the image quality by AdaIN is not perfect (Figure 1). Especially, AdaIN loses edge and shape information of the original image. We use a photorealistic style transfer model WCT², focusing on preserving edge information by the Haar wavelet transform. Note that WCT² transfers color and light information, while AdaIN transfers texture information. In the second row of Table 4, we report the StyleAugment performances when AdaIN encoder and decoder modules are changed to WCT² encoder and decoder. For real-time computation, we modify the original WCT² algorithm from whitening-coloring-transform [26] to adaptive instance normalization (Equation (1)). We observe that StyleAugment with WCT² shows better in-distribution generalization performance (94.1%) than StyleAugment with AdaIN (93.8%), as well as ImageNet-A (29.6% $\rightarrow$ 31.7%) and occlusion performances (73.0% $\rightarrow$ 77.6%). However, StyleAugment with WCT² shows drastic performance drops in ImageNet-C (65.3% $\rightarrow$ 57.0%) by losing texture information. To sum up, the better image quality by content preservation improves in-distribution generalization performances, but cannot generalize the corruption robustness by losing advantages of texture transferring.

We also test a variant on the target labels of StyleAugment. First, we mix content and style labels for the target label as mixup variants. We observe similar phenomenon to the results of WCT²; it improves clean accuracy, but drops ImageNet-C performance. We assume that it is because StyleAugment allows the model to observe a sample with various spurious cues, such as texture and background (as shown in Figure 1). However, if we mix the content and style labels, the model can attend unexpected prediction cues, rather than object information itself.

5 Conclusion

In this paper, we propose a new data augmentation method using stylization methods. The proposed StyleAugment generates augmented images by applying AdaIN style transfer between the mini-batch samples, while the previous stylization-based approach, Stylized ImageNet, uses pre-defined artistic paintings. Compared to Stylized ImageNet, the model trained with our StyleAugment can observe more diverse and realistic images with various confounding factors such as backgrounds, textures. In the experiments, we show that our StyleAugment shows improvements in robustness benchmarks, such as corruption robustness, while showing comparing or outperforming in-distribution generalization performances. While changes in the stylization method or the label mixing strategy improve the in-distribution generalization performances, the changes show worse robustness performances. It shows that our StyleAugment strategy makes images with various spurious correlations from style images, e.g., texture, resulting in improvements of robustness performances.

References

[1] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.
[2] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations (ICLR), 2019.
[3] Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. Rubi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pages 839–850, 2019.
[4] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. arXiv preprint arXiv:2102.08604, 2021.
[5] Sanghyuk Chun, Seong Joon Oh, Sangdoo Yun, Dongyoon Han, Junsuk Choe, and Youngjoon Yoo. An empirical evaluation on robustness and uncertainty of regularization methods. In ICML Workshop on Uncertainty and Robustness in Deep Learning., 2019.
[6] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4069–4082, 2019.
[7] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[8] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[9] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[11] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019.
[12] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, pages 7538–7550, 2018.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.
[15] Dan Hendrycks*, Norman Mu*, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020.
[16] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021.
[17] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. Adamp: Slowing down the slowdown for momentum optimizers on scale-invariant weights. In International Conference on Learning Representations (ICLR), 2021.
[18] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. In International Conference on Computer Vision (ICCV), 2021.
[19] Minui Hong, Jinwoo Choi, and Gunhee Kim. Stylemix: Separating content and style for enhanced data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14862–14870, 2021.
[20] Gao Huang, Zhuang Liu, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[21] Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In International Conference on Computer Vision (ICCV), 2017.
[22] Hanjoo Kim, Minkyu Kim, Dongjoo Seo, Jinwoong Kim, Heungseok Park, Soeun Park, Hyunwoo Jo, KyungHyun Kim, Youngil Yang, Youngkwan Kim, et al. NSML: Meet the MLaaS platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.
[23] Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pages 5275–5285. PMLR, 2020.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[26] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, 2017.
[27] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[28] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
[29] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In Advances in Neural Information Processing Systems, 2020.
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[34] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[35] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18583–18599. Curran Associates, Inc., 2020.
[36] Haohan Wang, Zexue He, and Eric P. Xing. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations, 2019.
[37] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and Quoc V Le. Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 819–828, 2020.
[38] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
[39] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In International Conference on Computer Vision (ICCV), 2019.
[40] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
[41] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling imagenet: from single to multi-labels, from global to localized labels. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[42] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[43] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In ICLR, 2021.

StyleAugment: Learning Texture De-biased Representations by Style Augmentation without Pre-defined Textures