PRIME: A Few Primitives Can Boost Robustness to Common Corruptions
Abstract
Despite their impressive performance on image classification tasks, deep networks have a hard time generalizing to unforeseen corruptions of their data. To fix this vulnerability, prior works have built complex data augmentation strategies, combining multiple methods to enrich the training data. However, introducing intricate design choices or heuristics makes it hard to understand which elements of these methods are indeed crucial for improving robustness. In this work, we take a step back and follow a principled approach to achieve robustness to common corruptions. We propose PRIME, a general data augmentation scheme that relies on simple yet rich families of max-entropy image transformations. PRIME outperforms the prior art in terms of corruption robustness, while its simplicity and plug-and-play nature enable combination with other methods to further boost their robustness. We analyze PRIME to shed light on the importance of the mixing strategy in synthesizing corrupted images, and to reveal the robustness-accuracy trade-offs arising in the context of common corruptions. Finally, we show that the computational efficiency of our method allows it to be easily used in both on-line and off-line data augmentation schemes. Our code is available at https://github.com/amodas/PRIME-augmentations.
1 Introduction
Deep image classifiers do not work well in the presence of various types of distribution shifts [14, 18, 42]. Most notably, their performance can severely drop when the input images are affected by common corruptions that are not contained in the training data, such as digital artefacts, low contrast, or blurs [21, 29]. In general, “common corruptions” is an umbrella term coined to describe the set of all possible distortions that can happen to natural images during their acquisition, storage, and processing lifetime, which can be very diverse. Nevertheless, while the space of possible perturbations is huge, the term “common corruptions” is generally used to refer to image transformations that, while degrading the quality of the images, still preserve their semantic information.
Building classifiers that are robust to common corruptions is far from trivial. A naive solution is to include data with all sorts of corruptions during training, but the sheer scale of all possible types of typical perturbations that might affect an image is simply too large. Moreover, the problem is per se ill-defined since there exists no formal description of all possible common corruptions.
To overcome this issue, the research community has recently favoured increasing the “diversity” of the training data via data augmentation schemes [10, 22, 20]. Intuitively, the hope is that showing very diverse augmentations of an image to a network increases the chance that the network becomes invariant to some common corruptions. Still, covering the full space of common corruptions is hard. Hence, current literature has mostly resorted to increasing the diversity of augmentations by designing intricate data augmentation pipelines, e.g., introducing DNNs for generating varied augmentations [20, 5] or coalescing multiple techniques [44], thereby achieving good performance on different benchmarks. This strategy, though, leaves a big range of unintuitive design choices, making it hard to pinpoint which elements of these methods meaningfully contribute to the overall robustness. Meanwhile, the high complexity of recent methods [44, 5] makes them impractical for large-scale tasks, while other methods are tailored to particular datasets and might not be general enough. Nonetheless, the problem of building robust classifiers is far from solved, and the gap between robust and standard accuracy remains large.

In this work, we take a step back and provide a systematic way to design a simple, yet effective data augmentation scheme. By focusing on first principles, we formulate a new mathematical model for semantically-preserving corruptions, and build on basic concepts to characterize the notions of transformation strength and diversity using a few transformation primitives. Relying on this model, we propose PRIME, a data augmentation scheme that draws transformations from a max-entropy distribution to efficiently sample from a large space of possible distortions (see Fig. 1). The performance of PRIME alone already tops the current baselines on different common corruption datasets, while PRIME can also be combined with other methods to further boost their performance. Moreover, the simplicity and flexibility of PRIME make it easy to understand how each of its components contributes to improving robustness.
Altogether, the main contributions of our work include:
- We introduce PRIME, a simple method that is built on a few guiding principles, which efficiently boosts robustness to common corruptions.
- We experimentally show that PRIME, despite its simplicity, achieves state-of-the-art robustness on multiple corruption benchmarks.
- Last, our thorough ablation study sheds light on the necessity of having diverse transformations, on the role of mixing in the success of current methods, on the potential robustness-accuracy trade-off, and on the importance of online augmentations.
Overall, PRIME is a simple model-based scheme that can be easily understood, ablated, and tuned. Our work is an important step in the race for robustness against common corruptions, and we believe that it has the potential to become the new baseline for learning robust classifiers.
2 General model of visual corruptions
In this work, motivated by the “semantically-preserving” nature of common corruptions, we define a new model of typical distortions. Specifically, we leverage the long tradition of image processing in developing techniques to manipulate images while retaining their semantics and construct a principled framework to characterize a large space of visual corruptions.
Let $\mathbf{x}: \mathbb{R}^2 \to \mathbb{R}^3$ be a continuous image (in practice, we work with discrete images on a regular grid) mapping pixel coordinates to RGB values. We define our model of common corruptions as the action on $\mathbf{x}$ of the following additive subgroup of the near-ring of transformations [4]

$$\mathbb{T}(\mathbf{x}) = \Big\{ \lambda_0\, \mathbf{x} + \sum_{i=1}^{n} \lambda_i\, (\omega_i \circ \tau_i \circ \gamma_i)(\mathbf{x}) \;:\; \lambda_i \geq 0,\ \sum_{i=0}^{n} \lambda_i = 1 \Big\}, \tag{1}$$

where $\omega_i$, $\tau_i$ and $\gamma_i$ are random primitive transformations which distort $\mathbf{x}$ along the spectral ($\omega$), spatial ($\tau$), and color ($\gamma$) domains. As we will see, defining each of these primitives in a principled and coherent fashion will be enough to construct a set of perturbations which covers most types of visual corruptions.
To guarantee as much diversity as possible in our model, we follow the principle of maximum entropy to define our distributions of transformations [8]. Note that using a set of augmentations that guarantees maximum entropy comes naturally when trying to optimize the sample complexity derived from certain information-theoretic generalization bounds, both in the clean [45] and corrupted settings [28]. Specifically, the principle of maximum entropy postulates favoring those distributions that are as unbiased as possible given the set of constraints that define a family of distributions. In our case, these constraints are given in the form of an expected strength $\sigma$, some boundary conditions, e.g., the displacement field must be zero at the borders of an image, and finally the desired smoothness level $K$. The principle of smoothness helps formalize the notion of physical plausibility, as most naturally occurring processes are smooth.
Formally, let $\mathcal{X}$ denote the space of all images, and let $g \sim \mu$ be a random image transformation distributed according to the law $\mu$. Further, let us define a set of constraints $\mathcal{C} \subseteq \mathcal{F}$, which restricts the domain of applicability of $\mu$, i.e., $\operatorname{supp}(\mu) \subseteq \mathcal{C}$, where $\mathcal{F}$ denotes the space of functions $\mathcal{X} \to \mathcal{X}$. The principle of maximum entropy postulates using the distribution $\mu^\star$ which has maximum entropy given the constraints:

$$\mu^\star = \operatorname*{arg\,max}_{\mu} \; H(\mu) \quad \text{subject to} \quad \operatorname{supp}(\mu) \subseteq \mathcal{C}, \tag{2}$$

where $H(\mu)$ represents the entropy of the distribution $\mu$ [8]. In its general form, solving Eq. 2 for any set of constraints is intractable. In Appendix 0.A, we formally derive the analytical expressions for the distributions of each of our families of transformations, by leveraging results from statistical physics [1].
In what follows, we describe the analytical solutions to Eq. 2 for each of our basic primitives. In general, these distributions are governed by two parameters: $K$ to control smoothness, and $\sigma$ to control strength. These transformations fall back to identity mappings when $\sigma = 0$, independently of $K$.
Spectral domain We parameterize the distribution of random spectral transformations using random filters $h_\omega$, such that the transformation output follows

$$\omega(\mathbf{x}) = (\delta + h_\omega) * \mathbf{x}, \tag{3}$$

where $*$ is the convolution operator, $\delta$ represents a Dirac delta, i.e., an identity filter, and $h_\omega$ is implemented in the discrete grid as an FIR filter of size $K \times K$ with i.i.d. random entries distributed according to $\mathcal{N}(0, \sigma^2/K^2)$. Here, $\sigma$ governs the transformation strength, while a larger $K$ yields filters of higher spectral resolution. The bias $\delta$ retains the output close to the original image.
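For concreteness, the following is a minimal sketch of how such a random spectral transformation can be implemented in PyTorch. It assumes pixel values in [0, 1] and applies one independent random filter per channel (depthwise convolution); the function name and defaults are illustrative and not the exact PRIME implementation.

```python
import torch
import torch.nn.functional as F

def random_spectral_transform(x, k=3, sigma=1.0):
    """Sketch of the spectral primitive of Eq. 3: convolve each channel of the
    image with (identity + random FIR filter) whose taps are i.i.d. Gaussian.

    x: image tensor of shape (C, H, W) with values in [0, 1].
    k: FIR filter size (smoothness parameter K).
    sigma: transformation strength.
    """
    c = x.shape[0]
    # Identity (Dirac delta) filter of size k x k.
    delta = torch.zeros(k, k)
    delta[k // 2, k // 2] = 1.0
    # Random filter with i.i.d. N(0, sigma^2 / k^2) taps (equipartition of the
    # expected strength over the k^2 coefficients), one filter per channel.
    h = sigma / k * torch.randn(c, 1, k, k)
    kernel = delta + h  # broadcasts to shape (c, 1, k, k)
    # Depthwise convolution so that each channel is filtered independently.
    y = F.conv2d(x.unsqueeze(0), kernel, padding=k // 2, groups=c)
    return y.squeeze(0).clamp(0.0, 1.0)

x = torch.rand(3, 224, 224)
x_aug = random_spectral_transform(x, k=3, sigma=0.8)
```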
Spatial domain We model our distribution of random spatial transformations, which apply random perturbations over the coordinates of an image, as

$$\tau(\mathbf{x})(u, v) = \mathbf{x}\big(u + \tau_u(u, v),\, v + \tau_v(u, v)\big). \tag{4}$$

This model has been recently proposed in [34] to define a distribution of random smooth diffeomorphisms in order to study the stability of neural networks to small spatial transformations. To guarantee smoothness but preserve maximum entropy, the authors propose to parameterize the vector field $(\tau_u, \tau_v)$ as

$$\tau_u(u, v) = \sum_{i^2 + j^2 \leq K^2} a_{ij} \sin(\pi i u) \sin(\pi j v), \qquad \tau_v(u, v) = \sum_{i^2 + j^2 \leq K^2} b_{ij} \sin(\pi i u) \sin(\pi j v), \tag{5}$$

where $a_{ij}, b_{ij} \sim \mathcal{N}\big(0, \sigma^2 / (i^2 + j^2)\big)$. Such a choice guarantees that the resulting mapping is smooth according to the cut frequency $K$, while $\sigma$ determines its strength.
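Below is a minimal sketch of this spatial primitive, assuming grid_sample-style normalized coordinates; the reference implementation of [34] handles the displacement bounds more carefully, and the function name and defaults here are illustrative.

```python
import torch
import torch.nn.functional as F

def random_diffeomorphism(x, cutoff=4, sigma=0.05):
    """Sketch of the spatial primitive of Eqs. 4-5: a smooth random displacement
    field parameterized by a truncated sine basis, applied with grid_sample.

    x: image tensor of shape (C, H, W); cutoff: cut frequency K; sigma: strength.
    """
    _, h, w = x.shape
    u = torch.linspace(0.0, 1.0, w)
    v = torch.linspace(0.0, 1.0, h)
    vv, uu = torch.meshgrid(v, u, indexing="ij")  # (H, W) coordinate grids

    def field():
        d = torch.zeros(h, w)
        for i in range(1, cutoff + 1):
            for j in range(1, cutoff + 1):
                if i * i + j * j > cutoff * cutoff:
                    continue
                # Coefficient with variance sigma^2 / (i^2 + j^2), as in Eq. 5.
                a = sigma / (i * i + j * j) ** 0.5 * torch.randn(())
                d += a * torch.sin(torch.pi * i * uu) * torch.sin(torch.pi * j * vv)
        return d

    du, dv = field(), field()  # the sine basis gives zero displacement at the borders
    # Base sampling grid in [-1, 1] (grid_sample convention), plus displacement.
    grid_u = 2.0 * uu - 1.0 + du
    grid_v = 2.0 * vv - 1.0 + dv
    grid = torch.stack([grid_u, grid_v], dim=-1).unsqueeze(0)  # (1, H, W, 2)
    y = F.grid_sample(x.unsqueeze(0), grid, align_corners=True, padding_mode="border")
    return y.squeeze(0)

x = torch.rand(3, 224, 224)
x_aug = random_diffeomorphism(x, cutoff=4, sigma=0.05)
```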
Color domain Following a similar approach, we define the distribution of random color transformations as random mappings between color spaces

$$\gamma(\mathbf{x}) = \mathbf{x} + \sum_{i=1}^{K} a_i \odot \sin(\pi i\, \mathbf{x}), \tag{6}$$

where $a_i \sim \mathcal{N}(0, \sigma^2)$, with $\odot$ denoting elementwise multiplication. Again, $K$ controls the smoothness of the transformations and $\sigma$ their strength. Compared to Eq. 5, the coefficients in Eq. 6 are not weighted by the inverse of the frequency, and have constant variance. In practice, we observe that reducing the variance of the coefficients for higher frequencies creates color mappings that are too smooth and almost imperceptible, so we decided to drop this dependency.
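A minimal sketch of the color primitive follows, assuming pixel values in [0, 1] and one coefficient per frequency and channel; the band-limited variant described in Sec. 3 is omitted, and the names and default values are illustrative.

```python
import torch

def random_color_transform(x, cutoff=8, sigma=0.02):
    """Sketch of the color primitive of Eq. 6: perturb each pixel's RGB value with
    a random sine series of the value itself, using constant-variance coefficients.

    x: image tensor of shape (3, H, W) with values in [0, 1].
    cutoff: number of frequencies K; sigma: strength.
    """
    c = x.shape[0]
    # One coefficient per (frequency, channel), all with variance sigma^2.
    a = sigma * torch.randn(cutoff, c, 1, 1)
    y = x.clone()
    for i in range(1, cutoff + 1):
        # Elementwise product between the per-channel coefficient and the
        # sine of the scaled pixel intensities.
        y = y + a[i - 1] * torch.sin(torch.pi * i * x)
    return y.clamp(0.0, 1.0)

x = torch.rand(3, 224, 224)
x_aug = random_color_transform(x, cutoff=8, sigma=0.02)
```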
Finally, we note that PRIME is very flexible with respect to its core primitives. In particular, PRIME can be easily extended to include other distributions of maximum entropy transformations that suit an objective task. For example, one might add the distribution of maximum entropy additive perturbations given by $\pi(\mathbf{x}) = \mathbf{x} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Nonetheless, since most benchmarks of visual corruptions disallow the use of additive perturbations during training [21], we do not include an additive perturbation category.
Overall, as demonstrated by our results in Secs. 4.2 and 5.2, our model is very flexible and can cover a large part of the semantic-preserving distortions. It also allows us to easily control the strength and style of the transformations with just a few parameters. Moreover, changing the transformation strength enables us to control the trade-off between corruption robustness and standard accuracy, as shown in Sec. 5.3. In what follows, we use this model to design an efficient augmentation scheme to build classifiers robust to common corruptions.
3 PRIME: A simple augmentation scheme
We now introduce PRIME, a simple yet efficient augmentation scheme that uses our PRImitives of Maximum Entropy to confer robustness against common corruptions. The pseudo-code of PRIME is given in Algorithm 1: it draws a random sample from Eq. 1 by taking a convex combination of compositions of basic primitives. Below we describe the main implementation details.
Parameter selection It is important to ensure that the semantic information of an image is preserved after it goes through PRIME. As measuring semantic preservation quantitatively is not simple, we subjectively select each primitive’s parameters based on visual inspection, ensuring maximum permissible distortion while retaining the semantic content of the image. However, to avoid relying on a specific strength for each transformation, PRIME stochastically generates augmentations of different strengths by sampling from a uniform distribution, with different minimum and maximum values for each primitive. Figure 2 shows some visual examples for each kind of transformation, while additional visual examples along with the details of all the parameters can be found in Appendix 0.B.
For the color primitive, we observed that fairly large values of the cut-off $K$ are important for covering a large space of visual distortions. Unfortunately, implementing such a transformation can be memory inefficient. To avoid this issue, PRIME uses a slight modification of Eq. 6 and combines a fixed number of consecutive frequencies, randomly chosen within the admissible range.
Mixing transformations The concept of mixing has been a recurring theme in the augmentation literature [48, 47, 22, 44], and PRIME follows the same trend. In particular, Algorithm 1 uses a convex combination of basic augmentations, each consisting of a composition of our primitive transformations. In general, the convex mixing procedure (i) broadens the set of possible training augmentations, and (ii) ensures that the augmented image stays close to the original one. We provide empirical results underlining the efficacy of mixing in Sec. 5.2, and the exact mixing parameters in Appendix 0.B. Note that the basic skeleton of PRIME is similar to that of AugMix. However, as we will see next, incorporating our maximum entropy transformations leads to significant gains in common corruption robustness over AugMix.
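As an illustration, the following sketch mixes a few compositions of primitives with the clean image, in the spirit of Algorithm 1. The Dirichlet mixing weights and the width/depth defaults are assumptions for illustration (the exact mixing parameters are given in Appendix 0.B), and the primitive functions refer to the sketches shown in Sec. 2.

```python
import random
import torch

def prime_augment(x, primitives, width=3, depth=2):
    """Sketch of the mixing scheme: combine `width` augmented copies of x, each
    obtained by composing up to `depth` primitives drawn at random.

    x: image tensor; primitives: list of callables mapping images to images.
    """
    identity = lambda y: y
    candidates = list(primitives) + [identity]  # identity => shallower compositions
    weights = torch.distributions.Dirichlet(torch.ones(width + 1)).sample()
    mixed = weights[0] * x  # keep a share of the clean image
    for i in range(width):
        y = x
        for _ in range(depth):
            y = random.choice(candidates)(y)
        mixed = mixed + weights[i + 1] * y
    return mixed.clamp(0.0, 1.0)

# Usage with the primitive sketches from Sec. 2:
# x_aug = prime_augment(x, [random_spectral_transform,
#                           random_diffeomorphism,
#                           random_color_transform])
```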

4 Performance analysis
In this section, we compare the classification performance of our method on multiple datasets with that of two current approaches: AugMix and DeepAugment (DA). We illustrate that PRIME significantly advances the corruption robustness over that of AugMix and DeepAugment on all the benchmarks. We also show that our method yields additional benefits when employed in concert with unsupervised domain adaptation [39].
| Dataset | Method | Clean Acc (%) | CC Acc (%) | CC mCE (%) |
|---|---|---|---|---|
| C-10 | Standard | 95.0 | 74.0 | 24.0 |
| | AugMix | 95.2 | 88.6 | 11.4 |
| | PRIME | 94.2 | 89.8 | 10.2 |
| C-100 | Standard | 76.7 | 51.9 | 48.1 |
| | AugMix | 78.2 | 64.9 | 35.1 |
| | PRIME | 78.4 | 68.2 | 31.8 |
| IN-100 | Standard | 88.0 | 49.7 | 100.0 |
| | AugMix | 88.7 | 60.7 | 79.1 |
| | DA | 86.3 | 67.7 | 68.1 |
| | PRIME | 85.9 | 71.6 | 61.0 |
| | DA+AugMix | 86.5 | 73.1 | 57.3 |
| | DA+PRIME | 84.9 | 74.9 | 54.6 |
| IN | Standard∗ | 76.1 | 38.1 | 76.7 |
| | AugMix∗ | 77.5 | 48.3 | 65.3 |
| | DA∗ | 76.7 | 52.6 | 60.4 |
| | PRIME† | 77.0 | 55.0 | 57.5 |
| | DA+AugMix | 75.8 | 58.1 | 53.6 |
| | DA+PRIME† | 75.5 | 59.9 | 51.3 |
4.1 Training setup
We consider the CIFAR-10 (C-10), CIFAR-100 (C-100) [25], ImageNet-100 (IN-100) and ImageNet (IN) [11] datasets. IN-100 is a 100-class subset of IN obtained by selecting every 10th class in WordNet ID order. We train a ResNet-18 [19] on C-10, C-100 and IN-100, and a ResNet-50 on IN. Following AugMix, and for a complete comparison, we also integrate the Jensen-Shannon divergence (JSD)-based consistency loss in PRIME, which compels the network to learn similar representations for differently augmented versions of the same input image. The detailed training setup appears in Appendix 0.C. We evaluate our trained models on the commonly corrupted versions (C-10-C, C-100-C, IN-100-C, IN-C) of the aforementioned datasets. The common corruptions [21] constitute 15 image distortions, each applied with 5 different severity levels. These corruptions can be grouped into four categories, viz. noise, blur, weather and digital.
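For reference, here is a minimal sketch of the JSD consistency loss, following the formulation popularized by AugMix and used here with PRIME; the loss weight in the usage comment follows the AugMix implementation and is only indicative.

```python
import torch
import torch.nn.functional as F

def jsd_consistency_loss(logits_clean, logits_aug1, logits_aug2):
    """Jensen-Shannon consistency loss between the predictive distributions of a
    clean image and two independently augmented views of it."""
    p_clean = F.softmax(logits_clean, dim=1)
    p_aug1 = F.softmax(logits_aug1, dim=1)
    p_aug2 = F.softmax(logits_aug2, dim=1)
    # Mixture distribution M (clamped for numerical stability), in log space.
    log_m = ((p_clean + p_aug1 + p_aug2) / 3.0).clamp(1e-7, 1.0).log()
    jsd = (F.kl_div(log_m, p_clean, reduction="batchmean")
           + F.kl_div(log_m, p_aug1, reduction="batchmean")
           + F.kl_div(log_m, p_aug2, reduction="batchmean")) / 3.0
    return jsd

# Typical use (weight of 12 as in the AugMix implementation):
# loss = F.cross_entropy(logits_clean, targets) + 12.0 * jsd_consistency_loss(
#     logits_clean, logits_aug1, logits_aug2)
```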
4.2 Robustness to common corruptions
In order to assess the effectiveness of PRIME, we evaluate its performance against C-10, C-100, IN-100 and IN common corruptions. The results are summarized in Tab. 1 (we provide the per-corruption performance of every method in Appendix 0.H). Amongst individual methods, PRIME yields superior results compared to those obtained by AugMix and DeepAugment alone and advances the baseline performance on the corrupted counterparts of the four datasets. As listed, PRIME pushes the corruption accuracy by 1.2% and 3.3% over AugMix on C-10-C and C-100-C respectively. On IN-100-C, a more complicated dataset, we observe significant improvements wherein PRIME outperforms AugMix by 10.9%. In fact, this increase in performance hints that our primitive transformations are actually able to cover a larger space of image corruptions, compared to the restricted set of AugMix. Interestingly, the random transformations in PRIME also lead to a boost in corruption accuracy over DeepAugment, despite the fact that DeepAugment leverages additional knowledge to augment the training data via its use of pre-trained architectures. Moreover, PRIME provides cumulative gains when combined with DeepAugment, reducing the mean corruption error (mCE) of the prior art (DA+AugMix) by 2.7% on IN-100-C. Lastly, we also evaluate the performance of PRIME on full IN-C, where we do not use the JSD loss in order to reduce computational complexity. Yet, even without the JSD loss, PRIME outperforms, in terms of corruption accuracy, both AugMix (with JSD) and DeepAugment by 6.7% and 2.4% respectively, while the mCE is reduced by 7.8% and 2.9%. And last, when PRIME is combined with DeepAugment, it also surpasses the performance of DA+AugMix (with JSD), reaching a corruption accuracy of almost 60% and an mCE of 51.3%. Note that PRIME not only achieves superior robustness, but it also does so efficiently: compared to standard training on IN-100, AugMix requires 1.20x time and PRIME requires 1.27x. In contrast, DA is tedious and we do not measure its runtime, since it also requires the training of two large image-to-image networks for producing augmentations and can only be applied offline.
4.3 Unsupervised domain adaptation
| Method | IN-100-C acc. (%): w/o | IN-100-C: single | IN-100-C: partial | IN-100-C: full | IN-100 acc. (%): single |
|---|---|---|---|---|---|
| Standard | 49.7 | 53.8 | 62.0 | 63.9 | 88.1 |
| AugMix | 60.7 | 65.5 | 71.3 | 73.0 | 88.3 |
| DA | 67.7 | 70.2 | 72.7 | 74.6 | 86.3 |
| PRIME | 71.6 | 73.5 | 75.3 | 76.6 | 85.7 |
Recently, robustness to common corruptions has also been of significant interest in the field of unsupervised domain adaptation [2, 39]. The main difference is that, in domain adaptation, one exploits the limited access to test-time corrupted samples to adjust certain network parameters. Hence, it would be interesting to investigate the utility of PRIME under the setting of domain adaption.
To that end, we combine our method with the adaptation trick of [39]. Specifically, we adjust the batch normalization (BN) statistics of our models using a few corrupted samples. Suppose $\mu_{\text{tr}}$, $\sigma^2_{\text{tr}}$ are the BN mean and variance estimated from the training data, and $\mu_{\text{ts}}$, $\sigma^2_{\text{ts}}$ are the corresponding statistics computed from $n$ unlabelled, corrupted test samples. Then, given a pseudo sample size $N$ for the training statistics, we re-estimate the BN statistics as

$$\bar{\mu} = \frac{N}{N+n}\,\mu_{\text{tr}} + \frac{n}{N+n}\,\mu_{\text{ts}}, \qquad \bar{\sigma}^2 = \frac{N}{N+n}\,\sigma^2_{\text{tr}} + \frac{n}{N+n}\,\sigma^2_{\text{ts}}. \tag{7}$$

We consider three adaptation scenarios: single sample, partial, and full adaptation, which differ in the number $n$ of corrupted samples used. Here, we do not perform parameter tuning for $N$. As shown in Tab. 2, simply correcting the BN statistics with a few corrupted samples pushes the corruption accuracy of PRIME from 71.6% up to 76.6%. In general, PRIME yields cumulative gains in combination with adaptation and has the best IN-100-C accuracy.
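A minimal sketch of the BN adaptation of Eq. 7 in PyTorch follows; the pseudo sample size and the data-loader interface are illustrative assumptions, and the model and loader are expected to live on the same device.

```python
import torch

@torch.no_grad()
def adapt_bn_statistics(model, corrupted_loader, n_pseudo=16):
    """Blend training-time BN statistics with statistics estimated on a few
    unlabelled corrupted samples, as in Eq. 7 [39]."""
    model.eval()
    bn_layers = [m for m in model.modules()
                 if isinstance(m, torch.nn.modules.batchnorm._BatchNorm)]
    # Save the source (training) statistics.
    source = [(m.running_mean.clone(), m.running_var.clone()) for m in bn_layers]

    # Re-estimate statistics on corrupted data: momentum=None makes the running
    # statistics a cumulative average over the forwarded batches.
    for m in bn_layers:
        m.reset_running_stats()
        m.momentum = None
        m.train()
    n_test = 0
    for x, _ in corrupted_loader:
        model(x)
        n_test += x.shape[0]
    model.eval()

    # Convex combination of source and test statistics (Eq. 7).
    w_src = n_pseudo / (n_pseudo + n_test)
    for m, (mu_src, var_src) in zip(bn_layers, source):
        m.running_mean.copy_(w_src * mu_src + (1 - w_src) * m.running_mean)
        m.running_var.copy_(w_src * var_src + (1 - w_src) * m.running_var)
```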
5 Robustness insights using PRIME
In this section, we exploit the simplicity and the controllable nature of PRIME to investigate different aspects behind robustness to common corruptions. We first analyze how each transformation domain contributes to the overall robustness of the network. Then, we empirically locate and justify the benefits of mixing the transformations of each domain. Moreover, we demonstrate the existence of a robustness-accuracy trade-off, and, finally, we comment on the low-complexity benefits of PRIME in different data augmentation settings.
5.1 Contribution of transformations
| Transform | IN-100-C | Noise | Blur | Weather | Digital | IN-100 |
|---|---|---|---|---|---|---|
| None | 49.7 | 27.3 | 48.6 | 54.8 | 62.6 | 88.0 |
| Spectral (ω) | 64.1 | 60.7 | 55.4 | 66.6 | 72.9 | 87.3 |
| Spatial (τ) | 53.8 | 30.1 | 56.2 | 57.6 | 65.4 | 87.0 |
| Color (γ) | 59.9 | 67.4 | 52.6 | 54.4 | 67.1 | 86.9 |
| ω + τ | 64.5 | 58.5 | 57.3 | 66.8 | 73.9 | 87.7 |
| ω + γ | 67.5 | 77.2 | 55.7 | 65.3 | 74.2 | 87.1 |
| τ + γ | 63.3 | 74.7 | 57.4 | 56.2 | 67.8 | 86.2 |
| ω + τ + γ | 68.8 | 78.8 | 58.3 | 66.0 | 74.8 | 87.1 |
We want to understand how the transformations in each domain of Eq. 1 contribute to the overall robustness. To that end, we conduct an ablation study on IN-100-C by training a ResNet-18 with the max-entropy transformations of PRIME, individually or in combination. As shown in Tab. 3, spectral transformations mainly help against blur, weather and digital corruptions. Spatial operations also improve on blurs, but on elastic transforms as well (digital). On the contrary, color transformations excel on noises and certain high-frequency digital distortions, e.g., pixelate and JPEG artefacts, and have a minor effect on weather changes. Besides, incrementally combining the transformations leads to cumulative gains, e.g., spatial+color helps on both noises and blurs. Yet, obtaining the best results requires the combination of all transformations. This means that each transformation increases the coverage over the space of possible distortions, and the increase in robustness comes from their cumulative contribution.
5.2 The role of mixing
In most data augmentation methods, besides the importance of the transformations themselves, mixing has been claimed as an essential module for increasing diversity in the training process [48, 47, 22, 44]. In our attempt to provide insights on the role of mixing in the context of common corruptions, we found that it is capable of constructing augmented images that look perceptually similar to their corrupted counterparts. In fact, the improvements on specific corruption types observed in Tab. 3 can be largely attributed to mixing. As exemplified in Fig. 3, careful combinations of spectral transformations with the clean image introduce brightness and contrast-like artefacts that look similar to the corresponding corruptions in IN-C. Also, combining spatial transformations creates blur-like artefacts that look identical to zoom blur in IN-C. Finally, notice how mixing color transformations helps fabricate corruptions of the “noise” category. This means that the max-entropy color model of PRIME enables robustness to different types of noise without explicitly adding any noise during training.




Note that one of the main goals of data augmentation is to achieve maximum coverage of the space of possible distortions using a limited transformation budget, i.e., within a few training epochs. The principle of max-entropy guarantees this within each primitive, but the effect of mixing on the overall space is harder to quantify. In this regard, we can use the cosine distance in the embedding space of a SimCLRv2 [7] model, denoted $\phi(\cdot)$, as a proxy for visual similarity [49, 30]. We are interested in measuring how mixing the base transformations changes the likelihood that an augmentation scheme generates some sample during training that is visually similar to some of the common corruptions. To that end, we randomly select a set of clean training images $\mathbf{x}$ from IN, along with their associated common corruptions $\mathbf{x}^{(c)}$ (15 corruptions at 5 severity levels), and generate for each of the clean images a fixed budget of transformed samples $\mathcal{A}(\mathbf{x})$ using each augmentation scheme. Moreover, for each corruption we find its closest neighbor from the set of generated samples using the cosine distance in the embedding space. Our overall measure of fitness is

$$d_{\min}\big(\mathbf{x}^{(c)}\big) = \min_{\mathbf{x}' \in \mathcal{A}(\mathbf{x})} \left( 1 - \frac{\big\langle \phi(\mathbf{x}^{(c)}), \phi(\mathbf{x}') \big\rangle}{\big\|\phi(\mathbf{x}^{(c)})\big\|_2\, \big\|\phi(\mathbf{x}')\big\|_2} \right), \tag{8}$$

whose average and median we report over all corrupted images.
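A minimal sketch of how the per-corruption minimum cosine distance of Eq. 8 can be computed from precomputed embeddings is given below; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def min_cosine_distances(corrupted_feats, augmented_feats):
    """For every corrupted image, the cosine distance to its nearest neighbour
    among the augmented samples generated from the same clean image (Eq. 8).

    corrupted_feats: (n_corr, d) embeddings (e.g., SimCLRv2 features) of the
    corrupted versions of one clean image.
    augmented_feats: (n_aug, d) embeddings of the augmented samples of that image.
    """
    c = F.normalize(corrupted_feats, dim=1)
    a = F.normalize(augmented_feats, dim=1)
    cos_dist = 1.0 - c @ a.t()          # (n_corr, n_aug) pairwise cosine distances
    return cos_dist.min(dim=1).values   # distance to the closest augmentation

# Aggregation over all clean images, e.g.:
# d = torch.cat([min_cosine_distances(cf, af) for cf, af in pairs])
# print(d.mean(), d.median())
```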
| Method | Avg. min. cosine distance (↓) | Median min. cosine distance (↓) |
|---|---|---|
| None (clean) | 25.38 | 6.44 |
| AugMix (w/o mix) | 20.57 | 3.56 |
| PRIME (w/o mix) | 10.61 | 1.88 |
| AugMix | 17.48 | 2.61 |
| PRIME | 7.71 | 1.61 |
Table 4 shows the values of this measure applied to AugMix and PRIME, with and without mixing. For reference, we also report the values obtained with the clean (non-transformed) images. More percentile scores can be found in Appendix 0.F. Clearly, mixing helps reduce the distance between the common corruptions and the augmented samples for both methods. We also observe that PRIME, even with a number of augmentations per image in the order of the number of training epochs, can generate samples that are twice as close to the common corruptions as those of AugMix. In fact, the feature similarity between training augmentations and test corruptions was also studied in [29], in an attempt to justify the good performance of AugMix on C-10. Yet, we see that the fundamental transformations of AugMix are not enough to span a broad space guaranteeing high perceptual similarity to IN-C. The significant difference in perceptual similarity between AugMix and PRIME in Tab. 4 may explain the superior performance of PRIME on IN-100-C and IN-C (cf. Tab. 1). A visualization of the augmented space using PCA can be found in Appendix 0.G.
5.3 Robustness vs. accuracy trade-off
An important phenomenon observed in the literature of adversarial robustness is the so-called robustness-accuracy trade-off [16, 43, 35]: adversarial training [27] with smaller perturbations (typically a smaller radius $\epsilon$) results in models with higher standard but lower adversarial accuracy, and vice versa. In this sense, we want to understand if the strength of the image transformations introduced through data augmentation in PRIME can also cause such a phenomenon in the context of robustness to common corruptions. As described in Sec. 2, each of the transformations of PRIME has a strength parameter $\sigma$, which can be seen as the analogue of $\epsilon$ in adversarial robustness. Hence, we can easily reduce or increase the strength of the transformations by scaling the maximum strength of each primitive by a common factor $\alpha > 0$. Then, by training a network for different values of $\alpha$ we can monitor its accuracy on the clean and the corrupted datasets.
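As a small sketch of this strength sweep, one can scale the per-primitive maximum strengths by a common log-spaced factor $\alpha$; the ranges, number of points, and dictionary keys below are illustrative.

```python
import numpy as np

# Log-spaced values of the global strength scale alpha (illustrative range).
alphas = np.logspace(-1.0, 1.0, num=7)

def scaled_strengths(max_sigmas, alpha):
    """Scale the maximum strength of every primitive by a common factor alpha."""
    return {name: alpha * sigma for name, sigma in max_sigmas.items()}

# e.g., scaled_strengths({"spectral": 0.8, "spatial": 0.05, "color": 0.02}, alphas[3])
```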
We train a ResNet-18 on C-10 and IN-100 using the setup of Sec. 4.1. To reduce complexity, we do not use the JSD loss and train for fewer epochs. This sub-optimal setting could cause some performance drop compared to the results of Tab. 1, but we expect the overall trends in terms of accuracy and robustness to be preserved. Regarding the scaling of the transformation strengths, for both C-10 and IN-100 we sample a set of values of $\alpha$ spaced evenly on a log-scale over a fixed range.


The results are presented in Fig. 4. For both C-10 and IN-100, there seems to be a sweet spot for the scale $\alpha$ where the accuracy on common corruptions reaches its maximum. For $\alpha$ smaller than this value, we observe a clear trade-off between validation and robust accuracy: while the robustness to common corruptions increases, the validation accuracy decays. However, for $\alpha$ greater than the sweet-spot value, the trade-off ceases to exist, since both the validation and robust accuracy present similar behaviour (a slight decay). In fact, these observations indicate that robust and validation accuracies are not always positively correlated, and that one might have to slightly sacrifice validation accuracy in order to achieve robustness.
5.4 Sample complexity
Finally, we investigate the necessity of performing augmentation during training (on-line augmentation), compared to statically augmenting the dataset before training (off-line augmentation). On the one hand, on-line augmentation is useful when the dataset is huge and storing augmented versions requires a lot of memory. Besides, there are cases where offline augmentation is not feasible as it relies on pre-trained or generative models which are unavailable in certain scenarios, e.g., DeepAugment [20] or AdA [5] cannot be applied on C-100. On the other hand, off-line augmentation may be necessary to avoid the computational cost of generating augmentations during training.
To this end, we augment each of the C-10 and IN-100 training sets off-line with $N$ i.i.d. PRIME-transformed versions of every image. Afterwards, for different values of $N$, we train a ResNet-18 on the corresponding augmented dataset and report the accuracy on the validation set and the common corruptions. For the training setup, we follow the settings of Sec. 4.1, but without the JSD loss. Also, since we increase the size of the training set, we divide the number of training epochs by the same factor, in order to keep the same overall number of gradient updates.


The performance on common corruptions is presented in Fig. 5. The first thing to notice is that, even for the smallest number of augmented copies, the obtained robustness to common corruptions is already quite good; for IN-100, the accuracy is already better than that of AugMix (60.7% with the JSD loss, cf. Tab. 1). Regarding C-10, we observe that for larger $N$ the difference with respect to on-line augmentation becomes almost negligible, especially considering the overhead of transforming the data at every epoch. Technically, this means that augmenting C-10 with a few PRIME counterparts per image is enough for achieving good robustness to common corruptions. Finally, we also see in Fig. 5 that the corruption accuracy on IN-100 improves very slowly beyond a certain $N$. Comparing the accuracy at that point to the one obtained with on-line augmentation and without JSD (68.8%, cf. Tab. 3), we still observe a gap. Hence, given the cost of on-line augmentation on such large-scale datasets, simply augmenting the training set with extra PRIME samples presents a good compromise for achieving competitive robustness. Nevertheless, the increase introduced by on-line augmentation is rather significant, hinting that generating transformed samples during training might be necessary for maximizing performance. In this regard, the low computational complexity of PRIME allows it to easily achieve this gain through on-line augmentation, since it requires only 1.27x the training time of standard training (cf. Sec. 4.2), marginally more than AugMix (1.20x), but with much better performance. This can be a significant advantage with respect to complex methods, like DeepAugment, that cannot even be applied on-line (they require heavy pre-training).
6 Related work
Common corruptions Towards evaluating the robustness of deep neural networks (DNNs) to natural distribution shifts, the authors in [21] proposed common corruptions benchmarks (CIFAR-10-C and ImageNet-C) constituting 15 realistic image distortions. Later studies [20] considered the example of blurring and demonstrated that performance improvements on these common corruptions do generalize to real-world images, which supports the use of common corruptions benchmarks. Recent work [29] showed that current augmentation techniques undergo a performance degradation when evaluated on corruptions that are perceptually dissimilar from those in ImageNet-C. In addition to common corruptions, current literature studies other benchmarks e.g., adversarially filtered data [23], artistic renditions [20] and in-domain datasets [36]. In Appendix 0.J, we show that PRIME also improves robustness on these benchmarks.
Improving corruption robustness Data augmentation has been the central pillar for improving the generalization of DNNs [12, 48, 10, 47, 26]. A notable augmentation scheme for endowing corruption robustness is AugMix [22], which employs a careful combination of stochastic augmentation operations and mixing. AugMix attains significant gains on CIFAR-10-C, but it does not perform as well on larger benchmarks like ImageNet-C. DeepAugment (DA) [20] addresses this issue and diversifies the space of augmentations by introducing distorted images computed by perturbing the weights of image-to-image networks. DA, combined with AugMix, achieves the current state-of-the-art on ImageNet-C. Other schemes include: (i) worst-case noise training [37] or data augmentation through Fourier-based operations [41], (ii) inducing shape bias through stylized images [17], (iii) adversarial counterparts of DeepAugment [5] and AugMix [44], (iv) pre-training and/or adversarial training [46, 24], (v) constraining the total variation of convolutional layers [38] or compressing the model [13], and (vi) learning the image information in the phase rather than the amplitude [6]. Besides, Vision Transformers [15] have been shown to be more robust to common corruptions than standard CNNs [3, 31] when trained on big data. It would thus be interesting to study the effect of extra data alongside PRIME in future works. Finally, unsupervised domain adaptation [2, 39] using a few corrupted samples has also been shown to provide a considerable boost in corruption robustness. Nonetheless, domain adaptation is orthogonal to this work, as it requires knowledge of the target distribution.
7 Concluding remarks
We took a systematic approach to understand the notion of common corruptions and formulated a universal model that encompasses a wide variety of semantic-preserving image transformations. We then proposed a novel data augmentation scheme called PRIME, which instantiates our model of corruptions, to confer robustness against common corruptions. From a practical perspective, our method is principled yet efficient and can be conveniently incorporated into existing training procedures. Moreover, it yields a strong baseline on existing corruption benchmarks, outperforming current standalone methods. Additionally, our thorough ablations demonstrate that diversity among basic augmentations (primitives) – which AugMix and other approaches lack – is essential, and that mixing plays a crucial role in the success of both prior methods and PRIME. In general, while complicated methods like DeepAugment perform well, it is difficult to understand, ablate and apply them online. Instead, we show that a simple model-based stance with a few guiding principles can be used to build a very effective augmentation scheme that can be easily understood, ablated and tuned. We believe that our insights and PRIME pave the way for building robust models in real-life scenarios. PRIME, for instance, provides a ready-to-use recipe for data-scarce domains such as medical imaging.
Acknowledgments
We thank Alessandro Favero for the fruitful discussions and feedback. This work has been partially supported by the CHIST-ERA program under Swiss NSF Grant 20CH21_180444, and partially by Google via a Postdoctoral Fellowship and a GCP Research Credit Award.
References
- [1] Beale, P.: Statistical Mechanics. Elsevier (1996)
- [2] Benz, P., Zhang, C., Karjauv, A., Kweon, I.S.: Revisiting batch normalization for improving corruption robustness. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021)
- [3] Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of Transformers for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
- [4] Binder, F., Aichinger, E., Ecker, J., Nöbauer, C., Mayr, P.: Algorithms for near-rings of non-linear transformations. In: Proceedings of the International Symposium on Symbolic and Algebraic Computation. Association for Computing Machinery (2000)
- [5] Calian, D.A., Stimberg, F., Wiles, O., Rebuffi, S.A., Gyorgy, A., Mann, T., Gowal, S.: Defending against image corruptions through adversarial augmentations. arXiv preprint arXiv:2104.01086 (2021)
- [6] Chen, G., Peng, P., Ma, L., Li, J., Du, L., Tian, Y.: Amplitude-phase recombination: Rethinking robustness of convolutional neural networks in frequency domain. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
- [7] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. In: Advances in Neural Information Processing Systems (2020)
- [8] Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience (2006)
- [9] Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., Hein, M.: Robustbench: a standardized adversarial robustness benchmark. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2021)
- [10] Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation strategies from data. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
- [11] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
- [12] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
- [13] Diffenderfer, J., Bartoldson, B.R., Chaganti, S., Zhang, J., Kailkhura, B.: A winning hand: Compressing deep networks can improve out-of-distribution robustness. In: Advances in Neural Information Processing Systems (Dec 2021)
- [14] Dodge, S., Karam, L.: Understanding how image quality affects deep neural networks. In: 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX) (2016)
- [15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
- [16] Fawzi, A., Fawzi, O., Frossard, P.: Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning 107(3), 481–508 (2018)
- [17] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (2019)
- [18] Geirhos, R., Temme, C.R.M., Rauber, J., Schütt, H.H., Bethge, M., Wichmann, F.A.: Generalisation in humans and deep neural networks. In: Advances in Neural Information Processing Systems (2018)
- [19] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (2016)
- [20] Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., Gilmer, J.: The many faces of robustness: A critical analysis of out-of-distribution generalization. In: IEEE Conference on Computer Vision and Pattern Recognition (2021)
- [21] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (2019)
- [22] Hendrycks*, D., Mu*, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: Augmix: A simple method to improve robustness and uncertainty under data shift. In: International Conference on Learning Representations (2020)
- [23] Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
- [24] Kireev, K., Andriushchenko, M., Flammarion, N.: On the effectiveness of adversarial training against common corruptions. arXiv preprint arXiv:2103.02325 (2021)
- [25] Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
- [26] Lopes, R.G., Yin, D., Poole, B., Gilmer, J., Cubuk, E.D.: Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611 (2019)
- [27] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations (Apr 2018)
- [28] Masiha, M.S., Gohari, A., Yassaee, M.H., Aref, M.R.: Learning under distribution mismatch and model misspecification. In: IEEE International Symposium on Information Theory, (ISIT) (2021)
- [29] Mintun, E., Kirillov, A., Xie, S.: On interaction between augmentations and corruptions in natural corruption robustness. arXiv preprint arXiv:2102.11273 (2021)
- [30] Moayeri, M., Feizi, S.: Sample efficient detection and classification of adversarial attacks via self-supervised embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
- [31] Morrison, K., Gilby, B., Lipchak, C., Mattioli, A., Kovashka, A.: Exploring corruption robustness: Inductive biases in vision transformers and mlp-mixers. arXiv preprint arXiv:2106.13122 (2021)
- [32] Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR (1983)
- [33] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (2019)
- [34] Petrini, L., Favero, A., Geiger, M., Wyart, M.: Relative stability toward diffeomorphisms indicates performance in deep nets. In: Advances in Neural Information Processing Systems (2021)
- [35] Raghunathan, A., Xie, S.M., Yang, F., Duchi, J., Liang, P.: Understanding and mitigating the tradeoff between robustness and accuracy. In: Proceedings of the 37th International Conference on Machine Learning (Jul 2020)
- [36] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: Proceedings of the 36th International Conference on Machine Learning (2019)
- [37] Rusak, E., Schott, L., Zimmermann, R.S., Bitterwolf, J., Bringmann, O., Bethge, M., Brendel, W.: A simple way to make neural networks robust against diverse image corruptions. In: Computer Vision – ECCV 2020 (2020)
- [38] Saikia, T., Schmid, C., Brox, T.: Improving robustness against common corruptions with frequency biased models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
- [39] Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., Bethge, M.: Improving robustness against common corruptions by covariate shift adaptation. In: Advances in Neural Information Processing Systems (2020)
- [40] Smith, L.N., Topin, N.: Super-convergence: Very fast training of residual networks using large learning rates. arXiv preprint arXiv:1708.07120 (2018)
- [41] Sun, J., Mehra, A., Kailkhura, B., Chen, P.Y., Hendrycks, D., Hamm, J., Mao, Z.M.: Certified adversarial defenses meet out-of-distribution corruptions: Benchmarking robustness and simple baselines. arXiv preprint arXiv:2112.00659 (2021)
- [42] Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., Schmidt, L.: Measuring robustness to natural distribution shifts in image classification. In: Advances in Neural Information Processing Systems (2020)
- [43] Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be at odds with accuracy. In: International Conference on Learning Representations (May 2019)
- [44] Wang, H., Xiao, C., Kossaifi, J., Yu, Z., Anandkumar, A., Wang, Z.: Augmax: Adversarial composition of random augmentations for robust training. In: Advances in Neural Information Processing Systems (2021)
- [45] Xu, A., Raginsky, M.: Information-theoretic analysis of generalization capability of learning algorithms. In: Advances in Neural Information Processing Systems (2017)
- [46] Yi, M., Hou, L., Sun, J., Shang, L., Jiang, X., Liu, Q., Ma, Z.: Improved OOD generalization via adversarial training and pretraining. In: Proceedings of the 38th International Conference on Machine Learning (2021)
- [47] Yun, S., Han, D., Chun, S., Oh, S.J., Yoo, Y., Choe, J.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: 2019 IEEE/CVF International Conference on Computer Vision (2019)
- [48] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
- [49] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
Appendix 0.A Maximum entropy transformations
To guarantee as much diversity as possible in our model of common corruptions, we follow the principle of maximum entropy to define our distributions of transformations [8]. Note that using a set of augmentations that guarantees maximum entropy comes naturally when trying to optimize the sample complexity derived from certain information-theoretic generalization bounds, both in the clean [45] and corrupted settings [28]. Specifically, the principle of maximum entropy postulates favoring those distributions that are as unbiased as possible given the set of constraints that defines a family of distributions. In our case, these constraints are given in the form of an expected strength $\sigma$, a desired smoothness $K$, and/or some boundary conditions, e.g., the displacement field must be zero at the borders of an image.
Let us make this formal. In particular, let $\mathcal{X}$ denote the space of all images, and let $g \sim \mu$ denote a random image transformation distributed according to the law $\mu$. Further, let us define a set of constraints $\mathcal{C} \subseteq \mathcal{F}$, which restricts the domain of applicability of $\mu$, i.e., $\operatorname{supp}(\mu) \subseteq \mathcal{C}$, where $\mathcal{F}$ denotes the space of functions $\mathcal{X} \to \mathcal{X}$. The principle of maximum entropy postulates using the distribution $\mu^\star$ which has maximum entropy given the constraints:

$$\mu^\star = \operatorname*{arg\,max}_{\mu} \; H(\mu) \quad \text{subject to} \quad \operatorname{supp}(\mu) \subseteq \mathcal{C}, \tag{9}$$

where $H(\mu)$ represents the entropy of the distribution $\mu$ [8]. In its general form, solving Eq. 9 for any set of constraints is intractable. However, leveraging results from statistical physics, we will see that for our domains of interest, Eq. 9 has a simple solution. In what follows, we derive those distributions for each of our families of transformations.
0.A.1 Spectral domain
As we introduced in Sec. 2, we propose to parameterize our family of spectral transformations using an FIR filter $h_\omega$ of size $K \times K$. That is, we are interested in finding a maximum entropy distribution over the space of spectral transformations with a finite spatial support.

Nevertheless, on top of this smoothness constraint we are also interested in controlling the strength of the transformations. We define the strength of a distribution of random spectral transformations applied to an image $\mathbf{x}$ as the expected squared norm of the difference between the clean and transformed images, i.e.,

$$\mathbb{E}\,\big\|\omega(\mathbf{x}) - \mathbf{x}\big\|_2^2 = \mathbb{E}\,\big\|h_\omega * \mathbf{x}\big\|_2^2, \tag{10}$$

which, using Young's convolution inequality, is bounded as

$$\mathbb{E}\,\big\|h_\omega * \mathbf{x}\big\|_2^2 \leq \|\mathbf{x}\|_1^2\; \mathbb{E}\,\big\|h_\omega\big\|_2^2. \tag{11}$$

Indeed, we can see that the strength of a distribution of random smooth spectral transformations is governed by the expected norm of its filter. In the discrete domain, this can be simply computed as

$$\mathbb{E}\,\big\|h_\omega\big\|_2^2 = \sum_{i,j=0}^{K-1} \mathbb{E}\, h_\omega[i,j]^2. \tag{12}$$

Considering this, we should then look for a maximum entropy distribution whose samples satisfy

$$\sum_{i,j=0}^{K-1} h_\omega[i,j]^2 = \sigma^2. \tag{13}$$

Now, note that this set is defined by an equality constraint involving a sum of quadratic random variables. In this sense, we know that the Equipartition Theorem [1] applies and can be used to identify the distribution of maximum entropy. That is, the solution of Eq. 9, in the case that the constraint set $\mathcal{C}$ is given by Eq. 13, is equal to the distribution of FIR filters whose coefficients are i.i.d. with law $\mathcal{N}(0, \sigma^2/K^2)$.
0.A.2 Spatial domain
The distribution of diffeomorphisms of maximum entropy with a fixed norm was derived by Petrini et al. in [34]. The derivation is similar to the spectral domain, but with the additional constraint that the diffeomorphisms produce a null displacement at the borders of the image.
0.A.3 Color domain
We can follow a very similar route to derive the distribution of maximum entropy among all color transformations, where, specifically, we constrain the transformations to satisfy fixed boundary conditions on every channel independently. Doing so, the derivation of the maximum entropy distribution can follow the same steps as in [34].
Appendix 0.B PRIME implementation details
In this section, we provide additional details regarding the implementation of PRIME described in Sec. 3. Since the parameters of the transformations are empirically selected, we first provide more visual examples for different values of the smoothness $K$ and strength $\sigma$. Then, we give the exact values of the parameters we use in our experiments, supported by additional visual examples, and we also describe the parameters we use for the mixing procedure.
0.B.1 Additional transformed examples
We provide additional visual examples for each of the primitives of PRIME, illustrating the effect of the following two factors on the resulting transformed images: (i) the smoothness, controlled by the parameter $K$, and (ii) the strength $\sigma$ of the transformation. Figs. 6, 7 and 8 demonstrate the resulting spectrum of images created by applying spectral, spatial and color transformations while varying the parameters $K$ and $\sigma$. Notice how increasing the strength of each transformation drifts the augmented image farther away from its clean counterpart, yet produces plausible images when appropriately controlled.



0.B.2 Transformation parameters
We now provide the parameters of each transform that we selected and used in our experiments. In general, the values might vary for inputs of different dimensionality and resolution (i.e., CIFAR-10/100 vs ImageNet images).
0.B.2.1 Spectral transform
Regarding the spectral transform of Eq. 3, we found that a small FIR filter size results in semantically preserving images for CIFAR-10/100 and ImageNet. For the latter, one can stretch the filter size a bit further, but then slight changes in the strength $\sigma$ might destroy the image semantics. Eventually, given the chosen filter size, we observed that a moderate value of $\sigma$ is good enough for both CIFAR-10/100 and ImageNet.
0.B.2.2 Spatial transform
Concerning the spatial transform of Eq. 5, for the cut-off parameter $K$ we followed the value regimes proposed by Petrini et al. [34], with different values for CIFAR-10/100 and for ImageNet. Furthermore, for a given cut-off, Petrini et al. also compute the appropriate bounds for the transformation strength $\sigma$, such that the resulting diffeomorphism remains bijective and the pixel displacement does not destroy the image. In fact, in their original implementation (available at https://github.com/pcsl-epfl/diffeomorphism), Petrini et al. directly sample the strength within these bounds instead of explicitly setting it. In our implementation, we follow the same approach.
0.B.2.3 Color transform
Regarding the color transform of Eq. 6, we found that appropriately chosen values of the cut-off $K$ and the strength $\sigma$ result in semantically preserving images for CIFAR-10/100, while for ImageNet the corresponding values differ slightly. As for the bandwidth (the number of consecutive frequencies), we observed that a relatively small value was sufficient memory-wise for ImageNet, whereas for CIFAR-10/100, due to the lower dimensionality of the images, we can afford to use all the frequencies.
Finally, as mentioned in Sec. 3, we randomly sample the strength of the transformations from a uniform distribution with given minimum and maximum values. The maximum is always set to the value we selected through visual inspection, while the minimum corresponds to a very weak transformation. Fig. 9 displays additional augmented images created by applying each of the primitive transformations in our model using the aforementioned set of parameters on ImageNet. Our choice of parameters produces diverse image augmentations, while retaining the semantic content of the images.

0.B.3 Parameters for mixing procedure
Regarding the mixing parameters of our experiments, we fix the total number of generated transformed images (the width of the mixing). As for the composition of the transformations (the depth), we follow a stochastic approach such that, on every iteration, only a random number of compositions is performed. In fact, in Algorithm 1 we do not explicitly select a random depth at every iteration, but rather include the identity operator among the candidate transformations. This guarantees that, in some cases, no transformation is performed.
Appendix 0.C Detailed experimental setup
We now provide all the experimental details for the performance evaluation of Sec. 4. All models are implemented in PyTorch [33] and are trained using a cyclic learning rate schedule [40] with cosine annealing, unless stated otherwise. For IN, we fine-tune a regularly pretrained network (provided in PyTorch) with a smaller maximum learning rate, following Hendrycks et al. [20]. We use the SGD optimizer with Nesterov momentum [32]. The batch size and weight decay are set separately for C-10/C-100 and for IN-100/IN. We employ a ResNet-18 [19] on C-10, C-100 and IN-100, and a ResNet-50 for IN. The augmentation hyperparameters for AugMix and DeepAugment are the same as in their original implementations.
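For reference, the following is a minimal sketch of this optimization setup (SGD with Nesterov momentum and a one-cycle cosine schedule) in PyTorch; all hyperparameter values shown are illustrative placeholders, not the exact ones used in our experiments.

```python
import torch
import torchvision

# Model and optimizer; the values below are placeholders.
model = torchvision.models.resnet18(num_classes=100)   # e.g., for IN-100
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)

epochs, steps_per_epoch = 100, 500
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch,
    anneal_strategy="cos")  # cyclic schedule with cosine annealing

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        # ... forward pass on a PRIME-augmented batch, cross-entropy (+ JSD) loss,
        # loss.backward() ...
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```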
Appendix 0.D Additional mixing examples
Continuing Sec. 5.2, we present additional examples in Fig. 20 to demonstrate the significance of mixing in PRIME. We observe that the mixing procedure is capable of constructing augmented images that look perceptually similar to common corruptions. To illustrate this, we provide several examples in Fig. 20 for PRIME (upper half) and AugMix (lower half) on CIFAR-10 and ImageNet-100. As shown in Fig. 20, mixing spectral transformations with the clean images tends to create weather-like artefacts resembling frost and fog. Carefully combining clean and spatially transformed images produces blurs and even elastic-transform-like distortions. Moreover, blending a color augmentation with the clean image produces shot noise, whereas a spectral+color transformed image looks similar to the snow corruption. All these observations explain the good performance of PRIME on the respective corruptions.
(Fig. 20: examples of augmented images obtained by mixing, for PRIME (top) and AugMix (bottom).)
Apart from the mixing in PRIME, the mixing in AugMix also plays a crucial role in its performance. In fact, a combination of translate and shear operations with the clean image creates blur-like modifications that resemble defocus blur and motion blur (Fig. 20). This explains why AugMix excels at blur corruptions and is even better than DeepAugment against blurs (cf. Tab. 7). In addition, on CIFAR-10, notice that mixing solarize with the clean image produces impulse noise-like modifications (Fig. 20), which justifies the improvements on noise attained by AugMix (cf. Tab. 6).
Appendix 0.E SimCLR nearest neighbours
Regarding the minimum distances in the SimCLRv2 embedding space of Tab. 4, we also provide in Fig. 21 some visual examples of the nearest neighbours found for each method. In general, we observe that a smaller distance in the embedding space typically corresponds to closer visual similarity in the input space, with PRIME generating images that more closely resemble the corresponding common corruptions than AugMix. Nevertheless, we also notice that for blurs AugMix generates images that are more visually similar to the corruptions than PRIME, an observation that is in line with the lower performance of PRIME (without JSD) on blur corruptions (cf. Tab. 7) compared to AugMix.

Appendix 0.F Cosine distance statistics
Recall that in Tab. 4 we provide the average and the median of the minimum cosine distances computed in the SimCLRv2 embedding space. We now provide in Tab. 5 the values for different percentiles of these distances. We observe that the behaviour is consistent across percentiles: PRIME (with or without mixing) always produces feature representations that are more similar to the common corruptions than any version of AugMix. Note also that for the smallest percentiles PRIME without mixing reaches even lower values than PRIME. However, the difference with respect to PRIME can be considered insignificant, since it is a small fraction of the typical values in the table, while a larger population of images would potentially smooth out this difference.
Method | Min. cosine distance (increasing percentiles) | | | | |
---|---|---|---|---|---|
None (clean) | 0.33 | 0.64 | 1.97 | 6.43 | 17.44 |
AugMix (w/o mix) | 0.17 | 0.31 | 1.04 | 3.55 | 10.71 |
PRIME (w/o mix) | 0.04 | 0.07 | 0.24 | 1.87 | 7.11 |
AugMix | 0.11 | 0.21 | 0.69 | 2.61 | 8.37 |
PRIME | 0.08 | 0.12 | 0.32 | 1.61 | 5.76 |
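For illustration, the sketch below shows one way the minimum cosine distances behind Tab. 4 and Tab. 5 could be computed from embeddings of augmented and corrupted images; the `embed` function stands in for the SimCLRv2 encoder and the listed percentiles are placeholders, both assumptions rather than the exact experimental setup.

```python
import numpy as np

def min_cosine_distances(aug_feats: np.ndarray, corr_feats: np.ndarray) -> np.ndarray:
    """Minimum cosine distance from each augmented feature to any corrupted feature.

    aug_feats:  (N, d) embeddings of augmented images.
    corr_feats: (M, d) embeddings of corrupted images.
    """
    a = aug_feats / np.linalg.norm(aug_feats, axis=1, keepdims=True)
    c = corr_feats / np.linalg.norm(corr_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - a @ c.T        # (N, M) pairwise cosine distances
    return cos_dist.min(axis=1)     # distance to the nearest corrupted image

# d = min_cosine_distances(embed(augmented_images), embed(corrupted_images))
# print(np.percentile(d, [1, 5, 10, 25, 50]))  # percentile summary as in Tab. 5
```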
Appendix 0.G Embedding space visualization
To qualitatively compare how diverse the augmentations of PRIME are with respect to other methods, we follow the procedure in [44]. We randomly select 3 images from ImageNet, each belonging to a different class. For each image, we generate 100 transformed instances using AugMix and PRIME, while for DeepAugment we can only use the original images and the 2 transformed instances that are pre-generated with the EDSR and CAE image-to-image networks that DeepAugment uses. Then, we pass the transformed instances of each method through a ResNet-50 pre-trained on ImageNet and extract the features of its embedding space. For each method, we perform PCA on the extracted features and visualize their projection onto the first two principal components. The projected augmented space is shown in Fig. 22, which demonstrates that PRIME generates more diverse (larger variance) features than AugMix and DeepAugment.
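A minimal sketch of this visualization pipeline is given below, assuming the augmented images are already generated and normalized with ImageNet statistics; using a torchvision ResNet-50 with its classification head replaced by an identity is one plausible way to expose the embedding, not necessarily the exact setup used here.

```python
import torch
import torchvision.models as models
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Pre-trained ResNet-50 as a feature extractor (torchvision >= 0.13).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # expose the 2048-d embedding before the classifier
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), already normalized with ImageNet statistics."""
    return backbone(images)         # (N, 2048) embedding features

# aug_images: (100, 3, 224, 224) augmented instances of one ImageNet sample.
# feats = extract_features(aug_images).numpy()
# proj = PCA(n_components=2).fit_transform(feats)  # first two principal components
# plt.scatter(proj[:, 0], proj[:, 1]); plt.show()
```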

Appendix 0.H Performance per corruption
Dataset | Method | Clean | CC | Noise | | | Blur | | | | Weather | | | | Digital | | |
 | | | | Gauss. | Shot | Impulse | Defoc. | Glass | Motion | Zoom | Snow | Frost | Fog | Bright. | Contr. | Elastic | Pixel. | JPEG
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
C-10 | Standard | 95.0 | 74.0 | 45.1 | 58.7 | 54.9 | 83.2 | 53.3 | 76.9 | 79.1 | 83.1 | 79.3 | 89.0 | 93.6 | 76.3 | 83.9 | 75.1 | 77.9 |
AugMix | 95.2 | 88.6 | 79.3 | 84.8 | 85.8 | 94.1 | 78.9 | 92.4 | 93.4 | 89.7 | 89.0 | 91.9 | 94.3 | 90.5 | 90.5 | 87.6 | 87.5 | |
PRIME | 94.2 | 89.8 | 86.9 | 88.1 | 88.6 | 92.6 | 85.3 | 90.8 | 92.2 | 89.3 | 90.5 | 89.8 | 93.7 | 92.4 | 90.1 | 88.1 | 88.8 | |
C-100 | Standard | 76.7 | 51.9 | 25.3 | 33.7 | 26.6 | 60.8 | 47.1 | 55.5 | 57.6 | 60.8 | 56.2 | 62.5 | 72.2 | 53.2 | 63.4 | 50.1 | 52.7 |
AugMix | 78.2 | 64.9 | 46.7 | 55.1 | 60.6 | 76.2 | 47.3 | 72.6 | 74.3 | 67.4 | 64.4 | 69.9 | 75.5 | 67.4 | 69.6 | 64.9 | 61.8 | |
PRIME | 78.4 | 68.2 | 59.0 | 62.1 | 68.1 | 74.0 | 58.3 | 70.5 | 72.3 | 68.9 | 68.5 | 69.8 | 76.8 | 74.4 | 70.1 | 65.5 | 64.4 |
Dataset | Method | Clean | CC | Noise | | | Blur | | | | Weather | | | | Digital | | |
 | | | | Gauss. | Shot | Impulse | Defoc. | Glass | Motion | Zoom | Snow | Frost | Fog | Bright. | Contr. | Elastic | Pixel. | JPEG
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
IN-100 | Standard | 88.0 | 49.7 | 30.9 | 29.0 | 22.0 | 45.6 | 44.6 | 50.4 | 53.9 | 43.8 | 46.2 | 50.5 | 78.6 | 42.9 | 68.8 | 68.0 | 70.6 |
AugMix | 88.7 | 60.7 | 45.2 | 45.8 | 43.4 | 58.7 | 53.3 | 69.5 | 71.0 | 49.1 | 52.7 | 60.2 | 80.7 | 59.6 | 73.3 | 73.6 | 74.7 | |
DA | 86.3 | 67.7 | 76.3 | 75.6 | 75.7 | 64.2 | 61.7 | 61.3 | 62.7 | 54.4 | 62.8 | 55.7 | 81.6 | 49.7 | 69.9 | 83.3 | 80.6 | |
PRIME | 85.9 | 71.6 | 80.6 | 80.0 | 80.1 | 57.2 | 66.3 | 66.2 | 68.2 | 61.5 | 68.2 | 57.2 | 81.2 | 68.3 | 73.7 | 82.9 | 81.9 | |
DA+AugMix | 86.5 | 73.1 | 75.2 | 75.8 | 74.9 | 74.1 | 68.5 | 76.0 | 72.1 | 59.9 | 66.8 | 61.4 | 82.1 | 72.4 | 73.1 | 83.8 | 81.1 | |
DA+PRIME | 84.9 | 74.9 | 81.1 | 80.9 | 81.2 | 70.5 | 74.2 | 72.0 | 71.5 | 66.3 | 73.6 | 56.6 | 81.9 | 72.8 | 74.8 | 83.4 | 82.3 | |
IN | Standard∗ | 76.1 | 39.2 | 29.3 | 27.0 | 23.8 | 38.8 | 26.8 | 38.7 | 36.2 | 32.5 | 38.1 | 45.4 | 68.0 | 39.0 | 45.3 | 44.8 | 53.4 |
AugMix∗ | 77.5 | 48.3 | 40.6 | 41.1 | 37.7 | 47.7 | 34.9 | 53.5 | 49.0 | 39.9 | 43.8 | 47.1 | 69.5 | 51.1 | 52.0 | 57.0 | 60.3 | |
DA∗ | 76.7 | 52.6 | 56.6 | 54.9 | 56.3 | 51.7 | 40.1 | 48.7 | 39.5 | 44.2 | 50.3 | 52.1 | 71.1 | 48.3 | 50.9 | 65.5 | 59.3 | |
PRIME† | 77.0 | 55.0 | 61.9 | 60.6 | 60.9 | 47.6 | 39.0 | 48.4 | 46.0 | 47.4 | 50.8 | 54.1 | 71.7 | 58.2 | 56.3 | 59.5 | 62.2 | |
DA+AugMix | 75.8 | 58.1 | 59.4 | 59.6 | 59.1 | 59.0 | 46.8 | 61.1 | 51.5 | 49.4 | 53.3 | 55.9 | 70.8 | 58.7 | 54.3 | 68.8 | 63.3 | |
DA+PRIME† | 75.5 | 59.9 | 67.4 | 67.2 | 66.8 | 56.2 | 47.5 | 54.3 | 47.3 | 52.8 | 56.4 | 56.3 | 71.7 | 62.3 | 57.3 | 70.3 | 65.1 |
Beyond the average corruption accuracy that we report in Tab. 1, we also provide here the performance of each method on the individual corruptions. The results on CIFAR-10/100 and ImageNet/ImageNet-100 are shown in Tab. 6 and Tab. 7 respectively. Compared to AugMix on CIFAR-10/100, the improvements from PRIME are mostly observed against Gaussian noise, shot noise, glass blur and JPEG compression. These results show that PRIME can substantially push the performance against certain corruptions in CIFAR-10/100-C, despite the fact that AugMix is already good on these datasets. However, AugMix turns out to be slightly better than PRIME against impulse noise, defocus blur and motion blur, all of which have been shown to be resembled by AugMix-created images (see Fig. 20). On ImageNet-100, PRIME enhances the diversity of augmented images and leads to general improvements against all corruptions except certain blurs. On ImageNet, we observe that, in comparison to DeepAugment, the advantage of PRIME is reflected in almost every corruption type, except some blurs and the pixelate corruption where DeepAugment is slightly better. When PRIME is used in conjunction with DeepAugment, compared to AugMix combined with DeepAugment, our method seems to lag behind only on blurs, while achieving higher robustness on the rest of the corruptions.
Appendix 0.I Performance per severity level
We also investigate the robustness of each method at the different severity levels of the corruptions. The results for CIFAR-10/100 and ImageNet/ImageNet-100 are presented in Tab. 8 and Tab. 9 respectively. On CIFAR-10/100, PRIME predominantly helps against corruptions of maximal severity, yielding gains of 3.9% and 7.1% over AugMix at severity 5 on CIFAR-10 and CIFAR-100 respectively. Besides, on ImageNet-100, PRIME again excels at corruptions of moderate to high severity. This observation also holds when PRIME is employed in concert with DeepAugment. On ImageNet this trend continues as well: compared to DeepAugment, PRIME improves significantly on corruptions of larger severity (3.4% and 5.5% on severity levels 4 and 5 respectively). This behaviour is consistent even when PRIME combined with DeepAugment is compared to DeepAugment+AugMix, where we again see a significant improvement of 2.1% and 3.7% on levels 4 and 5 respectively.
Dataset | Method | Clean | CC Avg. | Sev. 1 | Sev. 2 | Sev. 3 | Sev. 4 | Sev. 5
---|---|---|---|---|---|---|---|---
C-10 | Standard | 95.0 | 74.0 | 87.4 | 81.7 | 75.7 | 68.3 | 56.7 |
AugMix | 95.2 | 88.6 | 93.1 | 91.8 | 89.9 | 86.7 | 81.7 | |
PRIME | 94.2 | 89.8 | 92.8 | 91.6 | 90.4 | 88.6 | 85.6 | |
C-100 | Standard | 76.7 | 51.9 | 66.7 | 59.4 | 52.8 | 45.0 | 35.4 |
AugMix | 78.2 | 64.9 | 73.3 | 70.0 | 66.6 | 61.3 | 53.4 | |
PRIME | 78.4 | 68.2 | 74.0 | 71.6 | 69.2 | 65.6 | 60.5 |
Dataset | Method | Clean | CC Avg. | Sev. 1 | Sev. 2 | Sev. 3 | Sev. 4 | Sev. 5
---|---|---|---|---|---|---|---|---
IN-100 | Standard | 88.0 | 49.7 | 73.5 | 61.0 | 49.8 | 37.2 | 27.0 |
AugMix | 88.7 | 60.7 | 80.4 | 71.8 | 63.8 | 50.3 | 37.2 | |
DA | 86.3 | 67.7 | 81.2 | 75.4 | 69.9 | 61.2 | 50.8 | |
PRIME | 85.9 | 71.6 | 81.7 | 77.5 | 73.4 | 66.9 | 58.4 | |
DA+AugMix | 86.5 | 73.1 | 82.7 | 78.0 | 75.5 | 69.6 | 59.9 | |
DA+PRIME | 84.9 | 74.9 | 82.0 | 78.7 | 76.4 | 71.8 | 65.5 | |
IN | Standard∗ | 76.1 | 39.2 | 60.6 | 49.8 | 39.8 | 27.7 | 18.0 |
AugMix∗ | 77.5 | 48.3 | 66.7 | 58.3 | 51.1 | 39.1 | 26.5 | |
DA∗ | 76.7 | 52.6 | 69.0 | 61.7 | 55.4 | 44.9 | 32.1 | |
PRIME† | 77.0 | 55.0 | 68.9 | 63.1 | 56.9 | 48.3 | 37.6 | |
DA+AugMix | 75.8 | 58.1 | 70.3 | 64.5 | 60.5 | 53.0 | 42.2 | |
DA+PRIME† | 75.5 | 59.9 | 70.8 | 66.3 | 61.6 | 55.1 | 45.9 |
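For reference, per-severity accuracy on, e.g., CIFAR-10-C can be obtained by slicing each corruption array, since the released files stack the five severities consecutively (10,000 test images each). The sketch below assumes a user-supplied `predict` function and is only an illustration of the evaluation protocol behind Tab. 8, not the exact code used here.

```python
import numpy as np

def accuracy_per_severity(corruption_file: str, labels_file: str, predict) -> list:
    """Accuracy on each of the 5 severity levels of one CIFAR-10-C corruption.

    Each corruption .npy file holds 50,000 images: severities 1..5 stacked
    in blocks of 10,000, aligned with the repeated test labels.
    """
    images = np.load(corruption_file)   # (50000, 32, 32, 3), uint8
    labels = np.load(labels_file)       # (50000,)
    accs = []
    for s in range(5):
        sl = slice(10000 * s, 10000 * (s + 1))
        preds = predict(images[sl])     # user-supplied model inference
        accs.append(float((preds == labels[sl]).mean()))
    return accs

# accs = accuracy_per_severity("CIFAR-10-C/gaussian_noise.npy",
#                              "CIFAR-10-C/labels.npy", predict)
```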
Appendix 0.J Performance on other corruptions
Finally, to examine the universality of PRIME, we evaluate the performance of our ImageNet-100 trained models on two other corrupted datasets: (i) ImageNet-100-C̄ (IN-100-C̄) [29], and (ii) stylized ImageNet-100 (SIN-100) [17]. While IN-100-C̄ is composed of corruptions that are perceptually dissimilar to those in IN-100-C, stylized IN-100 only retains global shape information and discards local texture cues from the IN-100 test images via style transfer. It is therefore interesting to test PRIME on these datasets, since the results serve as an indicator of its general corruption robustness. More information about the corruption types contained in IN-100-C̄ is available in the original paper [29].
Method | Clean | IN-100-C | IN-100-C̄ | IN-100-C̄ (individual corruptions) | | | | | | | | | | SIN-100
 | | Avg. | Avg. | BSmpl | Brown | Caustic | Ckbd | CSine | ISpark | Perlin | Plasma | SFreq | Spark |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Standard | 88.0 | 49.7 | 55.1 | 47.6 | 71.3 | 70.1 | 66.4 | 29.5 | 45.7 | 72.1 | 34.6 | 34.9 | 78.4 | 18.8 |
AugMix | 88.7 | 60.7 | 61.0 | 63.0 | 73.2 | 75.3 | 69.4 | 39.9 | 44.9 | 77.4 | 42.8 | 44.7 | 79.8 | 28.0 |
DA | 86.3 | 67.7 | 63.8 | 77.1 | 76.6 | 72.6 | 60.9 | 42.9 | 44.3 | 78.0 | 43.4 | 64.5 | 77.8 | 29.9 |
PRIME | 85.9 | 71.6 | 65.0 | 74.9 | 74.3 | 73.2 | 59.2 | 53.4 | 47.5 | 76.8 | 48.6 | 66.9 | 75.5 | 33.1 |
+1.5x epochs | 86.1 | 72.5 | 65.9 | 77.1 | 75.6 | 74.1 | 59.4 | 54.0 | 46.3 | 77.6 | 50.4 | 67.7 | 76.4 | 34.1 |
Tab. 10 enumerates the classification accuracy of the different standalone approaches on IN-100-C̄ (on average and on individual corruptions) and on SIN-100. We can see that PRIME surpasses AugMix and DeepAugment by 4.0% and 1.2% respectively on IN-100-C̄. PRIME particularly helps against certain distortions such as blue noise sample (BSmpl), inverse sparkles and plasma noise. PRIME also works well against the style-transferred images in SIN-100 and improves accuracy by 5.1% over AugMix and 3.2% over DeepAugment. Besides, the diversity of our method means that we can actually obtain better performance by increasing the number of training epochs: with 1.5x training epochs, we observe about 1% additional accuracy on each benchmark.
We also perform a similar analysis with the ImageNet-trained models and evaluate their robustness on three other distribution shift benchmarks: (i) IN-C̄ [29], (ii) SIN [17], as described previously, and (iii) ImageNet-R (IN-R) [20]. ImageNet-R contains naturally occurring artistic renditions (e.g., paintings, embroidery, etc.) of objects from the ImageNet dataset. The classification accuracy achieved by the different methods on these datasets is listed in Tab. 11. On IN-C̄, PRIME outperforms AugMix and DeepAugment by 3.1% and 1.3% respectively. Besides, PRIME also obtains competitive results on the IN-R and SIN datasets. Altogether, our empirical results indicate that the performance gains obtained by PRIME indeed translate to other corrupted datasets.
Method | Clean | IN-C | IN-C̄ | IN-C̄ (individual corruptions) | | | | | | | | | | IN-R | SIN
 | | Avg. | Avg. | BSmpl | Brown | Caustic | Ckbd | CSine | ISpark | Perlin | Plasma | SFreq | Spark | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Standard∗ | 76.1 | 39.2 | 40.0 | 36.2 | 57.8 | 54.1 | 46.1 | 14.4 | 20.9 | 61.6 | 24.3 | 19.0 | 65.2 | 36.2 | 7.4 |
AugMix∗ | 77.5 | 48.3 | 46.5 | 59.5 | 56.5 | 59.1 | 51.7 | 25.6 | 21.6 | 65.3 | 23.1 | 36.2 | 66.4 | 41.0 | 11.2 |
DA∗ | 76.7 | 52.6 | 48.3 | 60.1 | 61.1 | 57.7 | 46.8 | 25.4 | 24.4 | 68.4 | 26.5 | 45.6 | 66.8 | 42.2 | 14.2 |
PRIME† | 77.0 | 55.0 | 49.6 | 59.5 | 61.4 | 60.1 | 48.1 | 26.9 | 28.3 | 66.5 | 36.4 | 41.9 | 66.5 | 42.2 | 14.0 |