
Label Augmentation for Neural Networks Robustness

Fatemeh Amerehi
University of Limerick, Ireland
[email protected]
Patrick Healy
University of Limerick, Ireland
[email protected]
Abstract

Out-of-distribution generalization can be categorized into two types: common perturbations arising from natural variations in the real world and adversarial perturbations that are intentionally crafted to deceive neural networks. While deep neural networks excel in accuracy under the assumption of identical distributions between training and test data, they often encounter out-of-distribution scenarios that result in a significant decline in accuracy. Data augmentation methods can effectively enhance robustness against common corruptions, but they typically fall short in improving robustness against adversarial perturbations. In this study, we develop Label Augmentation (LA), which enhances robustness against both common and intentional perturbations and improves uncertainty estimation. Our findings indicate a Clean error rate improvement of up to 23.29% when employing LA in comparison to the baseline. LA also enhances robustness on the common corruptions benchmark by up to 24.23%. When tested against FGSM and PGD attacks, the improvements in adversarial robustness are noticeable, with enhancements of up to 53.18% for FGSM and 24.46% for PGD.

1 Introduction

Real-world objects exhibit a diverse array of intertwined attributes. While certain characteristics, such as the class identity of the object, are permanent, others, like the lighting conditions or pose of the object, are transient (Gabbay & Hoshen, 2019). In fact, how would you interpret the images in Fig. 1? If you were to present them to someone and ask about their interpretation, they would likely identify blurry birds and cars in snow. Despite the variations in the images, we are capable of effectively distinguishing between the class identity and the transient attributes of an object.

We already assign names and labels to objects around us, but we also have names for concepts like brightness, warmth, noisiness, and many more. The birds or cars themselves are unchanged, but the sharpness and colors are different. Essentially, the names/labels remain invariant to us, yet we still recognize that other elements differ between the images. The process of training a machine to make similar distinctions among various attributes in observed data is referred to as disentanglement. It aims to find latent representations that adeptly separate the explanatory factors contributing to variations in the input data (Bengio et al., 2013).

Disentangled representations have been shown to improve generalization to unseen scenarios in both generative and discriminative tasks (Gabbay & Hoshen, 2019; Eom & Ham, 2019; Träuble et al., 2021). Deep Neural Networks (DNNs) generalize well under the assumption of Independent and Identically Distributed (IID) data, where both training and test datasets come from the same distribution. Yet, high IID accuracy does not guarantee out-of-distribution (OOD) generalization where train and test distributions mismatch  (Liu et al., 2021a).

Figure 1: What do you see when looking at the images?

Common corruptions (Hendrycks & Dietterich, 2019) and adversarial perturbations (Goodfellow et al., 2014) are two examples of OOD scenarios leading to performance deterioration. A widely used approach to mitigate the performance drop is to incorporate data augmentation into the training pipeline (Shorten & Khoshgoftaar, 2019). While data augmentation enhances model robustness, current methods tend to improve robustness to either common corruptions or adversarial perturbations individually, rather than enhancing both concurrently.

Beyond vulnerability to distributional shifts, another common issue is miscalibration—the tendency of models to generate overconfident predictions even when the training and test examples are IID. This overconfidence is further intensified under OOD scenarios (Ovadia et al., 2019). In this study, for enhanced robustness, we present a simple yet effective method using Label Augmentation (LA) to disentangle the class of an object from irrelevant noise. LA proves effective in enhancing calibration and robustness against both common and intentional perturbations of the input data.

2 Related Works

Focusing on vision models, we review relevant literature on augmentation methods for robustness against distributional shifts, including adversarial attacks and common corruptions, alongside calibration.

Augmentation methods for robustness under distribution shift. Vision models often experience a drop in performance under common or intentional perturbations of images (Hendrycks & Dietterich, 2019; Szegedy et al., 2013). For instance, they show vulnerability to blur and Gaussian noise (Vasiljevic et al., 2016; Dodge & Karam, 2016), as well as factors such as brightness and contrast (Hendrycks & Dietterich, 2019), occlusion (Zhong et al., 2020), and small translations or rescalings of the input data (Azulay & Weiss, 2018). Additionally, when the model encounters adversarial perturbations, its performance tends to suffer even more (Goodfellow et al., 2014; Papernot et al., 2016a; Tramèr et al., 2017b; Athalye et al., 2018).

To mitigate performance degradation caused by common corruptions, a commonly employed strategy is the incorporation of label-preserving image augmentation into the training pipeline (Shorten & Khoshgoftaar, 2019). In the simplest form, data augmentations translate to simple transformations such as horizontal flipping, color shift, and random cropping (Krizhevsky et al., 2012; He et al., 2016). A more complex array of augmentations includes techniques such as random erasing (DeVries & Taylor, 2017; Zhong et al., 2020), neural style transfer (Jackson et al., 2019; Geirhos et al., 2018), image mixing (Zhang et al., 2017; Inoue, 2018; Summers & Dinneen, 2019; Hong et al., 2021; Yao et al., 2022), training with noise (Lopes et al., 2019; Rusak et al., 2020), randomized manipulations of images (Xu et al., 2023), combination and mixing of augmentation chains (Hendrycks et al., 2019; Modas et al., 2022), or search for an optimal augmentation policy (Cubuk et al., 2019).

Defense mechanisms to tackle adversarial examples—carefully crafted perturbations designed to mislead a classifier—include defensive distillation (Papernot et al., 2016b), feature squeezing (Xu et al., 2017), adversarial detection (Metzen et al., 2017; Pang et al., 2018; Deng et al., 2021a), gradient regularization (Tramèr et al., 2017a; Wu et al., 2020), and adversarial training (Goodfellow et al., 2014; Madry et al., 2017). Among these, the most effective strategy is adversarial training, which involves augmenting training data with adversarial examples to enhance robustness against attacks or to reduce the test error on clean inputs (Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Ford et al., 2019; Bai et al., 2021).

Adversarial training can be augmented in various ways, including by incorporating synthetic data (Gowal et al., 2021; Wang et al., 2023b) or unlabeled data (Carmon et al., 2019; Deng et al., 2021b), injecting noise into the hidden layers (Qin et al., 2022), using an adversarial mixture of transformations (Wang et al., 2021), or reconfiguring the low- and high-frequency components of intermediate feature representations (Bu et al., 2023). Other methods introduce weight perturbations to enhance model robustness (Wu et al., 2020), regulate gradient growth to prevent robust overfitting during multi-step adversarial training (Li et al., 2022), or use ensemble training to mitigate vulnerabilities across sub-models while preserving comparable accuracy on clean data (Cai et al., 2023).

The effectiveness of adversarial training depends on the choice of adversarial examples. For instance, training exclusively with the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) enhances robustness against non-iterative attacks but not against iterative attacks such as Projected Gradient Descent (PGD) (Kurakin et al., 2016; Madry et al., 2017). The literature offers conflicting views on whether adversarial training enhances robustness against common corruptions: while some studies suggest a positive correlation (Ford et al., 2019; Kireev et al., 2022), others argue that adversarial robustness and robustness to common perturbations are independent (Laugros et al., 2019).

Calibration. Despite performing well in generalization and prediction under the IID setting, DNNs often produce overconfident results, which worsen even more in OOD settings (Guo et al., 2017; Ovadia et al., 2019; Gawlikowski et al., 2021). Well-calibrated uncertainty estimates indicate when the output of models is reliable and when it is questionable. Temperature scaling with a validation set (Guo et al., 2017), or ensembling predictions from independently trained classifiers on the entire dataset with random initialization (Lakshminarayanan et al., 2017), improves calibration. Using soft labels (Szegedy et al., 2016)—a weighted average of one-hot training labels and a uniform distribution over targets—often prevents the network from becoming overly confident in specific labels, thus reducing calibration errors (Müller et al., 2019; Lukasik et al., 2020).

Figure 2: The CIFAR-10 dataset includes 10 classes representing airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The one-hot label for horses is [0 0 0 0 0 0 0 1 0 0]. Considering three distinct augmentation operation classes such as contrast, noise, and blur, the one-hot label for noise is [0 1 0]. In standard augmentation, labels remain invariant. When applying Label Augmentation with a smoothing factor $\delta$, the resulting label for a noisy image of a horse is [0 0 0 0 0 0 0 $1-\delta$ 0 0 0 $\delta$ 0]. This maintains invariance with respect to the original categories while distinguishing between more abstract concepts, such as noisy and noise-free inputs.

3 Label Augmentation

We now introduce our central idea.

Let $\mathcal{O}=\{o_{i}\}_{i=1}^{M}$ represent a set of $M$ label-preserving augmentation operations, each of which, when applied to an input, introduces certain effects to it. Let $\mathcal{Z}=\{z_{i}\}_{i=1}^{M}$ be the one-hot encoded name of each operation. Given a collection of objects $\mathcal{X}=\{x_{i}\}_{i=1}^{N}$ and a set of labels $\mathcal{Y}=\{y_{i}\}_{i=1}^{K}$, we humans assign a label $y_{i}$ to each object $x_{i}$ based on the attributes we observe in it. Further, let $\mathcal{O}(\mathcal{X})$ denote the result of applying operations in $\mathcal{O}$ to the objects in $\mathcal{X}$.

If we select an operation $o_{j}\in\mathcal{O}$ and apply it to $x_{i}\in\mathcal{X}$, certain attributes of $x_{i}$ are affected. Although the class identity remains the same for each element of $\mathcal{O}(\mathcal{X})$, we no longer assign the same label $y_{i}$ to the transformed objects/images (revisit Fig. 1). Instead, we assign a richer name that incorporates both the class identity and the effect. In essence, we disentangle class identity from transformations/operations.

In a $K$-class classification task, the goal is to model the mapping from the input data $x_{i}$ to its corresponding class label $y_{i}$ through a DNN $f:\mathcal{X}\to\mathcal{Y}$. Typically, this involves using a softmax output layer and a cross-entropy loss to quantify the dissimilarity between $y_{i}$, the true (one-hot encoded) class distribution, and $p_{i}$, the softmax of the predictions. The cross-entropy loss is defined as $\mathcal{L}_{CE}(y_{i},p_{i})=-\sum_{k=1}^{K}y_{ik}\log p_{ik}$, where $y_{ik}$ represents the $k$-th element of the true class distribution $y_{i}$, and $p_{ik}$ denotes the $k$-th element of the predicted class distribution $p_{i}$.

To extend the generalization capability of the model to OOD data, existing augmentation methods train with the augmented input $\mathcal{O}(\mathcal{X})$ while assigning the same label to transformed and untransformed input data, to help the model learn representations that remain invariant to a set of data augmentations. Essentially, existing techniques aim to find a mapping $f:\mathcal{X}\cup\mathcal{O}(\mathcal{X})\to\mathcal{Y}$. Given the distinction we make in our naming between $\mathcal{X}$ and $\mathcal{O}(\mathcal{X})$, and considering the use of such augmented inputs in training DNNs, would it not be advantageous to explicitly communicate to the model that labels differ in additional factors beyond class identity? To enable this, we employ Label Augmentation (LA).

In LA, the objective is to maintain invariance to the input class category $y_{i}$, while simultaneously enabling distinctions between $x_{i}$ and its various transformed versions. To achieve this, after any transformation $o_{j}$ of the input data $x_{i}$, we simply concatenate the two one-hot labels $y_{i}$ and $z_{j}$ with a factor of $\delta$. In other words, whenever we augment the input data, we augment the labels as well. Specifically, the label assigned to $\tilde{x_{i}}=o_{j}(x_{i})$ is defined as in Eq. 1 and is a vector of length $K+M$. The label $\tilde{y_{i}}$ has the value $1-\delta$ at position $i$ and the value $\delta$ at position $K+j$. In other words, LA aims to find a more comprehensive mapping $f:\mathcal{X}\cup\mathcal{O}(\mathcal{X})\to\text{Concat}_{\delta}[\mathcal{Y},\mathcal{Z}]$ that maps the augmented collection of inputs to the augmented labels. This is shown in Fig. 2.

$\tilde{y_{i}}=\text{Concat}[(1-\delta)\,y_{i},\;\delta\,z_{j}]$   (1)

We consider the identity transformation with $\delta=0$ as a specific case that represents untransformed input data. In the case of a transformation, we select $\delta$ to be a small value in order to prevent excessive deviation of the model towards the augmented label. The value of $\delta$ is drawn from a uniform distribution: $\delta\sim\text{U}(0.05,0.1)$.

To ensure the same dimensionality and maintain the class identity $y_{i}$ for any untransformed input data $x_{i}$, we simply expand the one-hot labels $y_{i}$ from $K$ dimensions to $K+M$. At position $i$, we assign 1, as before, to represent the class identity. In LA training, the loss is computed as $\mathcal{L}_{LA}(\tilde{y}_{i},\tilde{p}_{i})=-\sum_{k=1}^{K+M}\tilde{y}_{ik}\log\tilde{p}_{ik}$, where $\tilde{p}_{i}$ denotes the softmax of the predictions for $\tilde{x_{i}}$.
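A minimal PyTorch-style sketch of this label construction and loss is given below. The helper names (`augment_labels`, `label_augmentation_loss`) and the convention of passing an operation index of -1 for untransformed inputs are our own assumptions, not part of the original implementation.

```python
import torch
import torch.nn.functional as F

def augment_labels(y, op_idx, num_classes, num_ops, delta):
    """Build augmented targets of length K + M as in Eq. 1.

    y       -- (B,) integer class indices in [0, K)
    op_idx  -- (B,) integer operation indices in [0, M), or -1 for untransformed inputs
    delta   -- smoothing factor; the identity transformation corresponds to delta = 0
    """
    B = y.shape[0]
    device = y.device
    y_tilde = torch.zeros(B, num_classes + num_ops, device=device)
    transformed = op_idx >= 0
    # Class-identity position: 1 - delta for transformed inputs, 1 for untransformed ones.
    class_mass = torch.where(transformed,
                             torch.full((B,), 1.0 - delta, device=device),
                             torch.ones(B, device=device))
    y_tilde[torch.arange(B, device=device), y] = class_mass
    # Operation position (K + j): delta for transformed inputs only.
    rows = torch.arange(B, device=device)[transformed]
    y_tilde[rows, num_classes + op_idx[transformed]] = delta
    return y_tilde


def label_augmentation_loss(logits, y_tilde):
    """Cross-entropy between the soft augmented targets and the softmax over K + M outputs."""
    log_p = F.log_softmax(logits, dim=1)
    return -(y_tilde * log_p).sum(dim=1).mean()
```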

In the following section, we show that the act of assigning names to operations and augmenting labels leads to better generalization compared to traditional augmentation. Furthermore, as we will demonstrate, this helps achieve better robustness to both common and intentional perturbations, as well as improved calibration.

Figure 3: Examples of augmentation operations applied in Label Augmentation.

4 Experimental Setup

In the following, we elaborate on the dataset, training configuration, the networks employed, and the evaluation metrics for assessing both robustness and calibration. Afterwards, we present the results and analysis.

4.1 Configurations and Metrics

Datasets. We utilize the CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) datasets. Both datasets contain 50,000 training images and 10,000 test images of size 32 × 32 × 3. To assess the robustness of models against common data shift, we evaluate on the CIFAR-10-C and CIFAR-100-C benchmarks (Hendrycks & Dietterich, 2019). These datasets are created by introducing various distortions to the original CIFAR-10 and CIFAR-100 datasets and contain a total of 15 corruption types, including noise, blur, weather, and digital distortions. Each distortion is applied at severity levels $1\leq s\leq 5$. In the following, we refer to these datasets as CIFAR and CIFAR-C, respectively.

Robustness metrics. In Tables 1, 2, and 3, the Clean Error represents the standard classification error on uncorrupted test data. For a given corruption $c$, the error rate at corruption severity $s$ is denoted $\text{E}_{c,s}$. Averaging the error across the severities $s$, the corruption error $\text{CE}_{c}$ is computed as $\text{CE}_{c}=\frac{1}{5}\sum_{s=1}^{5}\text{E}_{c,s}$. Finally, the mean Corruption Error $\text{mCE}=\frac{1}{15}\sum_{c=1}^{15}\text{CE}_{c}$ is the average of all 15 corruption errors. This gives a single value for comparing robustness against common corruptions (Hendrycks & Dietterich, 2019; Hendrycks et al., 2019).
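For concreteness, the aggregation can be written in a few lines of Python (the function and variable names are ours):

```python
import numpy as np

def corruption_errors(errors_by_corruption):
    """Aggregate per-severity error rates into CE_c and mCE as defined above.

    errors_by_corruption -- dict mapping each of the 15 corruption names to a
                            list of 5 error rates, one per severity level s = 1..5.
    """
    ce = {c: float(np.mean(e)) for c, e in errors_by_corruption.items()}  # CE_c
    mce = float(np.mean(list(ce.values())))                               # mCE over the 15 corruptions
    return ce, mce
```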

To measure robustness against adversarial perturbations, we employ FGSM (Goodfellow et al., 2014) and 40-step iterative PGD (Madry et al., 2017) attacks, both with $L_{\infty}$ constraints using two budgets, $\varepsilon=0.03$ and $\varepsilon=0.3$. We utilize the implementations provided by the CleverHans 4.0 library (Papernot et al., 2018).
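A sketch of how these attacks can be generated with the CleverHans 4.0 PyTorch API is shown below; the PGD step size `eps / 10` is our assumption, since the paper does not state it.

```python
import numpy as np
import torch
from cleverhans.torch.attacks.fast_gradient_method import fast_gradient_method
from cleverhans.torch.attacks.projected_gradient_descent import projected_gradient_descent

def adversarial_errors(model, loader, eps=0.03, device="cuda"):
    """Classification error under FGSM and 40-step PGD, both L_inf-bounded by eps."""
    model.eval()
    wrong_fgsm = wrong_pgd = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_fgsm = fast_gradient_method(model, x, eps, np.inf)
        # eps_iter = eps / 10 is an assumed step size; nb_iter = 40 matches the 40-step PGD above.
        x_pgd = projected_gradient_descent(model, x, eps, eps / 10, 40, np.inf)
        wrong_fgsm += (model(x_fgsm).argmax(dim=1) != y).sum().item()
        wrong_pgd += (model(x_pgd).argmax(dim=1) != y).sum().item()
        total += y.size(0)
    return wrong_fgsm / total, wrong_pgd / total
```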

Calibration metric. A classifier is considered calibrated when its confidence consistently matches its accuracy (Guo et al., 2017). For instance, with 100 predictions, each at a confidence level of 0.7, we expect 70 correct classifications. We evaluate the calibration of the network using the Expected Calibration Error (ECE) (Guo et al., 2017) and the Root Mean Square (RMS) Calibration Error. Given the finite size of the test sets, ECE and RMS Calibration Error are estimated by grouping all $n$ test examples into $M$ equal-size bins, ordered by prediction confidence—the winning softmax score.

Let $B_{m}$ represent the set of samples whose predictions fall into bin $m$. The accuracy and confidence of $B_{m}$ are defined as $\text{acc}(B_{m})=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}\mathbf{1}(\hat{y}_{i}=y_{i})$ and $\text{conf}(B_{m})=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}\hat{p}_{i}$, respectively. Here, $\hat{y}_{i}$ and $y_{i}$ represent the predicted and ground-truth labels for input $x_{i}$, and $\hat{p}_{i}$ is the confidence—the winning score—of sample $i$. The ECE and RMS errors are then defined as in Eq. 2 and Eq. 3, respectively. We use the implementations provided by the TorchMetrics 1.4.0 library (Lightning, 2024).

$\text{ECE}=\sum_{m=1}^{M}\frac{|B_{m}|}{n}\Bigl|\text{acc}(B_{m})-\text{conf}(B_{m})\Bigr|$   (2)
$\text{RMS}=\sqrt{\sum_{m=1}^{M}\frac{|B_{m}|}{n}\Bigl(\text{acc}(B_{m})-\text{conf}(B_{m})\Bigr)^{2}}$   (3)
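The paper uses the TorchMetrics implementation; purely for illustration, a direct translation of Eq. 2 and Eq. 3 with equal-width confidence bins (one common binning choice, which may differ from the library's exact scheme) might look as follows:

```python
import torch

def calibration_errors(confidences, correct, n_bins=15):
    """ECE (Eq. 2) and RMS calibration error (Eq. 3) from winning softmax scores.

    confidences -- (n,) tensor of winning softmax scores
    correct     -- (n,) float tensor, 1.0 where the prediction matches the label, else 0.0
    """
    edges = torch.linspace(0, 1, n_bins + 1).tolist()
    ece, mse = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = correct[in_bin].mean() - confidences[in_bin].mean()   # acc(B_m) - conf(B_m)
            weight = in_bin.float().mean()                              # |B_m| / n
            ece += (weight * gap.abs()).item()
            mse += (weight * gap ** 2).item()
    return ece, mse ** 0.5
```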

Training configuration and hyper-parameter settings. We run all experiments on a GeForce RTX 3080 Ti GPU with CUDA 12.0 using PyTorch 2.0.1. To assess robustness across different architectures, we use a standard LeNet (LeCun et al., 1998), a ResNet-50 (He et al., 2016), a 40-2 Wide ResNet (Zagoruyko & Komodakis, 2016), a 32$\times$4d ResNeXt-50 (Xie et al., 2017), and a Swin Transformer (Liu et al., 2021b). All networks start with a learning rate of 0.1, which decays by a factor of 0.0001 following a cosine annealing schedule (Loshchilov & Hutter, 2016). Before any augmentations, we preprocess input images with random horizontal flips and cropping. In all experiments, we train for 25 epochs with default weights and optimize with stochastic gradient descent with a momentum of 0.9. For both training and evaluation, we set the batch size to 1024. Each experiment in Tables 3, 4, and 5 is conducted three times, and the averages along with their corresponding standard deviations are reported.
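The setup above corresponds roughly to the following PyTorch configuration; this is a sketch, and the `eta_min` value is only our reading of the stated decay factor.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 25
BATCH_SIZE = 1024

def make_optimizer_and_scheduler(model):
    """SGD with momentum 0.9 and a cosine-annealed learning rate starting at 0.1.

    eta_min is our interpretation of "decays by a factor of 0.0001"; the original code may differ.
    """
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=0.1 * 1e-4)
    return optimizer, scheduler
```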

Baseline comparisons. We begin by comparing LA with traditional augmentation, selecting a set of label-preserving operations: Plasma noise (Nicolaou et al., 2022), Planckian Jitter (Zini et al., 2022), and Gamma adjustment, as illustrated in Fig. 3. First, we train the models with these augmentations using LA, and then repeat the training with the same augmentations but without LA. The results of these experiments are presented in Tables 1 and 2, and we analyze them in the next section.
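One possible realization of these three operations is sketched below, assuming Kornia's implementations of Plasma noise, Planckian Jitter, and gamma adjustment; the parameter ranges are illustrative and not taken from the paper.

```python
import torch
import kornia.augmentation as K

# Assumed Kornia implementations of the three operations; parameter ranges are illustrative.
OPERATIONS = [
    K.RandomPlasmaBrightness(roughness=(0.1, 0.7), p=1.0),   # Plasma noise (P.)
    K.RandomGamma(gamma=(0.5, 2.0), p=1.0),                   # Gamma adjustment (G.)
    K.RandomPlanckianJitter(mode="blackbody", p=1.0),         # Planckian Jitter (J.)
]

def apply_random_operation(x):
    """Apply one randomly chosen operation to a batch and return it with the operation index."""
    j = int(torch.randint(len(OPERATIONS), (1,)).item())
    return OPERATIONS[j](x), j
```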

To measure the effectiveness of LA, we compare it against various augmentation techniques, including Mixup (Zhang et al., 2017), AugMix (Hendrycks et al., 2019), and AutoAugment (Cubuk et al., 2019). Additionally, we include FGSM (Goodfellow et al., 2014) and 10-step iterative PGD (Madry et al., 2017), both with $L_{\infty}$ constraints and $\varepsilon=0.3$, for comparisons against adversarial training. Except for LA, in all these experiments we adjust the last classification layer of the networks to output 10 and 100 class categories for the CIFAR-10 and CIFAR-100 datasets, respectively. For LA, depending on the number of operations used for augmentation, we add additional output units to accommodate the prediction of the augmented classes. After training the models with LA, during the testing phase we ignore the outputs for the augmented labels—prior to the softmax operation—and only consider the class identity labels as the final output of the models. This is because the class categories in the CIFAR and CIFAR-C datasets are not linked to augmentation operations such as Plasma noise. This can be thought of as asking a person to filter out what they see in a transformed picture and simply identify the class without providing extra detail.
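At test time this simply amounts to truncating the logits to the first K class positions, for example (a minimal sketch; the function name is ours):

```python
import torch

@torch.no_grad()
def predict_class_only(model, x, num_classes):
    """Ignore the M augmentation-label outputs and predict over the K class categories only."""
    logits = model(x)                        # shape (B, K + M)
    return logits[:, :num_classes].argmax(dim=1)
```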

Augmentation operations. Corruptions employed in the CIFAR-C benchmark (Hendrycks & Dietterich, 2019) include Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic Transform, Pixelate, and JPEG Compression. According to Hendrycks & Dietterich (2019), models evaluated on CIFAR-C should avoid training with augmentations identical to those represented in the benchmark.

AutoAugment (Cubuk et al., 2019) searches over operations for data augmentation, as well as the probabilities and magnitudes at which they are applied. Through this, it identifies the optimal policy for models to achieve the highest validation accuracy on a given target dataset. The operations available during the AutoAugment search include five geometric transforms (shear x/y, translate x/y, and rotate), two color transforms (color, invert), seven intensity transforms (brightness, sharpness, solarize, equalize, autocontrast, contrast, and posterize), as well as cutout (DeVries & Taylor, 2017) and sample pairing (Zhang et al., 2017; Inoue, 2018). Some of these transformations may overlap with those in CIFAR-C.

To avoid this, AugMix (Hendrycks et al., 2019) integrates augmentations from AutoAugment that do not overlap with the CIFAR-C benchmark (Hendrycks & Dietterich, 2019). Specifically, AugMix employs five geometric transforms (shear x/y, translate x/y, and rotate) and four intensity transforms (solarize, equalize, autocontrast, and posterize). However, ensuring complete independence between augmentations is challenging. For example, Rusak et al. (2020) and Mintun et al. (2021) highlight the similarity of posterize to JPEG compression and of shear and translation to blur, respectively. Taking this into account, since we evaluate robustness on CIFAR-C and also want to compare with AutoAugment and AugMix, we choose Plasma noise (Nicolaou et al., 2022), Planckian Jitter (Zini et al., 2022), and Gamma adjustment as augmentations that are disjoint from all three. Note that, for the sake of complete comparisons between methods, in our implementation of AutoAugment for the CIFAR-C evaluation, unlike AugMix, we do not remove overlapping augmentations such as cutout (DeVries & Taylor, 2017), brightness, etc. Moreover, we conduct additional experiments by incorporating AugMix as the augmentation operation in LA. We denote these experiments as AugMix$^{++}$ in Tables 3 and 4. In the case of adversarial training, we employ FGSM (Goodfellow et al., 2014) and 10-step iterative PGD (Madry et al., 2017), both with $L_{\infty}$ constraints and $\varepsilon=0.3$.

Label Smoothing and Multi-Task Learning. Label smoothing (LS) is an effective technique for regularizing DNNs. It achieves this by generating soft labels through a weighted average of a uniform distribution and the original hard labels. LS is typically used to address overfitting during training, leading to improved classification accuracy (Szegedy et al., 2016; Müller et al., 2019). While LS distributes the probability mass between the correct class label and all other classes, LA allocates the probability mass only between the identity class label and the augmentation labels. We compare the performance of LA against LS, where both techniques employ the same smoothing factor $\delta$. Furthermore, we compare LA with Multi-Task Learning (MTL), which offers an alternative approach to distinguishing between class identity and an indicator of augmentations. MTL involves optimizing a neural network across multiple related tasks simultaneously, aiming to enhance generalization by leveraging shared patterns and representations among tasks (Standley et al., 2020; Zhang & Yang, 2021; Xin et al., 2022). We use a 40-2 Wide ResNet as a shared feature extractor, along with two task-specific output heads: one for predicting class identity and the other for identifying the type of distortion applied to the images. We employ $\delta$ as the weight for the augmentation-prediction task and $1-\delta$ for the class-identity prediction task; a sketch of this setup is given below. The results of these experiments are presented in Table 5 for both CIFAR datasets.
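A minimal sketch of this MTL baseline with a shared trunk and two heads; the class name, the `feat_dim` argument, and the exact loss-weighting form are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadMTL(nn.Module):
    """Shared feature extractor with a K-way class head and an M-way augmentation head."""
    def __init__(self, backbone, feat_dim, num_classes, num_ops):
        super().__init__()
        self.backbone = backbone                       # e.g. a Wide ResNet without its final layer
        self.class_head = nn.Linear(feat_dim, num_classes)
        self.aug_head = nn.Linear(feat_dim, num_ops)

    def forward(self, x):
        h = self.backbone(x)
        return self.class_head(h), self.aug_head(h)

def mtl_loss(class_logits, aug_logits, y, op_idx, delta):
    """Weighted sum of the two task losses: 1 - delta for class identity, delta for the augmentation task."""
    return (1 - delta) * F.cross_entropy(class_logits, y) + delta * F.cross_entropy(aug_logits, op_idx)
```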

4.2 Results

Comparisons of regular data augmentation versus LA. Table 1 compares standard training—where the model is trained on clean data and tested on clean data—with training via LA using different numbers of operations for augmentation. The clean error improves when employing a single operation, while using two and three operations results in greater error reduction, up to 17.51% when employing Plasma and Gamma (i.e., P.G.$^{++}$, where the notation $^{++}$ signifies label augmentation with factor $\delta$). Similarly, we observe improvements in mCE: P.G.$^{++}$ achieves a 22.70% improvement over the mCE of standard training, and introducing one additional operation, Jitter, raises the improvement to 28.56% relative to the standard. Both calibration errors, ECE and RMS, improve when utilizing LA; however, the values show no correlation between the number of operations employed and the improvement achieved. Similarly, errors under the FGSM and PGD adversarial attacks improve, yet there is no apparent relationship between the number of operations employed and the observed enhancements. More details can be found in Fig. 4.

The results of repeating the experiment without LA are shown in Table 2. In every instance, the Clean and calibration errors deteriorate. The mCE improves in most cases, by up to 7.14% when using three operations. The adversarial error shows both improvements and deteriorations across cases. Overall, there is no clear pattern between the number of operations and the fluctuation of errors.

Based on Tables 1 and 2, as well as Fig. 4, it is evident that LA outperforms standard augmentation in minimizing Clean, mCE, calibration, and adversarial errors. As P.G.$^{++}$ demonstrates the most significant enhancement in clean and PGD errors on Wide ResNet-50, while P.G.J.$^{++}$ shows better improvements in mCE and calibration, we proceed with these two configurations to compare LA with other augmentation methods across different networks.

Comparisons of other augmentations versus LA. Tables 3 and 4 summarize the results obtained by applying Mixup (Zhang et al., 2017), AugMix (Hendrycks et al., 2019), AutoAugment (Cubuk et al., 2019), adversarial training with FGSM (Goodfellow et al., 2014) and 10-step iterative PGD (Madry et al., 2017), and LA across various networks, for the CIFAR-10 and CIFAR-100 datasets, respectively.

Robustness enhancement for CIFAR-10. From the data presented in Table 3, LA clearly contributes to an improvement in Clean error across various architectures. When considering the average performance across the five networks, LA outperforms all other methods, with P.G.J.$^{++}$ improving on standard training by 23.29%. In terms of enhancing robustness against common corruptions, there is a consistent improvement across AugMix, AutoAugment, FGSM, and LA. However, depending on the network employed, Mixup training shows either a decrease or an increase in error rates compared to the baseline. In all cases, PGD worsens both Clean error and mCE.

When comparing improvements in calibration errors, similar to the findings of Wang et al. (2023a), our results show that Mixup training tends to increase calibration errors compared to standard training. Similarly, on average, adversarial training negatively affects the ECE. While the RMS error improves with FGSM training by 4.95%, PGD diminishes it. All other methods improve uncertainty estimates, among which AutoAugment outperforms the rest, with reductions of 44.33% and 37.01% in ECE and RMS, respectively. Among the LA training experiments, AugMix$^{++}$ surpasses the other augmentations, with reductions in ECE and RMS of 33.65% and 31.15%, while P.G.J.$^{++}$ reduces the two calibration errors by 2.18% and 7.65%. Under both $\varepsilon$ budgets of 0.03 and 0.3, Mixup, FGSM, PGD, and all LA trainings show improvements. The enhancements in robustness from training with FGSM ($\varepsilon=0.3$) against FGSM attacks are 16.15% and 47.22%, compared to the standard adversarial error. However, training with FGSM barely improves the error rate under the PGD attack with $\varepsilon=0.3$. In contrast, LA not only generalizes to both attacks but also outperforms adversarial training on both FGSM and PGD when considering P.G.J.$^{++}$. More specifically, robustness to FGSM and PGD improves by 61.24% and 49.54% at budget $\varepsilon=0.03$, and by 53.18% and 24.46% at budget $\varepsilon=0.3$, respectively. More details on the percentages of improvement compared to standard training can be found in Table 6.

Robustness enhancement for CIFAR-100. Table 4 presents the performance of the aforementioned methods on the CIFAR-100 dataset. Except for Mixup and the adversarial training methods, the remaining techniques all improve the Clean error compared to standard training. Specifically, AugMix, AutoAugment, P.G.$^{++}$, P.G.J.$^{++}$, and AugMix$^{++}$ reduce the error by 1.58%, 4.49%, 3.24%, 4.31%, and 3.53%, respectively. In terms of calibration error, on average, almost all methods show improvements compared to standard training, except for PGD. For both low and high attack budgets, all methods improve on the baseline except for AutoAugment and AugMix under PGD attacks. With $\varepsilon=0.03$, the robustness gain from LA training exceeds that from FGSM and PGD training. Specifically, FGSM and PGD reduce the FGSM error by 9.87% and 8.46%, respectively, whereas P.G.J.$^{++}$ reduces it by 26.94%. Similarly, with $\varepsilon=0.03$, LA outperforms adversarial training: FGSM and PGD reduce the PGD error by 7.94% and 13.66%, respectively, whereas P.G.J.$^{++}$ reduces it by 18.41%. However, with the higher budget of $\varepsilon=0.3$, the gains from FGSM and PGD training against FGSM attacks are 28.29% and 24.40%, while P.G.J.$^{++}$ achieves a gain of 16.93%. The corresponding robustness gains against PGD are 0.62%, 10.03%, and 10.38% for FGSM, PGD, and P.G.J.$^{++}$, respectively. More details can be found in Table 7.

| Error | Std. | P.$^{++}$ | G.$^{++}$ | J.$^{++}$ | P.G.$^{++}$ | P.J.$^{++}$ | G.J.$^{++}$ | P.G.J.$^{++}$ |
|---|---|---|---|---|---|---|---|---|
| Clean | 9.54 | 9.80 | 9.11 | 9.05 | 7.87 | 8.12 | 8.37 | 8.18 |
| mCE | 22.69 | 18.90 | 19.64 | 20.59 | 17.54 | 18.49 | 18.98 | 16.21 |
| ECE | 6.33 | 6.22 | 5.87 | 5.84 | 5.31 | 6.19 | 5.76 | 5.23 |
| RMS | 10.52 | 9.78 | 9.34 | 9.05 | 8.66 | 9.46 | 9.53 | 8.72 |
| FGSM | 69.13 | 49.76 | 49.65 | 38.73 | 39.33 | 32.37 | 44.39 | 34.61 |
| PGD40 | 94.82 | 82.77 | 81.18 | 83.44 | 69.71 | 73.65 | 77.08 | 67.89 |

Table 1: Performance comparison of the baseline to Label Augmentation with different operations on CIFAR-10 using the Wide ResNet-50 architecture. P.$^{++}$, G.$^{++}$, and J.$^{++}$ refer to Plasma noise, Gamma adjustment, and Planckian Jitter, respectively. The symbol $^{++}$ denotes the concatenation of labels with a factor of $\delta$ during these operations. The FGSM and PGD attacks use $L_{\infty}$ constraints with $\varepsilon=0.3$.
| Error | Std. | P. | G. | J. | P.G. | P.J. | G.J. | P.G.J. |
|---|---|---|---|---|---|---|---|---|
| Clean | 9.54 | 10.36 | 10.49 | 10.11 | 10.75 | 10.23 | 9.76 | 10.17 |
| mCE | 22.69 | 21.66 | 22.21 | 22.98 | 21.25 | 21.71 | 23.11 | 21.07 |
| ECE | 6.33 | 8.12 | 8.44 | 8.03 | 8.22 | 8.27 | 8.33 | 8.61 |
| RMS | 10.52 | 13.10 | 13.54 | 12.67 | 12.55 | 12.96 | 12.87 | 13.74 |
| FGSM | 69.13 | 67.89 | 71.49 | 68.88 | 67.12 | 69.53 | 67.99 | 68.68 |
| PGD40 | 94.82 | 93.25 | 93.73 | 93.59 | 93.91 | 93.88 | 94.12 | 93.90 |

Table 2: Performance comparison between the baseline and standard augmentation with different operations on CIFAR-10 using the Wide ResNet-50 architecture. P., G., and J. represent Plasma noise, Gamma adjustment, and Planckian Jitter, respectively. The FGSM and PGD attacks use $L_{\infty}$ constraints with $\varepsilon=0.3$.
Training methods (columns): Std. | Mixup | AugMix | AutoAug. | FGSM | PGD10 | P.G.$^{++}$ | P.G.J.$^{++}$ | AugMix$^{++}$
Clean LeNet 14.97±\pm0.53 16.81±\pm0.47 14.82±\pm0.46 14.49±\pm0.29 14.83±\pm0.58 26.57±\pm0.95 14.11±\pm0.65 12.89±\pm0.27 17.65±\pm0.75
ResNet 10.06±\pm0.22 10.43±\pm0.58 9.87±\pm0.11 9.12±\pm0.25 10.59±\pm0.36 21.84±\pm0.73 8.89±\pm0.14 8.46±\pm0.25 9.53±\pm0.25
ResNeXt 11.37±\pm0.25 11.77±\pm0.73 10.97±\pm0.57 9.46±\pm0.22 11.82±\pm0.21 24.54±\pm0.61 9.18±\pm0.19 8.64±\pm0.31 10.74±\pm0.23
WResNet 9.57±\pm0.21 10.40±\pm0.23 9.41±\pm0.25 8.34±\pm0.14 10.27±\pm0.64 25.13±\pm1.77 7.96±\pm0.52 8.08±\pm0.19 9.46±\pm0.31
SwinT 14.24±\pm0.15 12.09±\pm0.18 10.15±\pm0.15 13.43±\pm0.07 10.19±\pm0.26 16.32±\pm2.06 8.31±\pm0.63 8.12±\pm0.05 10.15±\pm0.36
Mean 12.04±\pm0.27 12.30±\pm0.44 11.04±\pm0.31 10.97±\pm0.19 11.54±\pm0.41 22.88±\pm1.22 9.69±\pm0.43 9.24±\pm0.21 11.51±\pm0.38
mCE LeNet 28.24±\pm0.36 28.21±\pm0.81 23.84±\pm0.64 25.15±\pm0.31 28.03±\pm0.44 40.66±\pm0.42 24.05±\pm0.27 23.71±\pm0.09 30.88±\pm0.45
ResNet 24.71±\pm0.29 22.95±\pm0.57 18.71±\pm0.05 19.47±\pm0.15 22.40±\pm0.71 37.49±\pm0.78 18.56±\pm0.15 18.52±\pm0.22 23.71±\pm0.33
ResNeXt 24.83±\pm0.34 23.86±\pm0.19 18.75±\pm0.22 19.12±\pm0.18 22.28±\pm0.73 40.50±\pm0.39 18.51±\pm0.11 18.10±\pm0.23 26.45±\pm0.71
WResNet 22.73±\pm0.27 21.55±\pm0.27 17.01±\pm0.28 17.89±\pm0.27 20.99±\pm0.32 42.58±\pm2.53 16.64±\pm0.39 16.78±\pm0.71 22.75±\pm0.89
SwinT 23.99±\pm0.14 24.13±\pm0.22 18.61±\pm0.17 23.96±\pm0.10 20.97±\pm0.39 24.27±\pm1.38 17.20±\pm0.38 17.22±\pm0.11 18.61±\pm0.87
Mean 24.90±\pm0.28 24.14±\pm0.41 19.38±\pm0.27 21.12±\pm0.20 22.93±\pm0.52 37.10±\pm1.10 18.99±\pm0.26 18.87±\pm0.27 24.48±\pm0.65
Calibration ECE LeNet 6.79±\pm0.23 15.33±\pm1.52 5.15±\pm0.28 2.92±\pm0.06 6.46±\pm0.91 15.29±\pm0.65 6.18±\pm1.02 7.65±\pm0.54 2.18±\pm0.12
ResNet 6.29±\pm0.05 16.95±\pm2.03 5.38±\pm0.15 3.69±\pm0.19 5.99±\pm1.65 16.32±\pm0.76 5.82±\pm0.38 5.81±\pm0.66 3.04±\pm0.23
ResNeXt 7.68±\pm0.22 19.06±\pm1.27 6.08±\pm0.41 3.96±\pm0.18 7.72±\pm0.92 19.25±\pm0.27 6.18±\pm2.41 5.83±\pm0.31 5.48±\pm0.11
WResNet 6.31±\pm0.21 18.45±\pm0.67 5.55±\pm0.34 4.13±\pm0.31 6.45±\pm0.73 25.16±\pm0.98 5.11±\pm0.84 6.19±\pm0.26 6.17±\pm0.39
SwinT 4.13±\pm0.11 8.73±\pm0.93 3.31±\pm0.12 2.67±\pm0.12 4.91±\pm1.60 3.05±\pm0.25 5.17±\pm1.22 5.04±\pm0.79 3.83±\pm0.75
Mean 6.24±\pm0.16 15.70±\pm1.28 5.09±\pm0.26 3.47±\pm0.17 6.31±\pm1.16 15.81±\pm0.58 5.69±\pm1.17 5.90±\pm0.51 4.14±\pm0.32
RMS LeNet 9.64±\pm0.97 16.00±\pm1.83 7.61±\pm0.58 4.76±\pm0.32 8.92±\pm1.92 18.09±\pm0.78 9.31±\pm0.71 9.49±\pm0.57 3.67±\pm0.19
ResNet 10.33±\pm0.22 17.49±\pm2.45 8.87±\pm0.25 6.57±\pm0.28 9.56±\pm2.55 19.97±\pm0.35 9.92±\pm0.83 9.18±\pm0.63 5.03±\pm0.28
ResNeXt 12.06±\pm0.31 19.80±\pm1.39 10.21±\pm0.79 7.04±\pm0.42 11.51±\pm1.03 22.71±\pm0.52 10.02±\pm0.35 9.56±\pm0.77 8.35±\pm0.26
WResNet 10.37±\pm0.38 19.21±\pm1.05 9.22±\pm0.71 7.71±\pm0.39 10.23±\pm0.78 28.56±\pm0.82 8.56±\pm0.31 8.96±\pm0.43 9.91±\pm0.23
SwinT 6.48±\pm0.13 9.66±\pm0.89 6.52±\pm0.51 4.71±\pm0.17 6.24±\pm1.33 8.85±\pm0.31 9.01±\pm0.45 7.95±\pm0.55 6.52±\pm0.59
Mean 9.78±\pm0.40 16.43±\pm1.52 8.49±\pm0.57 6.16±\pm0.32 9.69±\pm1.52 18.84±\pm0.56 9.36±\pm0.53 9.03±\pm0.59 6.70±\pm0.31
Adversarial (ε=0.03)\varepsilon=0.03) FGSM LeNet 52.78±\pm0.66 47.97±\pm1.03 51.27±\pm0.72 52.55±\pm0.36 41.96±\pm0.35 38.79±\pm0.62 28.46±\pm0.53 24.31±\pm0.29 46.34±\pm0.31
ResNet 44.64±\pm0.35 34.32±\pm0.35 44.93±\pm0.25 48.36±\pm0.51 37.47±\pm1.66 47.83±\pm0.88 19.87±\pm0.62 18.41±\pm0.58 34.56±\pm0.98
ResNeXt 42.80±\pm0.42 35.49±\pm0.87 42.91±\pm1.44 46.38±\pm0.45 34.04±\pm0.52 47.86±\pm0.92 19.36±\pm0.99 18.00±\pm0.11 33.21±\pm0.83
WResNet 43.14±\pm0.23 32.03±\pm0.26 43.51±\pm0.38 43.86±\pm0.23 34.62±\pm0.38 49.62±\pm0.49 18.35±\pm0.30 14.44±\pm0.70 29.61±\pm0.86
SwinT 60.22±\pm0.75 64.77±\pm0.29 61.12±\pm1.56 67.92±\pm0.30 56.16±\pm0.75 20.92±\pm1.27 20.71±\pm0.23 19.24±\pm1.13 27.33±\pm0.90
Mean 48.72±\pm0.48 42.92±\pm0.56 48.75±\pm0.87 51.81±\pm0.37 40.85±\pm0.73 41.00±\pm0.84 21.35±\pm0.53 18.88±\pm0.56 34.21±\pm0.78
PGD40 LeNet 75.38±\pm0.46 72.20±\pm1.46 73.34±\pm0.85 73.94±\pm0.36 61.55±\pm0.22 53.27±\pm0.18 41.87±\pm0.73 37.84±\pm0.32 59.10±\pm0.99
ResNet 78.05±\pm0.16 63.15±\pm0.49 78.59±\pm0.69 81.43±\pm0.72 70.82±\pm0.25 70.11±\pm1.23 41.84±\pm0.54 37.65±\pm0.69 60.51±\pm0.85
ResNeXt 76.11±\pm0.42 66.02±\pm0.78 75.58±\pm0.77 81.23±\pm0.33 66.11±\pm0.12 68.90±\pm0.96 37.28±\pm0.95 35.31±\pm0.49 53.21±\pm0.73
WResNet 77.08±\pm0.18 62.63±\pm0.31 77.89±\pm0.51 80.49±\pm0.51 68.04±\pm0.37 71.02±\pm0.97 39.78±\pm0.78 37.81±\pm0.37 48.36±\pm0.82
SwinT 72.71±\pm0.57 87.26±\pm0.65 67.60±\pm1.09 83.08±\pm1.05 71.44±\pm1.01 21.66±\pm1.07 49.70±\pm0.93 42.81±\pm0.84 67.60±\pm0.88
Mean 75.87±\pm0.36 70.25±\pm0.74 74.60±\pm0.78 80.03±\pm0.59 67.59±\pm0.39 56.99±\pm0.88 42.09±\pm0.79 38.28±\pm0.54 57.76±\pm0.85
Adversarial (ε=0.3)\varepsilon=0.3) FGSM LeNet 86.36±\pm0.41 78.38±\pm1.07 85.92±\pm1.32 89.54±\pm0.90 52.84±\pm0.72 51.27±\pm1.04 50.11±\pm0.75 47.19±\pm0.77 52.56±\pm0.81
ResNet 75.96±\pm1.39 62.41±\pm1.96 75.58±\pm0.95 82.52±\pm0.24 49.82±\pm0.95 32.84±\pm0.63 41.21±\pm1.51 36.55±\pm1.24 52.09±\pm0.84
ResNeXt 75.58±\pm0.59 62.91±\pm1.25 76.46±\pm1.24 81.04±\pm0.82 34.05±\pm0.94 32.68±\pm0.80 38.82±\pm0.63 35.39±\pm0.68 51.29±\pm0.84
WResNet 69.53±\pm0.88 57.65±\pm1.11 73.30±\pm1.06 78.05±\pm0.45 38.73±\pm0.98 27.56±\pm0.73 38.58±\pm0.85 34.56±\pm0.38 45.55±\pm0.95
SwinT 91.05±\pm0.83 87.31±\pm1.51 90.91±\pm1.94 92.47±\pm1.26 14.89±\pm2.59 79.70±\pm1.16 37.51±\pm1.09 32.89±\pm1.15 83.56±\pm0.73
Mean 79.70±\pm0.82 69.73±\pm1.38 80.43±\pm1.30 84.72±\pm0.73 38.07±\pm1.24 44.81±\pm0.87 41.25±\pm0.97 37.32±\pm0.84 57.01±\pm0.83
PGD40 LeNet 92.34±\pm0.76 91.14±\pm2.94 95.58±\pm0.48 93.21±\pm0.32 92.43±\pm0.68 81.01±\pm1.96 68.03±\pm0.33 65.66±\pm0.83 83.76±\pm1.25
ResNet 94.68±\pm0.41 82.66±\pm1.68 94.60±\pm0.35 95.39±\pm0.21 94.37±\pm0.31 81.07±\pm1.72 70.50±\pm0.31 66.05±\pm1.27 76.84±\pm1.20
ResNeXt 94.31±\pm0.44 86.32±\pm1.52 93.94±\pm1.28 95.41±\pm0.12 93.54±\pm0.77 78.74±\pm1.05 65.44±\pm0.82 63.20±\pm0.90 71.84±\pm0.98
WResNet 94.78±\pm0.53 83.34±\pm0.73 94.79±\pm0.32 95.53±\pm0.59 94.04±\pm0.75 77.35±\pm1.32 69.20±\pm0.35 67.99±\pm1.53 71.17±\pm1.53
SwinT 92.71±\pm0.82 94.79±\pm1.21 95.06±\pm1.04 94.71±\pm2.35 94.81±\pm1.83 89.03±\pm2.51 94.41±\pm0.56 91.24±\pm1.81 90.91±\pm1.57
Mean 93.76±\pm0.59 87.65±\pm1.62 94.79±\pm0.69 94.85±\pm0.72 93.84±\pm0.87 81.44±\pm1.71 73.52±\pm0.47 70.83±\pm1.27 78.90±\pm1.31
Table 3: Error rates of various methods across different architectures for the CIFAR-10 dataset. LA improves Clean error and mCE compared to standard training, and also exhibits superior robustness to adversarial examples and generalization across attacks, even outperforming adversarial training.
Training methods (columns): Std. | Mixup | AugMix | AutoAug. | FGSM | PGD10 | P.G.$^{++}$ | P.G.J.$^{++}$ | AugMix$^{++}$
Clean LeNet 39.84±\pm0.31 41.55±\pm0.54 39.41±\pm0.66 38.89±\pm0.34 41.93±\pm0.58 56.99±\pm0.89 38.80±\pm0.68 39.22±\pm0.42 40.07±\pm0.87
ResNet 31.10±\pm0.25 32.24±\pm0.53 30.87±\pm0.52 29.91±\pm0.32 33.97±\pm0.36 47.47±\pm0.63 30.67±\pm0.61 30.15±\pm0.47 29.72±\pm0.62
ResNeXt 32.95±\pm0.27 34.89±\pm0.68 32.03±\pm0.53 29.69±\pm0.32 35.66±\pm0.21 51.02±\pm0.76 30.69±\pm0.72 30.01±\pm0.36 30.92±\pm0.73
WResNet 29.43±\pm0.19 31.38±\pm0.37 28.91±\pm0.72 27.32±\pm0.37 32.26±\pm0.64 42.41±\pm2.01 28.62±\pm0.85 28.59±\pm0.31 27.45±\pm0.58
SwinT 27.92±\pm0.17 29.05±\pm0.29 27.47±\pm0.32 27.47±\pm0.23 30.93±\pm0.26 34.94±\pm1.99 27.23±\pm0.99 26.32±\pm0.25 27.39±\pm0.70
Mean 32.25±\pm0.24 33.82±\pm0.48 31.74±\pm0.55 30.66±\pm0.32 34.95±\pm0.41 46.57±\pm1.26 31.20±\pm0.77 30.86±\pm0.36 31.11±\pm0.70
mCE LeNet 57.27±\pm0.34 56.45±\pm0.98 51.97±\pm0.76 52.71±\pm0.36 54.18±\pm0.44 66.42±\pm0.55 53.61±\pm0.39 53.41±\pm0.23 54.96±\pm0.46
ResNet 52.81±\pm0.25 50.43±\pm0.95 44.46±\pm0.19 45.88±\pm0.27 48.50±\pm0.71 58.46±\pm0.58 46.58±\pm0.33 46.32±\pm0.25 47.08±\pm0.66
ResNeXt 52.62±\pm0.21 51.38±\pm0.44 45.12±\pm0.25 44.98±\pm0.29 49.41±\pm0.73 64.17±\pm0.63 46.66±\pm0.23 45.90±\pm0.22 46.45±\pm0.61
WResNet 48.33±\pm0.25 48.81±\pm0.57 41.78±\pm0.32 42.78±\pm0.25 46.74±\pm0.32 57.05±\pm1.76 44.48±\pm0.27 44.31±\pm0.39 43.41±\pm0.84
SwinT 45.73±\pm0.15 45.08±\pm0.32 40.87±\pm0.21 42.45±\pm0.22 43.75±\pm0.39 45.16±\pm1.96 44.73±\pm0.84 41.37±\pm0.21 43.25±\pm0.89
Mean 51.35±\pm0.24 50.43±\pm0.65 44.84±\pm0.35 45.76±\pm0.28 48.52±\pm0.52 58.25±\pm1.10 47.21±\pm0.41 46.26±\pm0.26 47.03±\pm0.69
Calibration ECE LeNet 15.79±\pm0.21 10.20±\pm0.98 11.49±\pm0.31 6.37±\pm0.21 15.57±\pm0.91 25.32±\pm1.02 14.08±\pm0.96 13.47±\pm0.54 10.61±\pm0.24
ResNet 18.40±\pm0.23 9.59±\pm1.71 15.76±\pm0.31 12.32±\pm0.23 15.57±\pm1.65 29.32±\pm1.25 15.51±\pm0.42 14.15±\pm0.73 12.73±\pm0.32
ResNeXt 20.83±\pm0.17 11.35±\pm0.85 17.52±\pm0.34 12.81±\pm0.27 21.64±\pm0.92 33.70±\pm1.03 15.49±\pm1.56 14.24±\pm0.65 12.87±\pm0.56
WResNet 18.51±\pm0.33 10.10±\pm0.77 16.54±\pm0.36 13.08±\pm0.25 19.76±\pm0.73 27.52±\pm1.11 16.97±\pm1.01 16.41±\pm0.39 11.88±\pm0.80
SwinT 14.14±\pm0.35 6.36±\pm1.06 11.39±\pm0.32 7.31±\pm0.19 14.64±\pm1.60 19.37±\pm0.72 14.02±\pm0.88 12.12±\pm0.87 11.09±\pm0.72
Mean 17.53±\pm0.26 9.52±\pm1.07 14.54±\pm0.33 10.38±\pm0.23 17.44±\pm1.16 27.05±\pm1.03 15.21±\pm0.97 14.08±\pm0.64 11.84±\pm0.53
RMS LeNet 18.43±\pm0.23 11.35±\pm1.23 13.59±\pm0.73 28.08±\pm1.81 18.08±\pm1.92 27.59±\pm1.08 16.01±\pm1.23 17.44±\pm0.88 12.81±\pm1.25
ResNet 22.11±\pm0.27 10.89±\pm1.42 19.21±\pm0.32 15.41±\pm0.22 22.16±\pm2.55 32.28±\pm1.23 14.53±\pm1.08 18.04±\pm0.61 15.66±\pm1.02
ResNeXt 24.64±\pm0.33 12.55±\pm1.52 21.15±\pm0.89 15.87±\pm0.34 25.39±\pm1.03 36.39±\pm1.12 15.64±\pm1.15 15.53±\pm0.71 15.71±\pm0.85
WResNet 22.94±\pm0.46 11.21±\pm1.18 20.42±\pm0.77 16.28±\pm0.23 23.91±\pm0.78 30.80±\pm0.95 15.18±\pm0.87 18.15±\pm0.43 15.39±\pm0.83
SwinT 17.58±\pm0.39 7.56±\pm1.22 14.59±\pm0.25 9.39±\pm0.21 17.86±\pm1.33 22.86±\pm0.73 12.52±\pm0.73 14.49±\pm0.47 15.22±\pm0.74
Mean 21.14±\pm0.34 10.71±\pm1.31 17.79±\pm0.59 27.41±\pm0.26 21.48±\pm1.52 29.98±\pm1.02 14.78±\pm1.01 16.53±\pm0.62 14.96±\pm0.94
Adversarial (ε=0.03)\varepsilon=0.03) FGSM LeNet 69.41±\pm0.25 67.51±\pm0.96 68.71±\pm0.87 70.62±\pm0.52 63.28±\pm0.35 64.11±\pm2.03 56.95±\pm0.66 57.16±\pm0.85 64.98±\pm1.12
ResNet 64.16±\pm0.44 58.07±\pm0.94 63.03±\pm0.85 65.54±\pm0.48 57.62±\pm1.66 64.93±\pm1.32 46.78±\pm0.73 46.31±\pm0.39 51.76±\pm1.11
ResNeXt 62.55±\pm0.47 58.67±\pm0.73 63.41±\pm0.99 63.77±\pm0.53 56.74±\pm0.52 64.52±\pm1.16 46.81±\pm0.52 46.04±\pm0.91 50.17±\pm0.84
WResNet 61.97±\pm0.42 55.38±\pm0.82 62.16±\pm1.83 63.52±\pm0.58 55.94±\pm0.38 62.29±\pm1.62 48.01±\pm0.65 43.46±\pm0.62 48.27±\pm1.03
SwinT 75.35±\pm0.32 74.78±\pm0.88 75.23±\pm2.03 78.44±\pm0.89 66.96±\pm0.75 49.39±\pm1.91 47.12±\pm0.49 50.64±\pm0.88 49.19±\pm0.98
Mean 66.69±\pm0.38 62.88±\pm0.87 66.51±\pm1.31 68.38±\pm0.60 60.11±\pm0.73 61.05±\pm1.61 49.13±\pm0.61 48.72±\pm0.73 52.87±\pm1.02
PGD40 LeNet 82.87±\pm0.38 82.41±\pm1.26 82.42±\pm1.05 84.42±\pm0.23 77.31±\pm0.22 73.31±\pm0.29 72.09±\pm0.85 73.07±\pm0.30 79.98±\pm1.09
ResNet 81.78±\pm0.32 78.05±\pm1.18 82.03±\pm0.94 84.45±\pm0.38 75.82±\pm0.25 76.88±\pm0.76 62.99±\pm0.94 62.46±\pm0.68 72.02±\pm0.96
ResNeXt 80.25±\pm0.43 77.18±\pm1.86 81.34±\pm1.12 83.42±\pm0.77 74.66±\pm0.12 76.23±\pm0.85 64.41±\pm1.15 64.74±\pm0.89 69.10±\pm0.85
WResNet 82.01±\pm0.35 76.36±\pm1.07 82.19±\pm1.32 83.87±\pm0.31 76.51±\pm0.37 75.70±\pm1.11 63.10±\pm0.79 61.42±\pm0.77 70.81±\pm0.89
SwinT 84.33±\pm0.65 89.05±\pm0.95 85.34±\pm1.17 88.29±\pm0.96 74.29±\pm1.01 52.94±\pm1.35 63.23±\pm0.25 73.85±\pm0.94 71.06±\pm1.01
Mean 82.25±\pm0.43 80.61±\pm1.26 82.66±\pm1.12 84.89±\pm0.53 75.72±\pm0.39 71.01±\pm0.87 65.16±\pm0.80 67.11±\pm0.72 72.59±\pm0.96
Adversarial (ε=0.3)\varepsilon=0.3) FGSM LeNet 91.92±\pm0.57 90.73±\pm1.21 89.97±\pm1.52 92.91±\pm0.87 73.54±\pm0.72 69.49±\pm1.25 81.85±\pm1.14 81.85±\pm0.86 87.48±\pm0.87
ResNet 88.24±\pm0.78 84.96±\pm1.23 85.58±\pm1.23 87.94±\pm0.36 69.55±\pm0.95 63.27±\pm1.06 75.39±\pm1.37 73.61±\pm0.95 76.77±\pm0.91
ResNeXt 87.21±\pm0.79 84.69±\pm1.38 82.93±\pm1.43 85.71±\pm0.39 69.68±\pm0.94 63.29±\pm1.03 74.69±\pm0.98 70.92±\pm0.73 78.05±\pm1.03
WResNet 85.46±\pm0.82 83.94±\pm1.63 82.46±\pm1.47 84.64±\pm0.52 64.95±\pm0.98 60.71±\pm1.23 71.62±\pm1.01 67.86±\pm0.79 74.27±\pm0.95
SwinT 91.38±\pm0.99 89.87±\pm1.65 90.09±\pm1.86 92.53±\pm0.99 40.83±\pm2.59 79.08±\pm1.17 73.91±\pm0.96 74.75±\pm1.06 74.57±\pm0.84
Mean 88.84±\pm0.79 86.84±\pm1.42 86.21±\pm1.50 88.75±\pm0.63 63.71±\pm1.24 67.17±\pm1.15 75.49±\pm1.09 73.80±\pm0.88 78.23±\pm0.92
PGD40 LeNet 92.75±\pm0.45 94.39±\pm1.23 92.66±\pm0.89 94.46±\pm0.22 92.94±\pm0.68 83.50±\pm1.46 86.77±\pm0.76 86.33±\pm0.53 92.21±\pm1.15
ResNet 93.29±\pm0.61 91.69±\pm1.42 92.01±\pm0.88 93.69±\pm0.17 92.76±\pm0.31 81.35±\pm1.95 81.45±\pm0.92 80.79±\pm0.57 89.24±\pm0.55
ResNeXt 92.63±\pm0.72 92.37±\pm1.03 91.62±\pm0.72 93.61±\pm0.33 92.56±\pm0.77 79.65±\pm1.20 83.98±\pm0.84 83.91±\pm0.44 86.21±\pm0.98
WResNet 93.11±\pm0.79 90.55±\pm0.97 92.21±\pm0.74 93.71±\pm0.15 92.33±\pm0.75 82.85±\pm1.65 83.02±\pm0.47 82.00±\pm0.68 87.79±\pm1.23
SwinT 93.92±\pm1.03 95.55±\pm1.76 93.39±\pm1.11 95.63±\pm1.02 92.23±\pm1.83 91.64±\pm1.93 82.15±\pm0.94 92.87±\pm0.85 88.33±\pm1.11
Mean 93.14±\pm0.72 92.91±\pm1.28 92.38±\pm0.87 94.22±\pm0.38 92.56±\pm0.87 83.80±\pm1.64 83.47±\pm0.79 85.18±\pm0.61 88.76±\pm1.00
Table 4: Error rates of various methods across different architectures for the CIFAR-100 dataset. LA improves Clean, mCE, and adversarial errors compared to standard training. Also, under lower-budget attacks, it consistently outperforms adversarial training.

Comparisons of LS and MTL versus LA. Table 5 summarizes the comparison of LS and MTL with LA. For CIFAR-10, the improvement in Clean error is 5.22%, 3.66%, and 15.57% for LS, MTL, and LA, respectively. The ECE increases by 32.96% and 14.26% for LS and MTL, respectively, whereas LA improves both the ECE and the RMS calibration error, reducing the ECE by 1.90%.

In both low and high $\varepsilon$ budget attacks, LA exhibits superior improvement compared to LS and MTL, with enhancements of up to 50.29% and 28.27% for FGSM and PGD with $\varepsilon=0.3$, respectively. Nevertheless, the effectiveness of LS in improving adversarial error on CIFAR-10 is noticeable and has also been highlighted in the findings of Shafahi et al. (2019), Pang et al. (2020), and Ren et al. (2021).

Similarly, for CIFAR-100, LA outperforms the others in mCE improvement, with an 8.32% error reduction. In terms of adversarial errors, it improves on the baseline by 29.87% for FGSM and 25.11% for PGD under the low budget $\varepsilon=0.03$, as well as by 20.59% and 11.93% under FGSM and PGD attacks with the higher budget of $\varepsilon=0.3$, respectively.

CIFAR-10:

| Error | Std. | LS | MTL | P.G.J.$^{++}$ |
|---|---|---|---|---|
| Clean | 9.57±0.21 | 9.07±1.25 | 9.22±0.23 | 8.08±0.19 |
| mCE | 22.73±0.27 | 21.89±1.79 | 20.59±0.58 | 16.78±0.71 |
| ECE | 6.31±0.21 | 8.39±2.03 | 7.21±0.81 | 6.19±0.26 |
| RMS | 10.37±0.38 | 10.92±1.88 | 11.17±0.25 | 8.96±0.43 |
| FGSM ($\varepsilon=0.03$) | 43.14±0.23 | 27.01±1.02 | 41.31±0.19 | 14.44±0.70 |
| PGD ($\varepsilon=0.03$) | 77.08±0.18 | 56.6±1.21 | 66.06±0.62 | 37.81±0.37 |
| FGSM ($\varepsilon=0.3$) | 69.53±0.88 | 52.36±2.66 | 71.38±0.55 | 34.56±0.38 |
| PGD ($\varepsilon=0.3$) | 94.78±0.53 | 82.43±1.54 | 93.36±0.73 | 67.99±1.53 |

CIFAR-100:

| Error | Std. | LS | MTL | P.G.J.$^{++}$ |
|---|---|---|---|---|
| Clean | 29.43±0.19 | 27.85±1.38 | 32.23±0.45 | 28.59±0.31 |
| mCE | 48.33±0.25 | 47.11±1.91 | 48.32±0.56 | 44.31±0.39 |
| ECE | 18.51±0.33 | 7.89±1.83 | 22.26±0.92 | 16.41±0.39 |
| RMS | 22.94±0.46 | 8.78±1.45 | 26.79±0.33 | 18.15±0.43 |
| FGSM ($\varepsilon=0.03$) | 61.97±0.42 | 53.21±0.98 | 62.01±0.36 | 43.46±0.62 |
| PGD ($\varepsilon=0.03$) | 82.01±0.35 | 76.22±1.10 | 77.06±0.74 | 61.42±0.77 |
| FGSM ($\varepsilon=0.3$) | 85.46±0.82 | 80.44±1.55 | 84.04±0.92 | 67.86±0.79 |
| PGD ($\varepsilon=0.3$) | 93.11±0.79 | 91.66±2.03 | 90.16±0.65 | 82.00±0.68 |

Table 5: Performance comparison of Label Smoothing and Multi-Task Learning to LA (P.G.J.$^{++}$) using the Wide ResNet-50 architecture. For both CIFAR-10 and CIFAR-100, LA improves Clean, mCE, calibration, and adversarial errors compared to standard training and consistently outperforms LS and MTL in improving adversarial robustness.

5 Conclusion

To align our naming conventions and label assignments when training DNNs, we developed Label Augmentation. Essentially, LA assigns a one-hot label to each of the operations used during augmentation. Then, instead of solely adding transformed data to the training pipeline, LA also augments the labels by concatenating input labels with operation labels, using a factor of $\delta$. This automatically enriches the labels without requiring extra human annotation and proves advantageous in enhancing robustness against both common and adversarial perturbations. In terms of Clean error and mCE, comparative analysis shows LA performing nearly as well as AugMix and AutoAugment. However, in terms of adversarial robustness, LA is significantly better than the other augmentation methods and can even outperform adversarial training. LA is flexible and could be employed in other modalities; for instance, future work could apply LA to audio inputs when training with noisy audio signals. This study, alongside much of the existing research on distributional shift, primarily focuses on evaluating model robustness to 2D image transformations, largely overlooking the changes in viewpoint within 3D transformations found in various real-world applications (e.g., autonomous driving). It has been demonstrated that common image classifiers are highly vulnerable to adversarial viewpoints (Dong et al., 2022). Future studies could explore whether employing LA—with rotation as augmentation—can enhance robustness against adversarial viewpoints.

6 Acknowledgement

This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223. We would also like to thank the anonymous reviewers for their helpful and informative comments.

References

  • Andriushchenko et al. (2020) Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European conference on computer vision, pp.  484–501. Springer, 2020.
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp.  274–283. PMLR, 2018.
  • Azulay & Weiss (2018) Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
  • Bai et al. (2021) Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. Recent advances in adversarial training for adversarial robustness. arXiv preprint arXiv:2102.01356, 2021.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Bu et al. (2023) Qingwen Bu, Dong Huang, and Heming Cui. Towards building more robust models with frequency bias. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4402–4411, 2023.
  • Cai et al. (2023) Yi Cai, Xuefei Ning, Huazhong Yang, and Yu Wang. Ensemble-in-one: ensemble learning within random gated networks for enhanced adversarial robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  14738–14747, 2023.
  • Carmon et al. (2019) Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. Advances in neural information processing systems, 32, 2019.
  • Croce & Hein (2020a) Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. In International Conference on Machine Learning, pp.  2196–2205. PMLR, 2020a.
  • Croce & Hein (2020b) Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pp.  2206–2216. PMLR, 2020b.
  • Cubuk et al. (2019) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  113–123, 2019.
  • Deng et al. (2021a) Zhijie Deng, Xiao Yang, Shizhen Xu, Hang Su, and Jun Zhu. Libre: A practical bayesian approach to adversarial detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  972–982, 2021a.
  • Deng et al. (2021b) Zhun Deng, Linjun Zhang, Amirata Ghorbani, and James Zou. Improving adversarial robustness via unlabeled out-of-domain data. In International Conference on Artificial Intelligence and Statistics, pp.  2845–2853. PMLR, 2021b.
  • DeVries & Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Dodge & Karam (2016) Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks. In 2016 eighth international conference on quality of multimedia experience (QoMEX), pp.  1–6. IEEE, 2016.
  • Dong et al. (2022) Yinpeng Dong, Shouwei Ruan, Hang Su, Caixin Kang, Xingxing Wei, and Jun Zhu. Viewfool: Evaluating the robustness of visual recognition to adversarial viewpoints. Advances in Neural Information Processing Systems, 35:36789–36803, 2022.
  • Eom & Ham (2019) Chanho Eom and Bumsub Ham. Learning disentangled representation for robust person re-identification. Advances in neural information processing systems, 32, 2019.
  • Ford et al. (2019) Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
  • Gabbay & Hoshen (2019) Aviv Gabbay and Yedid Hoshen. Demystifying inter-class disentanglement. arXiv preprint arXiv:1906.11796, 2019.
  • Gawlikowski et al. (2021) Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342, 2021.
  • Geirhos et al. (2018) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Gowal et al. (2021) Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp.  1321–1330. PMLR, 2017.
  • HarryKim (2020) HarryKim. Torchattacks: A PyTorch library for adversarial attacks, 2020. URL https://adversarial-attacks-pytorch.readthedocs.io/en/latest/. Accessed: 2024-05-23.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hendrycks & Dietterich (2019) Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • Hendrycks et al. (2019) Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019.
  • Hong et al. (2021) Minui Hong, Jinwoo Choi, and Gunhee Kim. Stylemix: Separating content and style for enhanced data augmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  14862–14870, 2021.
  • Inoue (2018) Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
  • Jackson et al. (2019) Philip TG Jackson, Amir Atapour Abarghouei, Stephen Bonner, Toby P Breckon, and Boguslaw Obara. Style augmentation: data augmentation via style randomization. In CVPR workshops, volume 6, pp.  10–11, 2019.
  • Kireev et al. (2022) Klim Kireev, Maksym Andriushchenko, and Nicolas Flammarion. On the effectiveness of adversarial training against common corruptions. In Uncertainty in Artificial Intelligence, pp.  1012–1021. PMLR, 2022.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  • Laugros et al. (2019) Alfred Laugros, Alice Caplier, and Matthieu Ospici. Are adversarial robustness and common perturbation robustness independant attributes? In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.  0–0, 2019.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2022) Tao Li, Yingwen Wu, Sizhe Chen, Kun Fang, and Xiaolin Huang. Subspace adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13409–13418, 2022.
  • Lightning (2024) PyTorch Lightning. torchmetrics: Metrics for pytorch. https://lightning.ai/docs/torchmetrics/stable/classification/calibration_error.html, 2024. Accessed: April 5, 2024.
  • Liu et al. (2021a) Jiashuo Liu, Zheyan Shen, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624, 2021a.
  • Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  10012–10022, 2021b.
  • Lopes et al. (2019) Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
  • Loshchilov & Hutter (2016) Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Lukasik et al. (2020) Michal Lukasik, Srinadh Bhojanapalli, Aditya Menon, and Sanjiv Kumar. Does label smoothing mitigate label noise? In International Conference on Machine Learning, pp.  6448–6458. PMLR, 2020.
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Metzen et al. (2017) Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
  • Mintun et al. (2021) Eric Mintun, Alexander Kirillov, and Saining Xie. On interaction between augmentations and corruptions in natural corruption robustness. Advances in Neural Information Processing Systems, 34:3571–3583, 2021.
  • Modas et al. (2022) Apostolos Modas, Rahul Rade, Guillermo Ortiz-Jiménez, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Prime: A few primitives can boost robustness to common corruptions. In European Conference on Computer Vision, pp.  623–640. Springer, 2022.
  • Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2574–2582, 2016.
  • Müller et al. (2019) Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
  • Nicolaou et al. (2022) Anguelos Nicolaou, Vincent Christlein, Edgar Riba, Jian Shi, Georg Vogeler, and Mathias Seuret. Tormentor: Deterministic dynamic-path, data augmentations with fractals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2707–2711, 2022.
  • Ovadia et al. (2019) Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  • Pang et al. (2018) Tianyu Pang, Chao Du, Yinpeng Dong, and Jun Zhu. Towards robust detection of adversarial examples. Advances in neural information processing systems, 31, 2018.
  • Pang et al. (2020) Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. arXiv preprint arXiv:2010.00467, 2020.
  • Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.
  • Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pp.  582–597. IEEE, 2016b.
  • Papernot et al. (2018) Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, and Rujun Long. Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2018.
  • Qin et al. (2022) Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, and Xuezhi Wang. Understanding and improving robustness of vision transformers through patch-based negative augmentation. Advances in Neural Information Processing Systems, 35:16276–16289, 2022.
  • Ren et al. (2021) Qibing Ren, Liangliang Shi, Lanjun Wang, and Junchi Yan. Adversarial robustness via adaptive label smoothing. 2021.
  • Rusak et al. (2020) Evgenia Rusak, Lukas Schott, Roland S Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. A simple way to make neural networks robust against diverse image corruptions. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp.  53–69. Springer, 2020.
  • Shafahi et al. (2019) Ali Shafahi, Amin Ghiasi, Furong Huang, and Tom Goldstein. Label smoothing and logit squeezing: A replacement for adversarial training? arXiv preprint arXiv:1910.11585, 2019.
  • Shorten & Khoshgoftaar (2019) Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • Standley et al. (2020) Trevor Standley, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Which tasks should be learned together in multi-task learning? In International conference on machine learning, pp.  9120–9132. PMLR, 2020.
  • Summers & Dinneen (2019) Cecilia Summers and Michael J Dinneen. Improved mixed-example data augmentation. In 2019 IEEE winter conference on applications of computer vision (WACV), pp.  1262–1270. IEEE, 2019.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Tramèr et al. (2017a) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017a.
  • Tramèr et al. (2017b) Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017b.
  • Träuble et al. (2021) Frederik Träuble, Elliot Creager, Niki Kilbertus, Francesco Locatello, Andrea Dittadi, Anirudh Goyal, Bernhard Schölkopf, and Stefan Bauer. On disentangled representations learned from correlated data. In International Conference on Machine Learning, pp.  10401–10412. PMLR, 2021.
  • Vasiljevic et al. (2016) Igor Vasiljevic, Ayan Chakrabarti, and Gregory Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760, 2016.
  • Wang et al. (2023a) Deng-Bao Wang, Lanqing Li, Peilin Zhao, Pheng-Ann Heng, and Min-Ling Zhang. On the pitfall of mixup for uncertainty calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7609–7618, 2023a.
  • Wang et al. (2021) Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. Augmax: Adversarial composition of random augmentations for robust training. Advances in neural information processing systems, 34:237–250, 2021.
  • Wang et al. (2023b) Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, and Shuicheng Yan. Better diffusion models further improve adversarial training. In International Conference on Machine Learning, pp.  36246–36263. PMLR, 2023b.
  • Wu et al. (2020) Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in neural information processing systems, 33:2958–2969, 2020.
  • Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1492–1500, 2017.
  • Xin et al. (2022) Derrick Xin, Behrooz Ghorbani, Justin Gilmer, Ankush Garg, and Orhan Firat. Do current multi-task optimization methods in deep learning even help? Advances in neural information processing systems, 35:13597–13609, 2022.
  • Xu et al. (2023) Mingle Xu, Sook Yoon, Alvaro Fuentes, and Dong Sun Park. A comprehensive survey of image augmentation techniques for deep learning. Pattern Recognition, pp.  109347, 2023.
  • Xu et al. (2017) Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
  • Yao et al. (2022) Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp.  25407–25437. PMLR, 2022.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang & Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
  • Zhong et al. (2020) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  13001–13008, 2020.
  • Zini et al. (2022) Simone Zini, Alex Gomez-Villa, Marco Buzzelli, Bartłomiej Twardowski, Andrew D Bagdanov, and Joost van de Weijer. Planckian jitter: countering the color-crippling effects of color jitter on self-supervised training. arXiv preprint arXiv:2202.07993, 2022.

Appendix A Appendix

This appendix reports results for evaluation under AutoAttack, results obtained when employing a wider range of augmentation operations with LA, and the percentage changes relative to the baseline for both the CIFAR-10 and CIFAR-100 datasets.

Tables 6 and 7 present the percentage improvement of each method over standard training for CIFAR-10 and CIFAR-100, respectively. Table 10 then summarizes the impact of incorporating additional operations and intensity levels.

| Metric | Model | Std. | Mixup | AugMix | AutoAug. | FGSM | PGD10 | P.G.++ | P.G.J.++ | AugMix++ |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | LeNet | 0.00 | 12.29 | -1.00 | -3.21 | -0.94 | 77.49 | -5.74 | -13.89 | 17.90 |
| Clean | ResNet | 0.00 | 3.68 | -1.89 | -9.34 | 5.27 | 117.10 | -11.63 | -15.90 | -5.27 |
| Clean | ResNeXt | 0.00 | 3.52 | -3.52 | -16.80 | 3.96 | 115.83 | -19.26 | -24.01 | -5.54 |
| Clean | WResNet | 0.00 | 8.67 | -1.67 | -12.85 | 7.31 | 162.59 | -16.82 | -15.57 | -0.94 |
| Clean | SwinT | 0.00 | -15.10 | -28.72 | -5.69 | -28.44 | 14.61 | -41.64 | -42.98 | -28.72 |
| Clean | Mean | 0.00 | 2.14 | -8.29 | -8.92 | -4.17 | 90.00 | -19.53 | -23.29 | -4.42 |
| mCE | LeNet | 0.00 | -0.11 | -15.58 | -10.94 | -0.74 | 43.98 | -14.84 | -16.04 | 9.35 |
| mCE | ResNet | 0.00 | -7.12 | -24.28 | -21.21 | -9.35 | 51.72 | -24.89 | -25.05 | -4.05 |
| mCE | ResNeXt | 0.00 | -3.91 | -24.49 | -23.00 | -10.27 | 63.11 | -25.45 | -27.10 | 6.52 |
| mCE | WResNet | 0.00 | -5.19 | -25.16 | -21.29 | -7.66 | 87.33 | -26.79 | -26.18 | 0.09 |
| mCE | SwinT | 0.00 | 0.58 | -22.43 | -0.13 | -12.59 | 1.17 | -28.30 | -28.22 | -22.43 |
| mCE | Mean | 0.00 | -3.05 | -22.15 | -15.19 | -7.90 | 49.00 | -23.73 | -24.23 | -1.69 |
| Calib. ECE | LeNet | 0.00 | 125.77 | -24.15 | -57.00 | -4.86 | 125.18 | -8.98 | 12.67 | -67.89 |
| Calib. ECE | ResNet | 0.00 | 169.48 | -14.47 | -41.34 | -4.77 | 159.46 | -7.47 | -7.63 | -51.67 |
| Calib. ECE | ResNeXt | 0.00 | 148.18 | -20.83 | -48.44 | 0.52 | 150.65 | -19.53 | -24.09 | -28.65 |
| Calib. ECE | WResNet | 0.00 | 192.39 | -12.04 | -34.55 | 2.22 | 298.73 | -19.02 | -1.90 | -2.22 |
| Calib. ECE | SwinT | 0.00 | 111.38 | -19.85 | -35.35 | 18.89 | -26.15 | 25.18 | 22.03 | -7.26 |
| Calib. ECE | Mean | 0.00 | 151.67 | -18.37 | -44.33 | 1.06 | 153.43 | -8.78 | -2.18 | -33.65 |
| Calib. RMS | LeNet | 0.00 | 65.98 | -21.06 | -50.62 | -7.47 | 87.66 | -3.42 | -1.56 | -61.93 |
| Calib. RMS | ResNet | 0.00 | 69.31 | -14.13 | -36.40 | -7.45 | 93.32 | -3.97 | -11.13 | -51.31 |
| Calib. RMS | ResNeXt | 0.00 | 64.18 | -15.34 | -41.63 | -4.56 | 88.31 | -16.92 | -20.73 | -30.76 |
| Calib. RMS | WResNet | 0.00 | 85.25 | -11.09 | -25.65 | -1.35 | 175.41 | -17.45 | -13.60 | -4.44 |
| Calib. RMS | SwinT | 0.00 | 49.07 | 0.62 | -27.31 | -3.70 | 36.57 | 39.04 | 22.69 | 0.62 |
| Calib. RMS | Mean | 0.00 | 68.09 | -13.20 | -37.01 | -4.95 | 100.86 | -4.21 | -7.65 | -31.51 |
| FGSM (ε = 0.03) | LeNet | 0.00 | -9.11 | -2.86 | -0.44 | -20.50 | -26.51 | -46.08 | -53.94 | -12.20 |
| FGSM (ε = 0.03) | ResNet | 0.00 | -23.12 | 0.65 | 8.33 | -16.06 | 7.15 | -55.49 | -58.76 | -22.58 |
| FGSM (ε = 0.03) | ResNeXt | 0.00 | -17.08 | 0.26 | 8.36 | -20.47 | 11.82 | -54.77 | -57.94 | -22.41 |
| FGSM (ε = 0.03) | WResNet | 0.00 | -25.75 | 0.86 | 1.67 | -19.75 | 15.02 | -57.46 | -66.53 | -31.36 |
| FGSM (ε = 0.03) | SwinT | 0.00 | 7.56 | 1.49 | 12.79 | -6.74 | -65.26 | -65.61 | -68.05 | -54.62 |
| FGSM (ε = 0.03) | Mean | 0.00 | -11.91 | 0.07 | 6.36 | -16.15 | -15.83 | -56.17 | -61.24 | -29.78 |
| PGD40 (ε = 0.03) | LeNet | 0.00 | -4.22 | -2.71 | -1.91 | -18.35 | -29.33 | -44.45 | -49.80 | -21.60 |
| PGD40 (ε = 0.03) | ResNet | 0.00 | -19.09 | 0.69 | 4.33 | -9.26 | -10.17 | -46.39 | -51.76 | -22.47 |
| PGD40 (ε = 0.03) | ResNeXt | 0.00 | -13.26 | -0.70 | 6.73 | -13.14 | -9.47 | -51.02 | -53.61 | -30.09 |
| PGD40 (ε = 0.03) | WResNet | 0.00 | -18.75 | 1.05 | 4.42 | -11.73 | -7.86 | -48.39 | -50.95 | -37.26 |
| PGD40 (ε = 0.03) | SwinT | 0.00 | 20.01 | -7.03 | 14.26 | -1.75 | -70.21 | -31.65 | -41.12 | -7.03 |
| PGD40 (ε = 0.03) | Mean | 0.00 | -7.40 | -1.67 | 5.49 | -10.91 | -24.88 | -44.52 | -49.54 | -23.87 |
| FGSM (ε = 0.3) | LeNet | 0.00 | -9.24 | -0.51 | 3.68 | -38.81 | -40.63 | -41.98 | -45.36 | -39.14 |
| FGSM (ε = 0.3) | ResNet | 0.00 | -17.84 | -0.50 | 8.64 | -34.41 | -56.77 | -45.75 | -51.88 | -31.42 |
| FGSM (ε = 0.3) | ResNeXt | 0.00 | -16.76 | 1.16 | 7.22 | -54.95 | -56.76 | -48.64 | -53.18 | -32.14 |
| FGSM (ε = 0.3) | WResNet | 0.00 | -17.09 | 5.42 | 12.25 | -44.30 | -60.36 | -44.51 | -50.29 | -34.49 |
| FGSM (ε = 0.3) | SwinT | 0.00 | -4.11 | -0.15 | 1.56 | -61.68 | -12.47 | -58.80 | -63.88 | -8.23 |
| FGSM (ε = 0.3) | Mean | 0.00 | -12.50 | 0.93 | 6.31 | -47.22 | -43.77 | -48.25 | -53.18 | -28.47 |
| PGD40 (ε = 0.3) | LeNet | 0.00 | -1.30 | 3.51 | 0.94 | 0.10 | -12.27 | -26.33 | -28.89 | -9.29 |
| PGD40 (ε = 0.3) | ResNet | 0.00 | -12.70 | -0.08 | 0.75 | -0.33 | -14.37 | -25.54 | -30.24 | -18.84 |
| PGD40 (ε = 0.3) | ResNeXt | 0.00 | -8.47 | -0.39 | 1.17 | -0.82 | -16.51 | -30.61 | -32.99 | -23.83 |
| PGD40 (ε = 0.3) | WResNet | 0.00 | -12.07 | 0.01 | 0.79 | -0.78 | -18.39 | -26.99 | -28.27 | -24.91 |
| PGD40 (ε = 0.3) | SwinT | 0.00 | 2.24 | 2.53 | 2.16 | 2.27 | -3.97 | 1.83 | -1.59 | -1.94 |
| PGD40 (ε = 0.3) | Mean | 0.00 | -6.52 | 1.10 | 1.16 | 0.08 | -13.14 | -21.59 | -24.46 | -15.85 |
Table 6: Percentage change in the error rates of various training methods across different architectures on CIFAR-10, relative to standard training. Negative values indicate an improvement in error rate when the augmentation technique is employed.
| Metric | Model | Std. | Mixup | AugMix | AutoAug. | FGSM | PGD10 | P.G.++ | P.G.J.++ | AugMix++ |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | LeNet | 0.00 | 4.29 | -1.08 | -2.38 | 5.25 | 43.05 | -2.61 | -1.56 | 0.58 |
| Clean | ResNet | 0.00 | 3.67 | -0.74 | -3.83 | 9.23 | 52.64 | -1.38 | -3.05 | -4.44 |
| Clean | ResNeXt | 0.00 | 5.89 | -2.79 | -9.89 | 8.22 | 54.84 | -6.86 | -8.92 | -6.16 |
| Clean | WResNet | 0.00 | 6.63 | -1.77 | -7.17 | 9.62 | 44.10 | -2.75 | -2.85 | -6.73 |
| Clean | SwinT | 0.00 | 4.05 | -1.61 | -1.61 | 10.78 | 25.14 | -2.47 | -5.73 | -1.90 |
| Clean | Mean | 0.00 | 4.88 | -1.58 | -4.94 | 8.38 | 44.40 | -3.24 | -4.31 | -3.53 |
| mCE | LeNet | 0.00 | -1.43 | -9.25 | -7.96 | -5.40 | 15.98 | -6.39 | -6.74 | -4.03 |
| mCE | ResNet | 0.00 | -4.51 | -15.81 | -13.12 | -8.16 | 10.70 | -11.80 | -12.29 | -10.85 |
| mCE | ResNeXt | 0.00 | -2.36 | -14.25 | -14.52 | -6.10 | 21.95 | -11.33 | -12.77 | -11.73 |
| mCE | WResNet | 0.00 | 0.99 | -13.55 | -11.48 | -3.29 | 18.04 | -7.97 | -8.32 | -10.18 |
| mCE | SwinT | 0.00 | -1.42 | -10.63 | -7.17 | -4.33 | -1.25 | -2.19 | -9.53 | -5.42 |
| mCE | Mean | 0.00 | -1.80 | -12.68 | -10.89 | -5.52 | 13.44 | -8.06 | -9.91 | -8.42 |
| Calib. ECE | LeNet | 0.00 | -35.40 | -27.23 | -59.66 | -1.39 | 60.35 | -10.83 | -14.69 | -32.81 |
| Calib. ECE | ResNet | 0.00 | -47.88 | -14.35 | -33.04 | -15.38 | 59.35 | -15.71 | -23.10 | -30.82 |
| Calib. ECE | ResNeXt | 0.00 | -45.51 | -15.89 | -38.50 | 3.89 | 61.79 | -25.64 | -31.64 | -38.21 |
| Calib. ECE | WResNet | 0.00 | -45.43 | -10.64 | -29.34 | 6.75 | 48.68 | -8.32 | -11.35 | -35.82 |
| Calib. ECE | SwinT | 0.00 | -55.02 | -19.45 | -48.30 | 3.54 | 36.99 | -0.85 | -14.29 | -21.57 |
| Calib. ECE | Mean | 0.00 | -45.71 | -17.08 | -40.81 | -0.56 | 54.25 | -13.23 | -19.71 | -32.50 |
| Calib. RMS | LeNet | 0.00 | -38.42 | -26.26 | 52.36 | -1.90 | 49.70 | -13.13 | -5.37 | -30.49 |
| Calib. RMS | ResNet | 0.00 | -50.75 | -13.12 | -30.30 | 0.23 | 46.00 | -34.28 | -18.41 | -29.17 |
| Calib. RMS | ResNeXt | 0.00 | -49.07 | -14.16 | -35.59 | 3.04 | 47.69 | -36.53 | -36.97 | -36.24 |
| Calib. RMS | WResNet | 0.00 | -51.13 | -10.99 | -29.03 | 4.23 | 34.26 | -33.83 | -20.88 | -32.91 |
| Calib. RMS | SwinT | 0.00 | -57.00 | -17.01 | -46.59 | 1.59 | 30.03 | -28.78 | -17.58 | -13.42 |
| Calib. RMS | Mean | 0.00 | -49.33 | -15.84 | -19.56 | 1.61 | 41.84 | -30.10 | -20.86 | -29.24 |
| FGSM (ε = 0.03) | LeNet | 0.00 | -2.74 | -1.01 | 1.74 | -8.83 | -7.64 | -17.95 | -17.65 | -6.38 |
| FGSM (ε = 0.03) | ResNet | 0.00 | -9.49 | -1.76 | 2.15 | -10.19 | 1.20 | -27.09 | -27.82 | -19.33 |
| FGSM (ε = 0.03) | ResNeXt | 0.00 | -6.20 | 1.37 | 1.95 | -9.29 | 3.15 | -25.16 | -26.39 | -19.79 |
| FGSM (ε = 0.03) | WResNet | 0.00 | -10.63 | 0.31 | 2.50 | -9.73 | 0.52 | -22.53 | -29.87 | -22.11 |
| FGSM (ε = 0.03) | SwinT | 0.00 | -0.76 | -0.16 | 4.10 | -11.13 | -34.45 | -37.47 | -32.79 | -34.72 |
| FGSM (ε = 0.03) | Mean | 0.00 | -5.71 | -0.27 | 2.53 | -9.87 | -8.46 | -26.32 | -26.94 | -20.71 |
| PGD40 (ε = 0.03) | LeNet | 0.00 | -0.56 | -0.54 | 1.87 | -6.71 | -11.54 | -13.01 | -11.83 | -3.49 |
| PGD40 (ε = 0.03) | ResNet | 0.00 | -4.56 | 0.31 | 3.26 | -7.29 | -5.99 | -22.98 | -23.62 | -11.93 |
| PGD40 (ε = 0.03) | ResNeXt | 0.00 | -3.83 | 1.36 | 3.95 | -6.97 | -5.01 | -19.74 | -19.33 | -13.89 |
| PGD40 (ε = 0.03) | WResNet | 0.00 | -6.89 | 0.22 | 2.27 | -6.71 | -7.69 | -23.06 | -25.11 | -13.66 |
| PGD40 (ε = 0.03) | SwinT | 0.00 | 5.60 | 1.20 | 4.70 | -11.91 | -37.22 | -25.02 | -12.43 | -15.74 |
| PGD40 (ε = 0.03) | Mean | 0.00 | -1.99 | 0.51 | 3.21 | -7.94 | -13.66 | -20.77 | -18.41 | -11.74 |
| FGSM (ε = 0.3) | LeNet | 0.00 | -1.29 | -2.12 | 1.08 | -20.00 | -24.40 | -10.96 | -10.96 | -4.83 |
| FGSM (ε = 0.3) | ResNet | 0.00 | -3.72 | -3.01 | -0.34 | -21.18 | -28.30 | -14.56 | -16.58 | -13.00 |
| FGSM (ε = 0.3) | ResNeXt | 0.00 | -2.89 | -4.91 | -1.72 | -20.10 | -27.43 | -14.36 | -18.68 | -10.50 |
| FGSM (ε = 0.3) | WResNet | 0.00 | -1.78 | -3.51 | -0.96 | -24.00 | -28.96 | -16.19 | -20.59 | -13.09 |
| FGSM (ε = 0.3) | SwinT | 0.00 | -1.65 | -1.41 | 1.26 | -55.32 | -13.46 | -19.12 | -18.20 | -18.40 |
| FGSM (ε = 0.3) | Mean | 0.00 | -2.26 | -2.97 | -0.11 | -28.29 | -24.40 | -15.03 | 7.00 | -11.95 |
| PGD40 (ε = 0.3) | LeNet | 0.00 | 1.77 | -0.10 | 1.84 | 0.20 | -9.97 | -6.45 | -6.92 | -0.58 |
| PGD40 (ε = 0.3) | ResNet | 0.00 | -1.72 | -1.37 | 0.43 | -0.57 | -12.80 | -12.69 | -13.40 | -4.34 |
| PGD40 (ε = 0.3) | ResNeXt | 0.00 | -0.28 | -1.09 | 1.06 | -0.08 | -14.01 | -9.34 | -9.41 | -6.93 |
| PGD40 (ε = 0.3) | WResNet | 0.00 | -2.75 | -0.97 | 0.64 | -0.84 | -11.02 | -10.84 | -4.77 | -5.71 |
| PGD40 (ε = 0.3) | SwinT | 0.00 | 1.74 | -0.56 | 1.82 | -1.80 | -2.43 | -12.53 | -1.12 | -5.95 |
| PGD40 (ε = 0.3) | Mean | 0.00 | -0.25 | -0.82 | 1.16 | -0.62 | -10.03 | -10.38 | -7.11 | -4.71 |
Table 7: Percentage change in the error rates of various training methods across different architectures on CIFAR-100, relative to standard training. Negative values indicate an improvement in error rate when the augmentation technique is employed.
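The ECE and RMS rows above refer to the expected calibration error and its root-mean-square variant. Below is a minimal sketch of computing both with the torchmetrics package cited in the references; the logits and targets are random placeholders, and the argument names follow the current torchmetrics documentation rather than the paper's exact evaluation code.

```python
import torch
from torchmetrics.classification import MulticlassCalibrationError

num_classes = 10
# norm="l1" corresponds to the expected calibration error (ECE),
# norm="l2" to the root-mean-square calibration error (RMS).
ece = MulticlassCalibrationError(num_classes=num_classes, n_bins=15, norm="l1")
rms = MulticlassCalibrationError(num_classes=num_classes, n_bins=15, norm="l2")

logits = torch.randn(256, num_classes)           # placeholder model outputs
targets = torch.randint(0, num_classes, (256,))  # placeholder ground-truth labels

print("ECE:", ece(logits, targets).item())
print("RMS calibration error:", rms(logits, targets).item())
```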

The impact of incorporating additional operations and intensity levels.

Table 10 presents the results obtained by employing additional augmentation operations and intensity levels while utilizing LA.

| Err. | Std. | LA |
|------|------|----|
| Clean | 9.57 ± 0.21 | 6.79 ± 1.72 |
| mCE | 22.73 ± 0.27 | 13.72 ± 1.07 |
| ECE | 6.31 ± 0.21 | 5.23 ± 0.66 |
| RMS | 10.37 ± 0.38 | 9.26 ± 0.40 |
| FGSM (ε = 0.03) | 43.14 ± 0.23 | 14.15 ± 4.24 |
| PGD (ε = 0.03) | 77.08 ± 0.18 | 35.04 ± 0.44 |
| FGSM (ε = 0.3) | 69.53 ± 0.88 | 25.89 ± 5.69 |
| PGD (ε = 0.3) | 94.78 ± 0.53 | 50 ± 3.20 |
| AA | 86.05 | 72.07 |
Table 8: CIFAR-10
| Err. | Std. | LA |
|------|------|----|
| Clean | 29.43 ± 0.19 | 27.74 ± 1.26 |
| mCE | 48.33 ± 0.25 | 40.52 ± 1.67 |
| ECE | 18.51 ± 0.33 | 18.13 ± 0.77 |
| RMS | 22.94 ± 0.46 | 22.37 ± 0.59 |
| FGSM (ε = 0.03) | 61.97 ± 0.42 | 39.65 ± 3.25 |
| PGD (ε = 0.03) | 82.01 ± 0.35 | 56.82 ± 1.04 |
| FGSM (ε = 0.3) | 85.46 ± 0.82 | 56.67 ± 2.88 |
| PGD (ε = 0.3) | 93.11 ± 0.79 | 71.36 ± 2.75 |
| AA | 93.97 | 82.22 |
Table 9: CIFAR-100
Table 10: Error rates of the Wide ResNet-50 architecture when employing LA with seven augmentation types (Plasma, Gamma, PlanckianJitter, ColorJiggle, Equalize, Posterize, and Rain) across three severity levels. The parameter δ corresponds to the intensity of the added noise, with δ = 0.8, 0.6, and 0.2 for high, moderate, and slight noise, respectively. Similar trends in error rate improvement are observed with the additional augmentation types; while they still improve on the baseline error, in some cases they yield a smaller improvement than employing fewer operations.
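For orientation only, the sketch below shows one way of pairing a randomly chosen operation with one of the three severity levels δ ∈ {0.2, 0.6, 0.8}. Only the three operations with direct torchvision equivalents (Gamma, Posterize, Equalize) are included, and the mapping from δ to per-operation parameters is an assumption made for this illustration; Plasma, PlanckianJitter, ColorJiggle, and Rain are available in augmentation libraries such as Kornia. This is not the paper's training code.

```python
import random
import torch
import torchvision.transforms.functional as TF

SEVERITIES = [0.2, 0.6, 0.8]  # slight, moderate, high

def apply_random_op(img_uint8):
    """Apply one randomly chosen operation at a randomly chosen severity delta."""
    delta = random.choice(SEVERITIES)
    op = random.choice(["gamma", "posterize", "equalize"])
    if op == "gamma":
        # larger delta -> stronger deviation from gamma = 1 (illustrative mapping)
        out = TF.adjust_gamma(img_uint8, gamma=1.0 + delta)
    elif op == "posterize":
        # larger delta -> fewer bits kept (illustrative mapping)
        out = TF.posterize(img_uint8, bits=max(1, int(8 - 7 * delta)))
    else:
        out = TF.equalize(img_uint8)
    return out, op, delta

img = torch.randint(0, 256, (3, 32, 32), dtype=torch.uint8)  # dummy CIFAR-sized image
aug, op, delta = apply_random_op(img)
print(op, delta, aug.shape)
```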

The performance under AutoAttack.

AutoAttack (Croce & Hein, 2020b) ensembles two parameter-free versions of the PGD attack with the Fast Adaptive Boundary attack (Croce & Hein, 2020a) and the Square Attack (Andriushchenko et al., 2020) to form a diverse evaluation suite. We evaluate under AutoAttack with an L∞ constraint of ε = 0.03, using the implementation provided by the Torchattacks 3.5.1 library (HarryKim, 2020).
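A minimal evaluation sketch is given below, assuming the torchattacks interface for AutoAttack (the AutoAttack class with norm, eps, version, and n_classes arguments); model and test_loader stand in for a trained classifier and the corresponding test set, and exact argument names may differ across library versions.

```python
import torch
import torchattacks

def autoattack_error(model, test_loader, device="cuda", num_classes=10):
    """Robust error rate (%) under AutoAttack with an L-infinity budget of 0.03."""
    model.eval().to(device)
    attack = torchattacks.AutoAttack(
        model, norm="Linf", eps=0.03, version="standard", n_classes=num_classes
    )
    wrong, total = 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        adv_images = attack(images, labels)       # generate adversarial examples
        preds = model(adv_images).argmax(dim=1)
        wrong += (preds != labels).sum().item()
        total += labels.numel()
    return 100.0 * wrong / total
```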

Tables 11, 12, and 13 summarize the performance of the various training methods under AutoAttack on both the CIFAR-10 and CIFAR-100 datasets. Although every method is less robust under AutoAttack than under the FGSM or PGD attacks, LA consistently outperforms all other methods. Specifically, relative to the standard training error on CIFAR-10, Mixup and PGD worsen it by 1.02% and 6.29%, respectively, whereas AugMix, AutoAugment, FGSM, P.G.++, P.G.J.++, and AugMix++ improve it by 2.02%, 1.66%, 4.44%, 7.61%, 11.76%, and 13.76%, respectively. For CIFAR-100, PGD shows a slight increase of 0.19%, whereas all other methods improve it: by 0.73%, 1.63%, 1.78%, 1.49%, 6.68%, 5.64%, and 11.93% for Mixup, AugMix, AutoAugment, FGSM, P.G.++, P.G.J.++, and AugMix++, respectively. For LS and MTL, the improvements under AutoAttack are 2.12% and 4.37% on CIFAR-10 and 1.11% and 1.49% on CIFAR-100, respectively.
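The figures above are the magnitude of the relative change in error with respect to standard training; Tables 6 and 7 report the same quantity with its sign, so negative values there denote improvement. A quick check against the CIFAR-10 entries of Table 11 below:

```python
# Relative change in error rate with respect to standard training
# (negative values mean the method improves on the baseline).
def relative_change(err, std_err):
    return 100.0 * (err - std_err) / std_err

std = 86.05  # standard training, CIFAR-10, AutoAttack (Table 11)
print(round(relative_change(86.93, std), 2))  # Mixup:    +1.02 (worse)
print(round(relative_change(74.21, std), 2))  # AugMix++: -13.76 (i.e., 13.76% better)
```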

| Dataset | Std. | Mixup | AugMix | AutoAug. | FGSM | PGD10 | P.G.++ | P.G.J.++ | AugMix++ |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | 86.05 | 86.93 | 84.31 | 84.62 | 82.23 | 91.46 | 79.50 | 75.93 | 74.21 |
| CIFAR-100 | 93.97 | 93.28 | 92.44 | 92.30 | 92.57 | 94.15 | 87.69 | 88.67 | 82.76 |
Table 11: Error rates (%) of the different training methods with the Wide ResNet-50 architecture on the CIFAR-10 and CIFAR-100 datasets under AutoAttack with an L∞ constraint of ε = 0.03.
| Err. | Std. | LS | MTL | P.G.J.++ |
|------|------|----|-----|----------|
| AA | 86.05 | 84.23 | 82.29 | 75.93 |
Table 12: Error rates (%) of the LS, MTL, and LA methods with the Wide ResNet-50 architecture on CIFAR-10 under AutoAttack with an L∞ constraint of ε = 0.03.
| Err. | Std. | LS | MTL | P.G.J.++ |
|------|------|----|-----|----------|
| AA | 93.97 | 92.93 | 92.57 | 88.67 |
Table 13: Error rates (%) of the LS, MTL, and LA methods with the Wide ResNet-50 architecture on CIFAR-100 under AutoAttack with an L∞ constraint of ε = 0.03.

Percentages of error rate variation compared to standard training.

Figure 4: Percentages of error rate variation compared to standard training on Wide ResNet-50: the left side employs LA, the right side normal augmentations. While normal augmentation can improve mCE considerably, it comes at the expense of Clean and calibration errors. In contrast, regardless of the type and number of operations used when augmenting with LA, we see improvements in Clean, mCE, calibration, and adversarial errors; using two or three types of operations proves even more effective.