Understanding Robust Overfitting from the Feature Generalization Perspective

Chaojian Yu
The University of Sydney
[email protected]
&Xiaolong Shi
The University of Sydney
[email protected]
Jun Yu
University of Science and Technology of China
[email protected]
&Bo Han
Hong Kong Baptist University
[email protected]
&Tongliang Liu
The University of Sydney
[email protected]

Abstract

Adversarial training (AT) constructs robust neural networks by incorporating adversarial perturbations into natural data. However, it is plagued by the issue of robust overfitting (RO), which severely damages the model’s robustness. In this paper, we investigate RO from a novel feature generalization perspective. Specifically, we design factor ablation experiments to assess the respective impacts of natural data and adversarial perturbations on RO, identifying that the inducing factor of RO stems from natural data. Given that the only difference between adversarial and natural training lies in the inclusion of adversarial perturbations, we further hypothesize that adversarial perturbations degrade the generalization of features in natural data and verify this hypothesis through extensive experiments. Based on these findings, we provide a holistic view of RO from the feature generalization perspective and explain various empirical behaviors associated with RO. To examine our feature generalization perspective, we devise two representative methods, attack strength and data augmentation, to prevent the feature generalization degradation during AT. Extensive experiments conducted on benchmark datasets demonstrate that the proposed methods can effectively mitigate RO and enhance adversarial robustness.

1 Introduction

Adversarial training (AT) [26] has been proved to be reliable to improve a model’s robustness against adversarial attacks [33, 14]. It constructs robust neural networks by incorporating adversarial perturbations into natural data and training models using adversarial data generated on-the-fly. AT has been established as one of the most effective empirical defenses [4]. Although promising to improve the model’s robustness, AT is plagued by the issue of robust overfitting (RO) [30]. That is, after a certain point during AT, the model’s robust accuracy continues to substantially decrease with further training. This phenomenon has been consistently observed across different datasets, network architectures, and AT variants [30].

In this work, we investigate RO from a novel feature generalization perspective. Firstly, we conduct a series of factor ablation experiments to identify the inducing factors of RO. Specifically, we consider natural data and adversarial perturbations from small-loss training data as distinct factors and employ factor ablation adversarial training to assess their individual impacts on RO. We observe that simultaneously removing adversarial perturbations and natural data from small-loss training data greatly mitigate RO. However, experiments that only remove adversarial perturbations still exhibits a severe degree of RO. Given that these factor ablation experiments strictly adhere to the controlling variables principle, with the sole variation being the inclusion of natural data in the training set, we can infer that the inducing factor of RO stems from natural data.

Given that the only difference between adversarial and natural training lies in the inclusion of adversarial perturbations in the natural data, we further hypothesize that adversarial perturbations may degrade the generalization of features in natural data and lead to the emergence of RO. To validate this hypothesis, we introduce an additional adversarial perturbation into the training natural data during AT. We observe that as the additional adversarial perturbation is incorporated, the model’s test robustness deteriorates. This suggests that the additional adversarial perturbations have a detrimental effect on feature generalization. Furthermore, we are able to simulate the RO phenomenon during AT by linearly increasing the budget of additional adversarial perturbations. These experimental results demonstrate that adversarial perturbations can degrade feature generalization and induce the emergence of RO phenomenon.

Based on these findings, we provide a holistic view of RO from the feature generalization perspective. In standard AT, a robustness gap exists between the training and test data due to factors like the memorization effect of deep networks [3, 12] and the distribution deviation between the finite training and test data. Certain small-loss training data maintain a significant robustness gap when compared to the test data. However, adversarial perturbations are generated on the fly and adaptively adjusted based on the model’s robustness. The robustness gap between training and test data leads to distinct adversarial perturbations for the features in natural data. These distinct adversarial perturbations degrade feature generalization. The degradation of feature generalization, in turn, weakens the model’s test performance and widens the robustness gap between training and test data, thus forming a vicious cycle. Additionally, we provide explanations for various empirical behaviors associated with RO based on our feature generalization perspective.

To examine our feature generalization perspective, we propose two representative methods: attack strength and data augmentation. Specifically, the attack strength method adjusts adversarial perturbations on small-loss training data to verify their impact on feature generalization, and the data augmentation method adjusts model’s robustness on small-loss training data to assess the effect of the robustness gap on feature generalization. Both methods can effectively prevent feature generalization degradation during AT and mitigate RO, thereby validating our feature generalization understanding of RO. Additionally, extensive experiments conducted on benchmark datasets demonstrate the efficacy of these methods in enhancing adversarial robustness. To sum up, our contributions are as follows:

•

We conduct factor ablation AT adhering to the principle of controlling variables and identify that the inducing factor of RO stems from natural data.
•

We further hypothesize that adversarial perturbations degrade the generalization of features in natural data and validate this hypothesis through extensive experiments.
•

Based on these findings, we provide a holistic view of RO from the feature generalization perspective and explain various empirical behaviors associated with RO.
•

To examine our feature generalization perspective, we propose two representative methods to prevent the degradation of feature generalization. Extensive experiments demonstrate that the proposed methods effectively mitigate RO and enhance adversarial robustness.

2 Related Work

In this section, we provide a brief overview of related works from two perspectives: AT and RO.

2.1 Adversarial Training

Let $f_{\theta}$ , $\mathcal{X}$ and $\ell$ denote the network $f$ with model parameter $\theta$ , the input space, and the loss function, respectively. Given a $C$ -class dataset $\mathcal{S}=\{(x_{i},y_{i})\}^{n}_{i=1}$ , where $x_{i}\in\mathcal{X}$ and $y_{i}\in\mathcal{Y}=\{0,1,\dots,C-1\}$ denotes its corresponding label, the objective function of natural training is defined as follows:

\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell(f_{\theta}(x_{i}),y_{i}),

(1)

where the network $f_{\theta}$ learns features in natural data that are correlated with associated labels to minimize the empirical risk of misclassifying natural inputs. However, evidence [33, 36, 17] suggests that naturally trained models tend to learn fragile features that are vulnerable to adversarial attacks. To address this issue, AT [14, 26] introduces adversarial perturbations to each data point by transforming $\mathcal{S}=\{(x_{i},y_{i})\}^{n}_{i=1}$ into $\mathcal{S}^{\prime}=\{(x_{i}^{\prime}=x_{i}+\delta_{i},y_{i})\}^{n}_{i=1}$ . The adversarial perturbations $\{\delta_{i}\}^{n}_{i=1}$ are constrained by a pre-specified budget, i.e. $\{\delta\in\Delta:||\delta||_{p}\leq\epsilon\}$ , where $p$ can be $1,2,\infty$ , etc. Therefore, the objective of AT is to solve the following min-max optimization problem:

\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\max_{\delta_{i}\in\Delta}\ell(f_{\theta}(x_{i}+\delta_{i}),y_{i}),

(2)

where the inner maximization process aims to generate adversarial data that maximizes the classification loss, and the outer minimization process optimizes model parameter using the generated adversarial data. This iterative procedure aims to train an adversarially robust classifier. The most commonly employed approach for generating adversarial data is Projected Gradient Descent (PGD) [26], which applies adversarial attack to natural data $x_{i}$ over multiple steps $k$ with a step size of $\alpha$ :

\delta^{k}=\Pi_{\Delta}(\alpha\cdot\mathrm{sign}\nabla_{x}\ell(f_{\theta}(x+\delta^{k-1}),y)+\delta^{k-1}),k\in\mathbb{N},

(3)

where $\delta^{k}$ represents the adversarial perturbation at step $k$ , and $\Pi_{\Delta}$ denotes the projection operator.

Besides standard AT, there are also several AT variants [20, 44, 39]. One typical variant is TRADES [44], which trains the network on both natural data and adversarial data:

\min_{\theta}\sum_{i}\big{\{}\mathrm{CE}(f_{\theta}(x_{i}),y_{i})+\beta\cdot\max_{\delta_{i}\in\Delta}\mathrm{KL}(f_{\theta}(x_{i})||f_{\theta}(x_{i}+\delta_{i}))\big{\}},

(4)

where CE is the cross-entropy loss that encourages the network to maximize natural accuracy, KL is the Kullback-Leibler divergence that encourages robustness improvement, and the hyperparameter $\beta$ is used to regulate the tradeoff between natural accuracy and adversarial robustness.

Moreover, further improvements to adversarial training have also been explored, such as curriculum adversarial training [5, 38], ensemble adversarial training [35, 28], and adaptive budget schedule [24, 1, 19].

2.2 Robust Overfitting

RO was initially observed in standard AT [26]. Subsequent systematic studies by Rice et al. [30] showed that traditional remedies for overfitting in deep learning offer limited help in addressing RO in AT, prompting further efforts to explore RO. Schmidt et al. [31] attributed RO to sample complexity theory and suggested that robust generalization requires more training data, which is supported by empirical results in derivative works [6, 2, 43]. Other works also proposed various strategies to mitigate RO without relying on additional training data, such as sample reweighting [39, 45, 25], label smoothing [18], stochastic weight averaging [7], temporal ensembling [12], knowledge distillation [7], weight regularization [40, 41], and data augmentation [29, 34, 22]. Additionally, RO has been studied from various perspectives, such as label noise [11], input-loss landscape [23] and minimax game [37]. In this work, we provide a novel feature generalization understanding of RO.

3 Method

In this section, we begin with factor ablation experiments to identify the inducing factors of RO (Section 3.1). Furthermore, we propose a hypothesis that adversarial perturbations degrade feature generalization and verify this hypothesis through extensive experiments. Following this, we provide a holistic view of RO from the feature generalization perspective and explain various empirical behaviors associated with RO (Section 3.2). Lastly, we present two representative methods aimed at examining our feature generalization understanding of RO (Section 3.3).

Refer to caption — (a) Factor ablation AT

3.1 Factor Ablation Adversarial Training

Drawing inspiration from the data ablation experiments conducted by Yu et al. [41], which investigated the influence of small-loss training data on RO by excluding such data during training, we propose factor ablation adversarial training as a method to pinpoint the inducing factors of RO. Following a similar experimental setup, we also utilize a fixed loss threshold to differentiate small-loss training data. For instance, in the CIFAR10 dataset, data with an adversarial loss below 1.7 is regarded as small-loss training data. Unlike data ablation experiments that treat adversarial data as a single entity, our approach considers natural data and adversarial perturbations from small-loss training data as separate factors. We conduct more granular factor ablation adversarial training to identify the specific factor inducing RO. Specifically, we train a PreAct ResNet-18 model on CIFAR-10 dataset under the $\ell_{\infty}$ threat model and remove designated factors before RO occurs (i.e., at the 100th epoch). These experiments include: i) baseline, which is a baseline group where no factors are removed; ii) data & perturbation, which removes both natural data and adversarial perturbations from small-loss training data; and iii) perturbation, which solely removes adversarial perturbations from small-loss training data.

It’s important to note that these experiments strictly adhere to the principle of controlling variables. They are entirely identical before RO occurred, with the sole distinction among different experimental groups being the removal of designated factors from the training set. The experimental results of factor ablation adversarial training are summarized in Figure 1(a). It is observed that the data & perturbation group demonstrates a significant alleviation of RO, whereas both the baseline and perturbation groups continue to exhibit severe RO. Given that the only difference between the data & perturbation and perturbation groups lies in the inclusion of natural data in the training set, we can infer that the inducing factor of RO stems from natural data. Similar effects can be observed across different datasets, network architectures, and AT variants (see Appendix A), indicating that it is a general finding in adversarial training.

3.2 Analysis of Robust Overfitting from the Feature Generalization Perspective

Based on the factor ablation experiments in Section 3.1, we know that natural data acts as the inducing factor of RO. It’s noteworthy that natural data is also present in natural training. However, in natural training, RO phenomenon is typically absent. For instance, deep learning models trained on natural data often achieve zero training error, effectively memorizing the training data, without observable detrimental effects on generalization performance [30]. In contrast, RO is a dominant phenomenon in AT, which commonly exhibits significantly degraded generalization performance. Given that the only difference between adversarial and natural training lies in the inclusion of adversarial perturbations in the natural data, a natural hypothesis arises: adversarial perturbations may degrade the generalization of features in natural data and lead to the emergence of RO.

Verification. To validate the aforementioned hypothesis, before generating training adversarial data in AT, we introduce an additional adversarial perturbation into the training natural data and observe the model’s robust generalization performance. Specifically, before the onset of RO (i.e., between the 50th and 100th epochs), we incorporate an extra adversarial perturbation in the opposite direction into the training natural data and observe the model’s test robustness. We conduct experiments with varying budgets of additional adversarial perturbations, such as $\epsilon=\{0/255,2/255,4/255,6/255,8/255\}$ . The model’s test robustness is shown in Figure 1(b). It is important to note that the additional adversarial perturbations are introduced only on the training natural data, while the model and the test data remain unchanged, ensuring that the model’s robustness on the test data remains unaffected. However, we observe that as the additional adversarial perturbation is incorporated, the model’s test robustness deteriorates, indicating that the additional adversarial perturbations exert a detrimental effect on feature generalization. Furthermore, we linearly increase the budget of additional adversarial perturbations between the 50th and 100th epochs, and the model’s test robustness is summarized in Figure 1(c). It is observed that the RO phenomenon can be simulated by adjusting the adversarial perturbations on training data. In Appendix B, we also provide more experimental results conducted at different stages of training. These experimental results support our hypothesis that the adversarial perturbations can degrade feature generalization and induce the emergence of RO.

A holistic view of RO from the feature generalization perspective. In standard AT, there exists a robustness gap between the training and test data before the onset of RO, as shown in Figure 1(d). This robustness gap arises due to factors such as the memorization effect of deep networks [3, 12] and the distribution deviation between the finite training and test data. Certain small-loss training data maintain a significant robustness gap when compared to the test data. However, adversarial perturbations are generated on the fly and adaptively adjusted based on the model’s robustness. For instance, consider a single data point, when the model’s robustness is poor, adversarial perturbations mainly impede the model’s learning of non-robust features. Conversely, when the model’s robustness is strong, adversarial perturbation will hinder the model’s learning of robust features because the process of generating adversarial perturbation aims to maximize the classification loss. The robustness gap between training and test data leads to distinct adversarial perturbations for the features in natural data, degrading feature generalization. As feature generalization deteriorates, the model’s adversarial robustness on the test data progressively declines, and the robustness gap between training and test data continues to widen, forming a vicious cycle, as illustrated in Figure 2.

Based on the analysis above, some small-loss training data maintain a significant robustness gap compared to the test data. This can explain why removing small-loss adversarial data from the training data can effectively prevent RO. Furthermore, AT exhibits some other empirical behaviors associated with RO, for instance, as the perturbation budget increases, the extent of RO initially rises and then decreases. Our feature generalization perspective provides an intuitive explanation: as the perturbation budget increases from 0, the impact of adversarial perturbations on the generalization of features in natural data also grows, leading to a higher probability of RO phenomenon during AT. This can explain why natural training does not exhibit RO, and as the adversarial perturbation budget increases, the extent of RO also rises. However, with a further increase in the perturbation budget, the model’s robustness on the training data is limited, leading to a narrowing robustness gap between training and test data. The narrowing robustness gap reduces the generation of adversarial perturbations that degrade feature generalization. Therefore, the extent of RO gradually decreases.

3.3 The Proposed Methods

In Section 3.2, we analyze RO as a result of the degradation of Feature Generalization (ROFG), which arises from the distinct adversarial perturbations for the features in natural data. In this part, we introduce two approaches to examine our feature generalization understanding of RO.

ROFG through attack strength. In order to verify our analysis that adversarial perturbations on some small-loss training data degrade feature generalization, we conduct AT with varying attack strengths to generate adversarial perturbations on the small-loss training data before RO occurs (i.e., at the 100th epoch). Specifically, we use a fixed loss threshold $t=1.7$ to distinguish small-loss training data and train PreAct ResNet-18 on CIFAR10 under the $\ell_{\infty}$ threat model with adjusted perturbation budgets $\epsilon_{a}$ on small-loss training data, ranging from $0/255$ to $24/255$ . The pseudocode is provided in Algorithm 1, which utilizes attack strength to adjust the adversarial perturbations on small-loss training data, referred to as ROFG_AS.

For ROFG_AS with different perturbation budgets $\epsilon_{a}$ , we evaluate the model’s training and test robustness under the standard perturbation budget of $\epsilon=8/255$ , with the learning curves summarized in Figure 3(a). We observe a clear correlation between the applied perturbation budgets $\epsilon_{a}$ and the extent of RO. Specifically, the greater the perturbation budgets $\epsilon_{a}$ , the milder the extent of RO. When the perturbation budget $\epsilon_{a}$ exceeds a certain threshold, such as $16/255$ , the model exhibits almost no RO. It is worth noting that similar patterns are also observed across different datasets, network architectures, and AT variants (as shown in Appendix C). These experimental results demonstrate that adjusting adversarial perturbations on small-loss training data can effectively prevent the degradation of feature generalization, thereby alleviating the RO phenomenon, validating our feature generalization understanding of RO.

Algorithm 1 ROFG_AS

Input: Network

f_{\theta}

, training data

\mathcal{S}

, mini-batch

\mathcal{B}

, batch size

n

, learning rate

\eta

, PGD step size

\alpha

, PGD budget

\epsilon

, PGD steps

K

, adjusted PGD budget

\epsilon_{a}

, adjusted PGD steps

K_{a}

, small-loss threshold

t

Output: Adversarially robust model

f_{\theta}

repeat

Read mini-batch

(x_{\mathcal{B}},y_{\mathcal{B}})

from training set

\mathcal{S}

x_{\mathcal{B}}^{\prime}\leftarrow x_{\mathcal{B}}+\delta

, where

\delta\sim\mathrm{Uniform}(-\epsilon,\epsilon)

for

k=1

K

x_{\mathcal{B}}^{\prime}\leftarrow\Pi_{\epsilon}(x_{\mathcal{B}}^{\prime}+\alpha\cdot\mathrm{sign}(\nabla_{x_{\mathcal{B}}^{\prime}}\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})))

end for

x_{t}=x_{\mathcal{B}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

#Distinguish small-loss training data

x_{t}

y_{t}=y_{\mathcal{B}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

x_{t}^{\prime}\leftarrow x_{t}+\delta

, where

\delta\sim\mathrm{Uniform}(-\epsilon_{a},\epsilon_{a})

for

k=1

K_{a}

x_{t}^{\prime}\leftarrow\Pi_{\epsilon_{a}}(x_{t}^{\prime}+\alpha\cdot\mathrm{sign}(\nabla_{x_{t}^{\prime}}\ell(f_{\theta}(x_{t}^{\prime}),y_{t})))

#Generate adjusted adversarial perturbations for

x_{t}

end for

x_{\mathcal{B}}^{\prime}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)=x_{t}^{\prime}

#Replace the adversarial perturbations of

x_{t}

\theta\leftarrow\theta-\eta\nabla_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell(f_{\theta}({x^{\prime}}_{\mathcal{B}}^{(i)}),y_{\mathcal{B}}^{(i)})

until training converged

Algorithm 2 ROFG_DA

Input: Network

f_{\theta}

, training data

\mathcal{S}

, mini-batch

\mathcal{B}

, batch size

n

, learning rate

\eta

, PGD step size

\alpha

, PGD budget

\epsilon

, PGD steps

K

, small-loss data proportion

p

, small-loss threshold

t

Output: Adversarially robust model

f_{\theta}

repeat

Read mini-batch

(x_{\mathcal{B}},y_{\mathcal{B}})

from training set

\mathcal{S}

x_{\mathcal{B}}^{\prime}\leftarrow x_{\mathcal{B}}+\delta

, where

\delta\sim\mathrm{Uniform}(-\epsilon,\epsilon)

for

k=1

K

x_{\mathcal{B}}^{\prime}\leftarrow\Pi_{\epsilon}(x_{\mathcal{B}}^{\prime}+\alpha\cdot\mathrm{sign}(\nabla_{x_{\mathcal{B}}^{\prime}}\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})))

end for

n_{t}=\sum{\mathbb{I}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

#Count the number of small-loss training data

n_{t}/n>p

then

repeat

x_{t}=x_{\mathcal{B}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

#Distinguish small-loss training data

x_{t}

y_{t}=y_{\mathcal{B}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

x_{t}=\mathrm{DataAugmentation}(x_{t})

#Conduct data augmentation on

x_{t}

x_{t}^{\prime}\leftarrow x_{t}+\delta

, where

\delta\sim\mathrm{Uniform}(-\epsilon,\epsilon)

for

k=1

K

x_{t}^{\prime}\leftarrow\Pi_{\epsilon}(x_{t}^{\prime}+\alpha\cdot\mathrm{sign}(\nabla_{x_{t}^{\prime}}\ell(f_{\theta}(x_{t}^{\prime}),y_{t})))

end for

x_{\mathcal{B}}^{\prime}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)(\ell(f_{\theta}(x_{t}^{\prime}),y_{t})>t)=x_{t}^{\prime}(\ell(f_{\theta}(x_{t}^{\prime}),y_{t})>t)

#Update training data

n_{t}=\sum{\mathbb{I}}(\ell(f_{\theta}(x_{\mathcal{B}}^{\prime}),y_{\mathcal{B}})\leq t)

#Update the number of small-loss training data

until

n_{t}/n\leq p

end if

\theta\leftarrow\theta-\eta\nabla_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell(f_{\theta}({x^{\prime}}_{\mathcal{B}}^{(i)}),y_{\mathcal{B}}^{(i)})

until training converged

ROFG through data augmentation. To further support our analysis that the model’s robustness gap between training and test data promotes the generation of adversarial perturbations that degrades feature generalization, we employ data augmentation techniques to weaken the model’s robustness on small-loss training data before RO occurs (i.e., at the 100th epoch). However, the image transformations in data augmentation techniques are often stochastic and may not yield the desired effect. To address this issue, we iteratively apply random image transformations to the small-loss training data, consistently selecting the transformed data that does not fall within the small-loss scope. Specifically, we use a fixed loss threshold $t=1.7$ to distinguish small-loss training data, and set a threshold $p$ for the proportion of small-loss training data when train PreAct ResNet-18 on CIFAR10 under the $\ell_{\infty}$ threat model. At the beginning of each iteration, we check whether the proportion of small-loss training data meets the specified threshold $p$ . If this proportion is below the specified threshold $p$ , we apply random image transformations to the small-loss training data until the specified threshold $p$ is reached. The pseudocode is provided in Algorithm 2, where the adopted data augmentation technique is AugMix [16]. We refer to this adversarial training pipeline, empowered by the data augmentation technique, as ROFG_DA.

The learning curves of ROFG_DA with different thresholds $p$ are summarized in Figure 3(b). We observe a clear correlation between the threshold $p$ and the extent of RO. As the robustness gap between small-loss training data and test data decreases, the extent of RO becomes increasingly mild. Furthermore, the observed effects are consistent across different datasets, network architectures, and AT variants (as shown in Appendix D). These experimental results show that narrowing the model’s robustness gap between training and test data can effectively prevent the degradation of feature generalization, supporting our feature generalization understanding of RO.

4 Experiment

In this section, we validate the effectiveness of the proposed methods. In Section 4.1, we provide the ablation analysis of ROFG_AS and ROFG_DA. Section 4.2 presents the performance evaluation of ROFG_AS and ROFG_DA. In Section 4.3, we discuss the limitations of the proposed methods.

Setup. We conducted extensive experiments across different benchmark datasets (CIFAR10 and CIFAR100 [21]), network architectures (PreAct ResNet-18 [15] and Wide ResNet-34-10 [42]), and adversarial training approaches (AT [26], TRADES [44], AWP [40], and MLCAT [41]). For training, we follow the same experimental settings as outlined in Rice et al. [30]. Regarding robustness evaluation, we employ PGD-20 [26] and AutoAttack (AA) [8] as adversarial attack methods. Detailed descriptions of the experimental setup can be found in Appendix E.

4.1 Analysis of the Proposed Methods

We investigate the impacts of algorithmic components in ROFG_AS and ROFG_DA under the PreAct ResNet-18 architecture on the CIFAR10 dataset.

The impact of attack strength. We empirically verify the effectiveness of attack strength component in ROFG_AS by comparing the performances of models trained using different perturbation budgets $\epsilon_{a}$ . Specifically, we conduct ROFG_AS with the perturbation budget $\epsilon_{a}$ ranging from 0/255 to 24/255, and the results are summarized in Figure 4(a). It is evident that as $\epsilon_{a}$ increases, the robustness gap between the “Best” and “Last” checkpoints steadily decreases, demonstrating its effectiveness. For adversarial robustness, we observe a trend of initially increasing and then decreasing. This is not surprising since the attack strength component with excessively large perturbation budgets is inherently detrimental to robustness improvement under the standard perturbation budget ( $\epsilon=8/255$ ). Therefore, in practice, we need to adjust the perturbation budget $\epsilon_{a}$ to balance the mitigation of RO and robustness improvement.

The importance of loss threshold $t$ . Identifying exactly which adversarial perturbations degrade feature generalization is generally intractable. To address this, we introduce a fixed loss threshold $t$ , which is used to distinguish small-loss training data. We conduct ROFG_AS experiments with $t$ varying from 1.2 to 2.0, and the model’s robustness performances are summarized in Figure 4(b). We observe a trend of initially increasing and then decreasing robustness, which implies that adversarial perturbations degrading feature generalization are primarily concentrated in the small-loss training data, suggesting the importance of the loss threshold $t$ .

The effect of data augmentation. We further investigate the effect of the data augmentation component by conducting ROFG_DA with different thresholds $p$ . The results are summarized in Figure 4(c). It is observed that as $p$ decreases, the robustness gap between the "Best" and "Last" checkpoints steadily shrinks. Additionally, we explore the use of different data augmentation techniques to implement ROFG_DA in Appendix F. It is observed that all implementations of ROFG_DA can effectively mitigate RO, suggesting that the proposed approach is generally effective regardless of the chosen data augmentation technique. For adversarial robustness, as $p$ decreases, the model’s robustness performance initially increases and then decreases. This trend may be due to the data augmentation component weakening the model’s robustness on the training data, thereby degrading the model’s defense against strong attacks.

Table 1: Test robustbess (%) of ROFG_AS and ROFG_DA across different datasets and adversarial training methods based on the PreAct ResNet-18 architecture.

Method	Dataset	Natural		PGD-20		AA
Method	Dataset	Best	Last	Best	Last	Best	Last
AT	CIFAR10	82.31 $\pm$ 0.18	84.11 $\pm$ 0.40	52.28 $\pm$ 0.37	44.46 $\pm$ 0.14	48.09 $\pm$ 0.11	42.01 $\pm$ 0.10
ROFG_DA		82.58 $\pm$ 0.25	85.45 $\pm$ 0.24	53.95 $\pm$ 0.27	49.69 $\pm$ 0.10	48.48 $\pm$ 0.09	44.44 $\pm$ 0.18
ROFG_AS		77.68 $\pm$ 0.52	78.04 $\pm$ 0.10	56.37 $\pm$ 0.33	51.96 $\pm$ 0.08	49.37 $\pm$ 0.19	45.97 $\pm$ 0.16
AT	CIFAR100	55.14 $\pm$ 0.12	55.83 $\pm$ 0.20	28.93 $\pm$ 0.36	20.87 $\pm$ 0.33	24.53 $\pm$ 0.02	18.92 $\pm$ 0.20
ROFG_DA		55.79 $\pm$ 0.39	57.92 $\pm$ 0.50	29.40 $\pm$ 0.15	25.51 $\pm$ 0.17	24.80 $\pm$ 0.12	21.59 $\pm$ 0.03
ROFG_AS		51.02 $\pm$ 0.27	51.06 $\pm$ 0.35	30.25 $\pm$ 0.36	26.19 $\pm$ 0.31	25.63 $\pm$ 0.05	22.67 $\pm$ 0.14
TRADES	CIFAR10	81.50 $\pm$ 0.31	82.27 $\pm$ 0.40	52.92 $\pm$ 0.19	49.95 $\pm$ 0.21	48.90 $\pm$ 0.16	46.92 $\pm$ 0.09
TRADES-ROFG_DA		82.89 $\pm$ 0.25	83.28 $\pm$ 0.17	53.14 $\pm$ 0.24	52.13 $\pm$ 0.36	49.12 $\pm$ 0.13	48.41 $\pm$ 0.06
TRADES-ROFG_AS		80.92 $\pm$ 0.52	80.97 $\pm$ 0.52	53.49 $\pm$ 0.31	52.04 $\pm$ 0.25	49.88 $\pm$ 0.15	48.91 $\pm$ 0.07
AWP	CIFAR10	81.01 $\pm$ 0.52	81.61 $\pm$ 0.46	55.36 $\pm$ 0.36	55.05 $\pm$ 0.13	50.12 $\pm$ 0.07	49.85 $\pm$ 0.20
AWP-ROFG_DA		81.12 $\pm$ 0.30	81.63 $\pm$ 0.44	55.89 $\pm$ 0.22	55.32 $\pm$ 0.18	50.49 $\pm$ 0.13	50.19 $\pm$ 0.13
AWP-ROFG_AS		79.39 $\pm$ 0.14	79.85 $\pm$ 0.41	56.06 $\pm$ 0.11	55.67 $\pm$ 0.20	50.79 $\pm$ 0.10	50.45 $\pm$ 0.08
MLCAT	CIFAR10	81.70 $\pm$ 0.19	82.26 $\pm$ 0.29	58.33 $\pm$ 0.26	58.25 $\pm$ 0.12	50.54 $\pm$ 0.10	50.46 $\pm$ 0.11
MLCAT-ROFG_DA		82.06 $\pm$ 0.13	82.50 $\pm$ 0.20	58.76 $\pm$ 0.37	58.57 $\pm$ 0.22	50.61 $\pm$ 0.16	50.52 $\pm$ 0.08
MLCAT-ROFG_AS		80.64 $\pm$ 0.37	81.41 $\pm$ 0.17	58.91 $\pm$ 0.38	58.74 $\pm$ 0.17	50.88 $\pm$ 0.10	50.65 $\pm$ 0.06

Table 2: Test robustbess (%) of ROFG_AS across different datasets and adversarial training methods based on the Wide ResNet-34-10 architecture.

Method	Dataset	Natural		PGD-20		AA
Method	Dataset	Best	Last	Best	Last	Best	Last
AT	CIFAR10	85.49 $\pm$ 0.23	86.50 $\pm$ 0.33	55.40 $\pm$ 0.37	47.14 $\pm$ 0.29	52.31 $\pm$ 0.13	45.74 $\pm$ 0.16
ROFG_AS	CIFAR10	82.64 $\pm$ 0.12	82.71 $\pm$ 0.55	59.07 $\pm$ 0.21	51.18 $\pm$ 0.19	53.04 $\pm$ 0.08	47.22 $\pm$ 0.19
AT	CIFAR100	60.90 $\pm$ 0.13	59.07 $\pm$ 0.31	31.35 $\pm$ 0.25	26.03 $\pm$ 0.32	27.42 $\pm$ 0.07	24.39 $\pm$ 0.07
ROFG_AS	CIFAR100	55.78 $\pm$ 0.32	52.75 $\pm$ 0.40	33.48 $\pm$ 0.26	28.40 $\pm$ 0.25	28.86 $\pm$ 0.02	24.58 $\pm$ 0.18
TRADES	CIFAR10	84.78 $\pm$ 0.51	84.70 $\pm$ 0.13	56.25 $\pm$ 0.37	48.49 $\pm$ 0.21	53.12 $\pm$ 0.14	46.69 $\pm$ 0.15
TRADES-ROFG_AS	CIFAR10	83.36 $\pm$ 0.48	83.17 $\pm$ 0.42	57.07 $\pm$ 0.23	49.65 $\pm$ 0.20	53.79 $\pm$ 0.10	47.71 $\pm$ 0.04
AWP	CIFAR10	85.30 $\pm$ 0.12	85.39 $\pm$ 0.29	58.35 $\pm$ 0.36	57.16 $\pm$ 0.29	53.07 $\pm$ 0.09	52.49 $\pm$ 0.04
AWP-ROFG_AS	CIFAR10	84.40 $\pm$ 0.22	84.71 $\pm$ 0.50	59.56 $\pm$ 0.09	57.65 $\pm$ 0.21	54.70 $\pm$ 0.12	53.08 $\pm$ 0.17
MLCAT	CIFAR10	86.72 $\pm$ 0.16	87.32 $\pm$ 0.15	62.63 $\pm$ 0.27	61.91 $\pm$ 0.24	54.73 $\pm$ 0.10	54.61 $\pm$ 0.13
MLCAT-ROFG_AS	CIFAR10	86.46 $\pm$ 0.43	86.74 $\pm$ 0.18	63.59 $\pm$ 0.32	62.40 $\pm$ 0.11	55.38 $\pm$ 0.12	54.85 $\pm$ 0.05

4.2 Performance Evaluation

The proposed ROFG_AS and ROFG_DA methods can effectively mitigate RO, as demonstrated in Section 3.3, Appendix C, and Appendix D, validating our feature generalization perspective. Regarding adversarial robustness, we conduct experiments under the PreAct ResNet-18 architecture across different datasets and adversarial training methods. The performance of the “Best” and “Last” checkpoints is provided in Table 1. It is observed that for methods exhibiting RO, such as AT [26] and TRADES [44], the proposed methods significantly improve model robustness. For methods already mitigating RO, such as AWP [40] and MLCAT [41], our methods still can achieve some complementary robustness improvement, demonstrating the significance of our feature generalization perspective. In Table 1, we can see that ROFG_AS outperforms ROFG_DA under AA in all settings. Thus we further conduct experiments using ROFG_AS under the Wide ResNet-34-10 architecture, with the results shown in Table 2. It is observed that under the stronger backbone network, the ROFG_AS method significantly enhances model robustness across different datasets and adversarial training methods. Additionally, we conduct additional experiments using ROFG_AS on the Tiny-ImageNet dataset [13] and VGG-16 architecture [32], with the results shown in Appendix G. Consistently, the proposed approach mitigates RO and boosts adversarial robustness, demonstrating its effectiveness.

4.3 Limitations

While the proposed methods contribute to validating our feature generalization perspective, they have certain limitations. Firstly, they introduce additional components into the standard AT pipeline, thereby increasing computational complexity. For instance, under a single GeForce RTX 3080 GPU, the average training time per epoch in standard AT is 147.7s, while for ROFG_AS and ROFG_DA, it extends to 261.6s and 1222.9s, respectively. Moreover, our methods have yet to achieve state-of-the-art adversarial robustness, and they often cannot simultaneously achieve optimal robustness improvement and RO mitigation, as discussed in Section 4.1. However, it is worth noting that the main focus of this work is to provide a novel feature generalization understanding of RO. These methods are specifically tailored to verify our feature generalization perspective, and these limitations essentially do not compromise our understanding of RO.

5 Conclusion

This paper investigate RO from a novel feature generalization perspective. First, we conduct factor ablation adversarial training and identify that the inducing factor of RO stems from natural data. Furthermore, we hypothesize adversarial perturbations degrade the generalization of features in natural data and validate this hypothesis through extensive experiments. Based on these findings, we provide a holistic view of RO from the feature generalization perspective and explain various empirical behaviors associated with RO. We also design two representative methods to examine our feature generalization perspective: attack strength and data augmentation. Extensive experiments demonstrate that the proposed methods effectively mitigate RO and can enhance adversarial robustness across different adversarial training methods, network architectures, and benchmark datasets.

References

[1] S. Addepalli, S. Jain, G. Sriramanan, and R. Venkatesh Babu. Scaling adversarial training to large perturbation bounds. In European Conference on Computer Vision, pages 301–316. Springer, 2022.
[2] J.-B. Alayrac, J. Uesato, P.-S. Huang, A. Fawzi, R. Stanforth, and P. Kohli. Are labels required for improving adversarial robustness? Advances in Neural Information Processing Systems, 32, 2019.
[3] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.
[4] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274–283. PMLR, 2018.
[5] Q.-Z. Cai, C. Liu, and D. Song. Curriculum adversarial training. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3740–3747, 2018.
[6] Y. Carmon, A. Raghunathan, L. Schmidt, J. C. Duchi, and P. S. Liang. Unlabeled data improves adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019.
[7] T. Chen, Z. Zhang, S. Liu, S. Chang, and Z. Wang. Robust overfitting may be mitigated by properly learned smoothening. In International Conference on Learning Representations, 2020.
[8] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning. PMLR, 2020.
[9] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[10] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
[11] C. Dong, L. Liu, and J. Shang. Label noise in adversarial training: A novel perspective to study robust overfitting. Advances in Neural Information Processing Systems, 35:17556–17567, 2022.
[12] Y. Dong, K. Xu, X. Yang, T. Pang, Z. Deng, H. Su, and J. Zhu. Exploring memorization in adversarial training. In International Conference on Learning Representations, 2022.
[13] L. Fei-Fei, R. Krishna, and A. Karpathy. Tiny imagenet visual recognition challenge. CS231n Convolutional Neural Networks for Visual Recognition, Stanford University.
[14] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
[17] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32, 2019.
[18] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
[19] X. Jia, Y. Zhang, B. Wu, K. Ma, J. Wang, and X. Cao. Las-at: adversarial training with learnable attack strategy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13398–13408, 2022.
[20] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
[21] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[22] L. Li and M. Spratling. Data augmentation alone can improve adversarial training. arXiv preprint arXiv:2301.09879, 2023.
[23] L. Li and M. Spratling. Understanding and combating robust overfitting via input loss landscape analysis and regularization. Pattern Recognition, 136:109229, 2023.
[24] C. Liu, M. Salzmann, T. Lin, R. Tomioka, and S. Süsstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. Advances in Neural Information Processing Systems, 33:21476–21487, 2020.
[25] F. Liu, B. Han, T. Liu, C. Gong, G. Niu, M. Zhou, M. Sugiyama, et al. Probabilistic margins for instance reweighting in adversarial training. Advances in Neural Information Processing Systems, 34:23258–23269, 2021.
[26] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[27] S. G. Müller and F. Hutter. Trivialaugment: Tuning-free yet state-of-the-art data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 774–782, 2021.
[28] T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu. Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning, pages 4970–4979. PMLR, 2019.
[29] S.-A. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. A. Mann. Data augmentation can improve robustness. Advances in Neural Information Processing Systems, 34, 2021.
[30] L. Rice, E. Wong, and Z. Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pages 8093–8104. PMLR, 2020.
[31] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. Advances in Neural Information Processing Systems, 31, 2018.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[34] J. Tack, S. Yu, J. Jeong, M. Kim, S. J. Hwang, and J. Shin. Consistency regularization for adversarial robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8414–8422, 2022.
[35] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
[36] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2018.
[37] Y. Wang, L. Li, J. Yang, Z. Lin, and Y. Wang. Balance, imbalance, and rebalance: Understanding robust overfitting from a minimax game perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[38] Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595. PMLR, 2019.
[39] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019.
[40] D. Wu, S.-T. Xia, and Y. Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.
[41] C. Yu, B. Han, L. Shen, J. Yu, C. Gong, M. Gong, and T. Liu. Understanding robust overfitting of adversarial training and beyond. In International Conference on Machine Learning, pages 25595–25610. PMLR, 2022.
[42] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
[43] R. Zhai, T. Cai, D. He, C. Dan, K. He, J. Hopcroft, and L. Wang. Adversarially robust generalization just requires more unlabeled data. arXiv preprint arXiv:1906.00555, 2019.
[44] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pages 7472–7482. PMLR, 2019.
[45] J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. Kankanhalli. Geometry-aware instance-reweighted adversarial training. In International Conference on Learning Representations, 2020.

Appendix A Additional Results for Factor Ablation Adversarial Training

In this section, we present additional results regarding factor ablation adversarial training across different datasets, network architectures, and AT variants. We utilize the same experimental setup as described in Section 3.1, employing a fixed loss threshold of 1.7 for the CIFAR10 dataset and 4.0 for the CIFAR100 dataset to distinguish small-loss training data. In the data & perturbation group, we remove both natural data and adversarial perturbations from small-loss training data. In the perturbation group, we solely remove adversarial perturbations from small-loss training data. The experimental results are summarized in Figure 5. It is observed that the data & perturbation group consistently exhibits significant relief from RO, while the perturbation group shows severe RO. Since the only difference between the data & perturbation and perturbation groups is the inclusion of natural data in the training set, these experimental results suggest that the inducing factor of RO stems from natural data.

Appendix B Additional Results for Verification and Inducing Experiments

In this section, we present additional results from verification and inducing experiments conducted at different stages of training. We utilize the same experimental setup as described in Section 3.2, which involves incorporating an additional adversarial perturbation in the opposite direction into the training natural data and observing the model’s test robustness. Specifically, we conduct experiments with varying budgets of additional adversarial perturbations at the 25th and 75th epochs, such as $\epsilon=\{0/255,2/255,4/255,6/255,8/255\}$ . The model’s test robustness is shown in Figure 6(a) and (b). It is observed that as the additional adversarial perturbation is incorporated, the model’s test robustness deteriorates at different stages of training, indicating that the additional adversarial perturbations exert a detrimental effect on feature generalization. It is worth noting that if we replace the additional adversarial perturbations with random noise of the same budget, the model does not exhibit the corresponding robust generalization degradation. This indicates that it is not taken for granted that adding extra perturbations to the training natural data will degrade the model’s test robustness. Furthermore, we linearly increase the budget of additional adversarial perturbations at the 25th and 75th epochs, and the model’s test robustness is summarized in Figure 6(c) and (d). We observe that the RO phenomenon can be simulated by adjusting the adversarial perturbations on training data. These experimental results support our hypothesis that the adversarial perturbations can degrade feature generalization and induce the emergence of RO.

Appendix C Additional Results for ROFG_AS

In this section, we present additional results for ROFG_AS. We utilize a fixed loss threshold of 1.7 for the CIFAR10 dataset and 4.0 for the CIFAR100 dataset to distinguish small-loss training data. We conducted ROFG_AS with varying perturbation budgets $\epsilon_{a}$ across different datasets, network architectures, and AT variants. The results are summarized in Figure 7 and 8. Consistently, we observe a clear correlation between the applied perturbation budgets $\epsilon_{a}$ and the extent of RO. As the perturbation budgets $\epsilon_{a}$ increases, the robustness gap between the “Best” and “Last” models decreases. These experimental results suggest that adjusting adversarial perturbations on small-loss training data can effectively prevent the degradation of feature generalization and mitigate RO, validating our feature generalization understanding of RO.

Appendix D Additional Results for ROFG_DA

In this section, we provide additional results of ROFG_DA to support our feature generalization perspective. We conducted ROFG_DA with different thresholds $p$ across different datasets, network architectures, and AT variants. The results are summarized in Figure 9 and 10. Consistently, we observe a clear correlation between the threshold $p$ and the extent of RO. As the robustness gap between small-loss training data and test data decreases, the extent of RO becomes increasingly mild. These experimental results show that narrowing the model’s robustness gap between training and test data can effectively prevent the degradation of feature generalization and mitigate RO, supporting our feature generalization understanding of RO.

Appendix E Experimental Setup

Our project is implemented using the PyTorch framework on servers equipped with four GeForce RTX 3080 and one Tesla V100 GPUs. The code and related models will be publicly released for verification and use. We conducted extensive experiments across different benchmark datasets (CIFAR10 and CIFAR100 [21]), network architectures (PreAct ResNet-18 [15] and Wide ResNet-34-10 [42]), and adversarial training approaches (AT [26], TRADES [44], AWP [40], and MLCAT [41]). Throughout our experiments, we utilize the infinity norm as the adversarial perturbation constraint.

For training, we follow the experimental settings outlined in Rice et al.[30]. The network undergoes 200 epochs of training using stochastic gradient descent (SGD) with a momentum of 0.9, weight decay of $5\times 10^{-4}$ , and an initial learning rate of 0.1. The learning rate is reduced by a factor of 10 at the 100th and 150th epochs. Standard data augmentation techniques, including random cropping with 4 pixels of padding and random horizontal flips, are applied.

For the adversarial setting, we use the 10-step PGD with a perturbation budget of $\epsilon=8/255$ and a step size of $\alpha=2/255$ , following the standard setting of PGD-AT [26]. In ROFG_AS, the number of PGD steps linearly increases with the adjusted perturbation budget $\epsilon_{a}$ . For instance, a 10-step PGD is utilized for $\epsilon_{a}=8/255$ , while a 20-step PGD is applied for $\epsilon_{a}=16/255$ .

For performance evaluation, we present the model’s learning curves to demonstrate their effectiveness in mitigating RO. The model’s robustness is assessed using multiple attack methods, including 20-step PGD (PGD-20) [26] and AutoAttack (AA) [8]. We report the results for both the “Best” and “Last” checkpoints, along with the standard deviations from 5 runs. The configurations for ROFG_AS and ROFG_DA are detailed in Table 3, and other hyperparameters for the baseline methods are set as per their original papers.

Table 3: The experimental configurations of ROFG_AS and ROFG_DA.

Network	Dataset	Method	Hyperparameter			Training time per epoch
Network	Dataset	Method	$t$	$\epsilon_{a}$	$p$	GeForce RTX 3080	Tesla V100
PreAct ResNet-18	CIFAR10	ROFG_DA	1.7	-	0.6	1222.9s	-
	CIFAR10	ROFG_AS	1.7	14/255	-	261.6s	-
	CIFAR100	ROFG_DA	3.5	-	0.6	1264.3s	-
	CIFAR100	ROFG_AS	3.5	15/255	-	274.0s	-
	CIFAR10	TRADES-ROFG_DA	1.9	-	0.8	-	-
	CIFAR10	TRADES-ROFG_AS	1.9	10/255	-	-	-
	CIFAR10	AWP-ROFG_DA	0.8	-	0.8	219.7s	-
	CIFAR10	AWP-ROFG_AS	0.8	12/255	-	247.0s	-
	CIFAR10	MLCAT-ROFG_DA	0.7	-	0.8	354.4s	-
	CIFAR10	MLCAT-ROFG_AS	0.7	11/255	-	387.4s	-
Wide ResNet-34-10	CIFAR10	ROFG_AS	1.4	13/255	-	-	1872.1s
	CIFAR100	ROFG_AS	4.2	16/255	-	-	2053.9s
	CIFAR10	TRADES-ROFG_AS	1.9	10/255	-	-	-
	CIFAR10	AWP-ROFG_AS	1.1	10/255	-	-	1794.0s
	CIFAR10	MLCAT-ROFG_AS	0.8	9/255	-	-	2650.9s

Appendix F The Impact of Data Augmentation Techniques in ROFG_DA

In this section, we explore the use of different data augmentation techniques to implement ROFG_DA. Specifically, we conduct experiments with ROFG_DA using three popular data augmentation techniques: AutoAugment [9], RandAugment [10], and TrivialAugment [27]. The results are summarized in Figure 11. We observe that the model’s test robustness exhibits considerable oscillation, likely due to the significant changes in input images caused by these data augmentation techniques. Nevertheless, all implementations of ROFG_DA effectively mitigate RO by narrowing the robustness gap between small-loss training data and test data. These experimental results suggest that the proposed ROFG_DA is generally effective regardless of the chosen data augmentation techniques.

Appendix G Experimental Results on More Dataset and Network Architecture

In this section, we conduct experiments with ROFG_AS on additional dataset and network architecture, such as VGG-16 [32] and Tiny-ImageNet [13]. For the VGG-16 architecture, we set the hyperparameters of ROFG_AS to $\epsilon_{a}=12/255$ and $t=1.8$ . For the Tiny-ImageNet dataset, the hyperparameters of ROFG_AS are set to $\epsilon_{a}=16/255$ and $t=7.0$ . As demonstrated by the results in Table 4, the proposed approach consistently mitigates RO and enhances adversarial robustness, demonstrating its effectiveness.

Table 4: Test robustbess (%) of ROFG_AS on Tiny-ImageNet dataset and VGG-16 architecture.

Method	Dataset	Network	Natural		PGD-20		AA
Method	Dataset	Network	Best	Last	Best	Last	Best	Last
AT	CIFAR10	VGG-16	78.84 $\pm$ 0.36	81.89 $\pm$ 0.13	49.63 $\pm$ 0.21	43.94 $\pm$ 0.34	43.70 $\pm$ 0.13	39.95 $\pm$ 0.08
ROFG_AS	CIFAR10	VGG-16	75.71 $\pm$ 0.14	76.93 $\pm$ 0.38	53.06 $\pm$ 0.11	50.62 $\pm$ 0.09	46.09 $\pm$ 0.18	43.87 $\pm$ 0.15
AT	Tiny-ImageNet	PreAct ResNet-18	45.51 $\pm$ 0.54	45.94 $\pm$ 0.11	21.34 $\pm$ 0.25	15.64 $\pm$ 0.13	17.10 $\pm$ 0.10	13.21 $\pm$ 0.13
ROFG_AS	Tiny-ImageNet	PreAct ResNet-18	39.03 $\pm$ 0.18	39.06 $\pm$ 0.23	22.20 $\pm$ 0.24	20.48 $\pm$ 0.21	17.91 $\pm$ 0.02	16.45 $\pm$ 0.07

Understanding Robust Overfitting from the Feature Generalization Perspective

Abstract

1 Introduction

2 Related Work

2.1 Adversarial Training

2.2 Robust Overfitting

3 Method

3.1 Factor Ablation Adversarial Training

3.2 Analysis of Robust Overfitting from the Feature Generalization Perspective

3.3 The Proposed Methods

4 Experiment

4.1 Analysis of the Proposed Methods

4.2 Performance Evaluation

4.3 Limitations

5 Conclusion

References

Appendix A Additional Results for Factor Ablation Adversarial Training

Appendix B Additional Results for Verification and Inducing Experiments

Appendix C Additional Results for ROFGAS

Appendix D Additional Results for ROFGDA

Appendix E Experimental Setup

Appendix F The Impact of Data Augmentation Techniques in ROFGDA

Appendix G Experimental Results on More Dataset and Network Architecture

Appendix C Additional Results for ROFG_AS

Appendix D Additional Results for ROFG_DA

Appendix F The Impact of Data Augmentation Techniques in ROFG_DA