Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training
Abstract
Similar to the surprising generalization performance of standard deep learning, deep networks trained by adversarial training also generalize well on unseen clean (natural) data. However, although adversarial training can achieve low robust training error, a significant robust generalization gap remains. We call this phenomenon Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with well-separated training data points. First, we prove that, under the assumption that a polynomial-size clean classifier exists (in the data dimension), a ReLU net with only a linear number of extra parameters is able to leverage robust memorization to achieve CGRO, while a robust classifier still requires exponential representation complexity in the worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network against adversarial perturbations. We then show that a three-stage phase transition occurs during the learning process and that the network provably converges to a robust memorization regime, which thereby results in CGRO. Besides, we also empirically verify our theoretical analysis by experiments on real-image recognition datasets.
1 Introduction
Nowadays, deep neural networks have achieved excellent performance across a variety of disciplines, especially in computer vision (Krizhevsky et al., 2012; Dosovitskiy et al., 2020; Kirillov et al., 2023) and natural language processing (Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022). However, it is well known that small but adversarial perturbations of natural data can confuse well-trained classifiers (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014), which potentially gives rise to reliability and security problems in real-world applications and motivates the design of adversarially robust learning algorithms.
In practice, adversarial training methods (Goodfellow et al., 2014; Madry et al., 2017; Shafahi et al., 2019; Zhang et al., 2019; Pang et al., 2022) are widely used to improve the robustness of models by treating perturbed data as training data. However, while these robust learning algorithms are able to achieve high robust training accuracy (Gao et al., 2019), they still suffer from a non-negligible robust generalization gap (Raghunathan et al., 2019), which is also called robust overfitting (Rice et al., 2020; Yu et al., 2022).
To explain this puzzling phenomenon, a series of works have attempted to provide theoretical understandings from different perspectives. Although these works seem to provide convincing evidence from theoretical viewpoints in different settings, there still exists a gap between theory and practice for at least two reasons.
First, although previous works have shown that adversarially robust generalization requires more data and larger models (Schmidt et al., 2018; Gowal et al., 2021; Li et al., 2022; Bubeck & Sellke, 2023), it is unclear what mechanism, during the adversarial training process, directly causes robust overfitting. A line of work on uniform algorithmic stability (Xing et al., 2021; Xiao et al., 2022), under Lipschitzian smoothness assumptions, also suggests that the robust generalization gap increases as the number of training iterations grows. Indeed, a trivial model that guesses labels uniformly at random (e.g., a deep neural network at random initialization) has no robust generalization gap, which implies that we should take the learning process into consideration when analyzing robust generalization.
Second, and most importantly, while some works (Tsipras et al., 2018; Zhang et al., 2019; Hassani & Javanmard, 2022) point out that achieving robustness may hurt clean test accuracy, it is observed in most cases that the drop in robust test accuracy is much larger than the drop in clean test accuracy in adversarial training (Madry et al., 2017; Schmidt et al., 2018; Raghunathan et al., 2019) (see Figure 1, where clean test accuracy remains high while robust test accuracy is substantially lower). Namely, a weak version of benign overfitting (Zhang et al., 2017), which means that overparameterized deep neural networks can both fit random data powerfully and generalize well on unseen clean data, persists after adversarial training.
Therefore, it is natural to ask the following question:
What is the underlying mechanism that results in both Clean Generalization and Robust Overfitting (CGRO) during adversarial training?
In this paper, we provide a theoretical understanding of this question. Precisely, we make the following contributions:
- In Section 4, we study CGRO classifiers from the view of representation complexity. Based on data assumptions observed in practice, we prove that achieving a CGRO classifier requires only a linear number of extra parameters by leveraging robust memorization (Theorem 4.4), whereas a robust classifier requires exponential model capacity in the worst case (Theorem 4.7).
- In Section 5, under our theoretical framework of adversarial training, we apply a two-layer convolutional network to learn the structured data. We propose a three-stage analysis technique to decouple the complicated training dynamics of adversarial training, which shows that the network learner provably converges to the CGRO regime (Theorem 5.9).
- In Section 6, we empirically verify our theoretical findings through experiments on real-world image-recognition datasets and synthetic structured data.
2 Additional Related Work
Adversarial Examples and Adversarial Training Methods. Szegedy et al. (2013) first made an intriguing discovery: even state-of-the-art deep neural networks are vulnerable to adversarial examples. Goodfellow et al. (2014) then proposed FGSM to craft adversarial examples, and also introduced the adversarial training framework to enhance the defense of models against adversarial perturbations.
Empirical Works on Robust Overfitting. One surprising behavior of deep learning is that over-parameterized neural networks can generalize well, which is also called benign overfitting: deep models not only have powerful memorization ability but also perform well on unseen data (Zhang et al., 2017; Belkin et al., 2019). However, in contrast to standard (non-robust) generalization, in the robust setting, Rice et al. (2020) empirically investigate the robust performance of models trained with adversarial training methods, which are used to improve adversarial robustness (Szegedy et al., 2013; Madry et al., 2017), and show that robust overfitting can be observed on multiple datasets, including CIFAR10 and ImageNet.
Theoretical Works on Robust Overfitting. A line of works (Schmidt et al., 2018; Balaji et al., 2019; Dan et al., 2020) studies the sample complexity of adversarial robustness and shows that adversarially robust generalization requires more data than the standard setting, which explains the robust generalization gap from the perspective of statistical learning theory. Another line of works (Tsipras et al., 2018; Zhang et al., 2019) proposes a principle called the robustness-accuracy trade-off and proves it theoretically in different settings, which mainly explains the widely observed drop in robust test accuracy as a trade-off between adversarial robustness and clean accuracy. Recently, Li et al. (2022) investigated the robust expressive ability of neural networks and showed that robust generalization requires exponentially large models.
Feature Learning Theory of Deep Learning. The feature learning theory of neural networks (Allen-Zhu & Li, 2020a, b, 2022; Shen et al., 2022; Jelassi & Li, 2022; Jelassi et al., 2022; Chen et al., 2022; Chidambaram et al., 2023) studies how features are learned in deep learning tasks and provides a theoretical analysis paradigm beyond the neural tangent kernel (NTK) theory (Jacot et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019; Arora et al., 2019). In this work, we take a first step toward understanding the clean generalization and robust overfitting (CGRO) phenomenon in adversarial training by analyzing the feature learning process under our theoretical framework for structured data.
3 Preliminaries
In this section, we first introduce some useful notations that are used in this paper, and then present our problem setup.
3.1 Notations
Throughout this work, we use regular letters for scalars and bold letters for vectors. We use $[n]$ to denote the index set $\{1, 2, \dots, n\}$. The indicator $\mathbb{I}\{E\}$ of an event $E$ is defined as $1$ if the event holds and $0$ otherwise. We use $\operatorname{sgn}(x)$ to denote the sign function of a real number $x$, and $\operatorname{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_k)$ to denote the linear span of the vectors $\boldsymbol{v}_1, \dots, \boldsymbol{v}_k$.
For any two sequences $(a_n)$ and $(b_n)$, we write $a_n = O(b_n)$ if there exist an absolute constant $C > 0$ and $N > 0$ such that $a_n \le C b_n$ for all $n \ge N$. Similarly, we write $a_n = \Omega(b_n)$ if there exist $C > 0$ and $N > 0$ such that $a_n \ge C b_n$ for all $n \ge N$. We say $a_n = \Theta(b_n)$ if both $a_n = O(b_n)$ and $a_n = \Omega(b_n)$ hold. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$ and $\widetilde{\Theta}(\cdot)$ to hide logarithmic factors in these notations. Moreover, we write $a_n = \operatorname{poly}(b_n)$ if $a_n = O(b_n^{C})$ for some positive constant $C$, and $a_n = \operatorname{polylog}(b_n)$ if $a_n = \operatorname{poly}(\log b_n)$. We also say $a_n = o(b_n)$ if, for an arbitrary positive constant $C$, there exists $N > 0$ such that $a_n \le C b_n$ for all $n \ge N$. An event is said to happen with high probability (or w.h.p. for short) if it happens with probability $1 - o(1)$.
We use $\|\cdot\|_p$ to denote the $\ell_p$ norm on the vector space $\mathbb{R}^d$. For two sets $A, B \subseteq \mathbb{R}^d$, we define the $\ell_p$-distance between $A$ and $B$ as $\operatorname{dist}_p(A, B) = \inf_{\boldsymbol{a} \in A,\, \boldsymbol{b} \in B} \|\boldsymbol{a} - \boldsymbol{b}\|_p$. For $\boldsymbol{x} \in \mathbb{R}^d$, $\mathbb{B}_p(\boldsymbol{x}, \delta)$ is defined as the $\ell_p$-ball with radius $\delta$ centered at $\boldsymbol{x}$.
A multi-layer neural network is a function from an input in $\mathbb{R}^d$ to an output in $\mathbb{R}$, which is defined as follows:
$$\boldsymbol{h}_0 = \boldsymbol{x}, \qquad \boldsymbol{h}_{l} = \sigma\big(\boldsymbol{W}_l \boldsymbol{h}_{l-1} + \boldsymbol{b}_l\big) \ \text{for } l \in [L-1],$$
and
$$f(\boldsymbol{x}) = \boldsymbol{W}_L \boldsymbol{h}_{L-1} + b_L,$$
where the weights $\boldsymbol{W}_l$ and the biases $\boldsymbol{b}_l$ are learnable parameters, and $m$ and $L$ are the width and depth of the neural network, respectively. We use $\sigma(\cdot)$ to denote the (non-linear) entry-wise activation function, and the ReLU activation function is defined as $\operatorname{ReLU}(z) = \max\{z, 0\}$.
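As a point of reference, the following minimal sketch instantiates this multi-layer ReLU network in PyTorch; the class name and the default width, depth and input dimension are illustrative choices rather than values from the paper.

```python
import torch
import torch.nn as nn

class ReLUNet(nn.Module):
    """Depth-L, width-m ReLU network mapping R^d -> R, following the
    recursive definition above (illustrative default hyperparameters)."""

    def __init__(self, d: int = 10, m: int = 64, L: int = 3):
        super().__init__()
        dims = [d] + [m] * (L - 1)
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1]) for l in range(L - 1)]
        )
        self.out = nn.Linear(dims[-1], 1)  # final affine layer W_L h + b_L

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x
        for layer in self.hidden:
            h = torch.relu(layer(h))       # h_l = sigma(W_l h_{l-1} + b_l)
        return self.out(h).squeeze(-1)     # scalar output f(x)

f = ReLUNet()
print(f(torch.randn(5, 10)).shape)  # torch.Size([5])
```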
3.2 Problem Setup
We consider a binary classification setting, where we use $\boldsymbol{x} \in \mathbb{R}^d$ to denote the data input and the binary label $y$ lies in $\{-1, +1\}$. Given a data distribution $\mathcal{D}$, which is a joint distribution over the support set $\mathcal{X} \times \{-1, +1\}$, and a function $f: \mathbb{R}^d \to \mathbb{R}$ as the classifier, we define the following measurements to describe the clean (robust) classification performance of the classifier $f$ on data drawn from $\mathcal{D}$.
Definition 3.1.
(Clean Test Error) The clean test error of the classifier $f$ w.r.t. the data distribution $\mathcal{D}$ is defined as $\mathbb{P}_{(\boldsymbol{x}, y) \sim \mathcal{D}}\left[\operatorname{sgn}(f(\boldsymbol{x})) \neq y\right]$.
Definition 3.2.
(Robust Test Error) Given a $\delta$-robust radius, the robust test error of the classifier $f$ w.r.t. the data distribution $\mathcal{D}$ and under the $\ell_p$ norm is defined as $\mathbb{P}_{(\boldsymbol{x}, y) \sim \mathcal{D}}\left[\exists\, \boldsymbol{x}' \in \mathbb{B}_p(\boldsymbol{x}, \delta): \operatorname{sgn}(f(\boldsymbol{x}')) \neq y\right]$.
In our work, we mainly focus on the cases $p = 2$ and $p = \infty$; the analysis can be extended to the general $\ell_p$ case as well.
In adversarial training, with access to a training dataset $\mathcal{S} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ randomly sampled from the data distribution $\mathcal{D}$, we aim to minimize the following robust training error to derive a robust classifier.
Definition 3.3.
(Robust Training Error) Given a $\delta$-robust radius, the robust training error of the classifier $f$ w.r.t. the training dataset $\mathcal{S}$ and under the $\ell_p$ norm is defined as $\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left\{\exists\, \boldsymbol{x}_i' \in \mathbb{B}_p(\boldsymbol{x}_i, \delta): \operatorname{sgn}(f(\boldsymbol{x}_i')) \neq y_i\right\}$.
Now, we present the concept of a CGRO classifier, which we mainly study in this paper, in the following definition.
Definition 3.4.
(CGRO Classifier) Given a $\delta$-robust radius, we say a classifier $f$ is a CGRO classifier w.r.t. the data distribution $\mathcal{D}$ and training dataset $\mathcal{S}$ if its clean test error and robust training error are both $o(1)$, but its robust test error is $\Omega(1)$.
Remark 3.5.
In the above definition of a CGRO classifier, the asymptotic notations $o(1)$ and $\Omega(1)$ are both defined with respect to the data dimension $d$, which means that a CGRO classifier has good clean test performance but poor robust generalization at the same time.
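To make Definitions 3.1–3.4 concrete, the sketch below estimates the three quantities that characterize a CGRO classifier for a given PyTorch model. The PGD-style inner maximization is a standard surrogate for the existential quantifier in the robust errors, and the function names, radius and step sizes are illustrative assumptions rather than the paper's setup.

```python
import torch

def pgd_attack(model, x, y, delta=0.1, steps=10, lr=0.02):
    """Approximate the worst-case l_inf perturbation within radius delta.
    y: labels in {-1., +1.} (float tensor); model outputs a real-valued score."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.soft_margin_loss(model(x_adv), y)  # logistic loss
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv += lr * grad.sign()                          # ascent step
            x_adv.copy_(x + (x_adv - x).clamp(-delta, delta))  # project to the ball
        x_adv.requires_grad_(True)
    return x_adv.detach()

@torch.no_grad()
def clean_error(model, x, y):
    return (torch.sign(model(x)) != y).float().mean().item()

def robust_error(model, x, y, delta=0.1):
    x_adv = pgd_attack(model, x, y, delta)
    with torch.no_grad():
        return (torch.sign(model(x_adv)) != y).float().mean().item()

# A classifier f is (empirically) CGRO if clean_error(f, x_test, y_test) and
# robust_error(f, x_train, y_train) are both small while
# robust_error(f, x_test, y_test) stays large.
```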
4 Analyzing the CGRO Phenomenon from the View of Representation Complexity
In this section, we provide a theoretical understanding of the CGRO phenomenon from the view of representation complexity. First, we present our data assumptions as follows.
For the data distribution $\mathcal{D}$ and a $\delta$-robust radius, the support set of the data input can be divided into two sets $\mathcal{X}_{+}$ and $\mathcal{X}_{-}$ that correspond to the positive and negative classes respectively, i.e., inputs in $\mathcal{X}_{+}$ have label $+1$ and inputs in $\mathcal{X}_{-}$ have label $-1$.
Assumption 4.1.
(Bounded) There exists an absolute constant $B > 0$ such that, with high probability over the data distribution $\mathcal{D}$, the data input satisfies $\|\boldsymbol{x}\| \le B$.
Recall the definition of CGRO classifier (Definition 3.4). W.l.o.g., we can only focus on the case .
Assumption 4.2.
(Well-Separated) We assume that the separation between the positive and negative classes is larger than twice the robust radius, i.e., $\operatorname{dist}_p(\mathcal{X}_{+}, \mathcal{X}_{-}) > 2\delta$.
This assumption is necessary to ensure the existence of a robust classifier, and it is indeed empirically observed in real-world image classification data (Yang et al., 2020).
Assumption 4.3.
(Polynomial-Size Clean Classifier Exists) We assume that there exists a clean classifier represented as a ReLU network with $\operatorname{poly}(d)$ parameters whose clean test error is $o(1)$ but whose robust test error is $\Omega(1)$.
In Assumption 4.3, the asymptotic notations $o(1)$ and $\Omega(1)$ are defined w.r.t. the data dimension $d$. This means that a clean classifier of moderate size can perfectly classify the natural data but fails to classify the adversarially perturbed data, which is consistent with the common observation that well-trained neural networks are vulnerable to adversarial examples (Szegedy et al., 2013; Raghunathan et al., 2019).
Now, we present our main result on the upper bound of representation complexity for CGRO classifiers as follows.
Theorem 4.4. (Informal; see Theorem F.1 for a restatement) Under Assumptions 4.1, 4.2 and 4.3, for a training dataset $\mathcal{S}$ drawn from the data distribution $\mathcal{D}$, there exists a CGRO classifier that can be represented as a ReLU network with only a linear number of extra parameters beyond the clean classifier: it achieves zero clean test and robust training errors, but its robust test error is $\Omega(1)$.
Proof Sketch. First, we show the existence of a CGRO classifier w.r.t. the data distribution $\mathcal{D}$ and the training dataset $\mathcal{S}$.
Lemma 4.5.
Due to Assumption 4.2, the $\delta$-balls around training points with different labels are disjoint, i.e., $\mathbb{B}_p(\boldsymbol{x}_i, \delta) \cap \mathbb{B}_p(\boldsymbol{x}_j, \delta) = \emptyset$ for any distinct $i, j$ with $y_i \neq y_j$. Thus, the constructed classifier perfectly classifies unseen clean test data by the first summand (Assumption 4.3) and robustly memorizes the training data by the second summand, so it belongs to the class of CGRO classifiers.
Back to the proof of Theorem 4.4, the key idea is to approximate the classifier proposed in Lemma 4.5 by a ReLU network. Indeed, the memorization term can be rewritten in terms of the distance function to each training point.
Therefore, we use ReLU nets to efficiently approximate the distance function, and we also note that the indicator can be approximated by a soft indicator represented by two ReLU neurons. Combining the above results, there exists a ReLU net of the claimed size that realizes the construction, which implies Theorem 4.4.
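The construction behind Lemma 4.5 can be summarized in code: the clean classifier is combined with a robust-memorization term built from soft indicators of small balls around the training points. The NumPy sketch below is one illustrative reading of this construction; the combination rule and helper names are ours, not the paper's exact formula.

```python
import numpy as np

def soft_indicator(dist, radius, eps=1e-3):
    """Piecewise-linear surrogate of 1{dist <= radius}: equals 1 inside the
    ball, 0 outside radius + eps, and is linear in between (two ReLU neurons)."""
    relu = lambda t: np.maximum(t, 0.0)
    return (relu(radius + eps - dist) - relu(radius - dist)) / eps

def cgro_classifier(f_clean, x_train, y_train, delta):
    """Clean classifier plus robust memorization of the training set."""
    def f(x):
        dists = np.linalg.norm(x_train - x, axis=1)   # ||x - x_i||
        bumps = soft_indicator(dists, delta)          # local robust indicators
        outside = 1.0 - bumps.max()                   # ~1 if far from all x_i
        # Predict with f_clean away from training points; inside a delta-ball
        # around x_i, the memorization term forces the label y_i.
        return outside * f_clean(x) + np.dot(bumps, y_train)
    return f

# Example: memorize two points on top of a trivial clean classifier.
x_tr = np.array([[0.0, 0.0], [1.0, 1.0]])
y_tr = np.array([1.0, -1.0])
f = cgro_classifier(lambda x: x.sum(), x_tr, y_tr, delta=0.1)
print(f(np.array([0.02, 0.0])), f(np.array([2.0, 2.0])))  # 1.0 (memorized), 4.0 (clean)
```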
Remark 4.6.
However, to achieve robust generalization, higher complexity seems necessary. We generalize the result of Li et al. (2022) from the linearly-separable setting to Assumption 4.3.
Theorem 4.7. (Informal) Under the assumptions above, in the worst case, any ReLU network that achieves low robust test error requires representation complexity that is exponential in the data dimension $d$.
Proof Sketch. The main idea of the proof of Theorem 4.7 is the linear-region decomposition of ReLU nets (Montufar et al., 2014); we then apply a technique that bounds the VC dimension on a local region, similar to Li et al. (2022).
According to Assumption 4.3, Theorem 4.4 and Theorem 4.7, we obtain the following inequalities about representation complexity of classifiers in different regimes.
Remark 4.8.
These inequalities state that while CGRO classifiers have mildly higher representation complexity than clean classifiers, adversarial robustness requires excessively higher complexity, which may lead the classifier trained by adversarial training to the CGRO regime. We also empirically verify this through adversarial training experiments with different model sizes in Section 6.
5 Analyzing Dynamics of Adversarial Training on Structured Data
In the previous section, we showed the efficiency of CGRO classifiers via representation complexity. However, it is unclear how the classifier trained by adversarial training converges to the CGRO regime. In this section, we focus on a structured-data setting to study the learning process of adversarial training. First, we introduce our structured-data setting, and then provide a fine-grained analysis of adversarial training under our theoretical framework.
5.1 Structured Data Setting
In this section, we consider the case where the data input has a patch structure, as follows, which is similar to a line of theoretical works on feature learning (Allen-Zhu & Li, 2020b; Chen et al., 2021; Jelassi & Li, 2022; Jelassi et al., 2022; Kou et al., 2023; Chidambaram et al., 2023).
Definition 5.1.
(Patch Data Distribution) We define a data distribution $\mathcal{D}$ in which each instance consists of an input $\boldsymbol{x}$ and a label $y$ generated as follows:
1. The label $y$ is uniformly drawn from $\{-1, +1\}$.
2. The input $\boldsymbol{x} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(P)})$ is composed of patches, where each patch $\boldsymbol{x}^{(p)}$ is a vector and $P$ is the number of patches (we assume that $P$ is an integer).
3. Meaningful signal patch: for each instance, there exists one and only one meaningful patch $\boldsymbol{x}^{(p^*)}$ satisfying $\boldsymbol{x}^{(p^*)} = \beta\, y\, \boldsymbol{u}$, where $\boldsymbol{u}$ is the unit meaningful signal vector and $\beta > 0$ is the norm of the signal.
4. Noisy patches: $\boldsymbol{x}^{(p)} = \boldsymbol{\xi}^{(p)}$ for $p \neq p^*$, where each $\boldsymbol{\xi}^{(p)}$ is a zero-mean random noise vector and $\sigma^2$ denotes the variance of the noise $\boldsymbol{\xi}^{(p)}$.
Remark 5.2.
The patch structure that we leverage can be viewed as a simplification of real-world vision-recognition datasets. Indeed, images can be divided into signal patches that are meaningful for classification, such as the whisker of a cat or the nose of a dog, and noisy patches, such as the uninformative background of a photo. Our patch data assumption can also be generalized to the case where a set of patches is meaningful, but analyzing the learning process would then become too complicated and obscure the main idea we want to present. Therefore, we focus on the single-meaningful-patch case in this work.
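For reference, the following sketch samples data from an instantiation of Definition 5.1; the concrete patch count, signal strength and noise level are placeholder values rather than the paper's hyperparameters.

```python
import numpy as np

def sample_patch_data(n, d=100, P=4, beta=1.0, sigma=0.05, seed=0):
    """Sample n instances: one signal patch equals beta * y * u, and the
    remaining P-1 patches are Gaussian noise (placeholder hyperparameters)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                      # unit meaningful signal vector
    X = sigma * rng.standard_normal((n, P, d))  # noisy patches
    y = rng.choice([-1.0, 1.0], size=n)         # uniform labels
    signal_pos = rng.integers(0, P, size=n)     # location of the signal patch
    X[np.arange(n), signal_pos] = beta * y[:, None] * u
    return X, y, u

X, y, u = sample_patch_data(n=8)
print(X.shape, y.shape)  # (8, 4, 100) (8,)
```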
To learn our structured data, we use a two-layer convolutional neural network (CNN) (LeCun et al., 1998; Krizhevsky et al., 2012) with a non-linear activation as the learner network.
Definition 5.3.
(Two-layer Convolutional Network) For a given input $\boldsymbol{x} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(P)})$, our network learner is defined as
$$f_{\boldsymbol{W}}(\boldsymbol{x}) = \sum_{r=1}^{m} a_r \sum_{p=1}^{P} \sigma\big(\langle \boldsymbol{w}_r, \boldsymbol{x}^{(p)} \rangle\big),$$
where $\boldsymbol{W} = (\boldsymbol{w}_1, \dots, \boldsymbol{w}_m)$ and $(a_r)_{r=1}^{m}$ are learnable parameters, and $\sigma(\cdot)$ is the activation function defined as $\sigma(z) = z^{q}$.
Polynomial activation functions are useful to address the non-smoothness of the ReLU function at zero and are widely applied in the deep learning theory literature (Kileel et al., 2019; Allen-Zhu & Li, 2020a, b; Chen et al., 2021; Jelassi & Li, 2022; Jelassi et al., 2022). To simplify our analysis, we fix the second-layer weights. We also assume that the degree $q$ is odd, and this choice is kept throughout the optimization process. At initialization, we sample the filters $\boldsymbol{w}_r$ i.i.d. from a zero-mean Gaussian distribution. Our results can be extended to the case where $q$ is even and the second-layer weights are chosen differently.
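A minimal learner in the spirit of Definition 5.3 is sketched below; the patch-wise filters, the cubic activation $\sigma(z) = z^3$ and the fixed uniform second layer are assumed instantiations and may differ from the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class TwoLayerConvNet(nn.Module):
    """f_W(x) = sum_r a_r * sum_p sigma(<w_r, x^(p)>) with sigma(z) = z**q.
    The cubic activation and fixed second layer are illustrative choices."""

    def __init__(self, d: int, m: int, q: int = 3):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(m, d))  # filters w_1..w_m
        self.register_buffer("a", torch.ones(m) / m)     # fixed second layer
        self.q = q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, P, d); pre-activations have shape (batch, P, m)
        pre = torch.einsum("bpd,md->bpm", x, self.W)
        return (pre ** self.q).sum(dim=1) @ self.a       # shape (batch,)

net = TwoLayerConvNet(d=100, m=20)
print(net(torch.randn(8, 4, 100)).shape)  # torch.Size([8])
```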
The Role of Non-linearity. A series of recent theoretical works (Li et al., 2019; Chen et al., 2021; Javanmard & Soltanolkotabi, 2022) show that linear models can achieve robust generalization under adversarial training in certain settings, which, however, fails to explain the CGRO phenomenon observed in practice. To mitigate this gap, we improve the expressive power of the model by using a non-linear activation, which characterizes the data structure and the learning process more precisely.
In adversarial training, we aim to minimize the following adversarial training loss, which is a trade-off between the natural risk and a robust regularization term, defined as follows.
Definition 5.4.
(Adversarial Training Loss) For a trade-off hyperparameter $\lambda \in [0, 1]$ and the $\delta$-robust radius, the adversarial training loss of the network $f_{\boldsymbol{W}}$ w.r.t. the training dataset $\mathcal{S}$ is defined as
$$\widehat{\mathcal{L}}(\boldsymbol{W}) = \frac{1}{n} \sum_{i=1}^{n} \Big[(1-\lambda)\, \ell\big(y_i f_{\boldsymbol{W}}(\boldsymbol{x}_i)\big) + \lambda \max_{\widehat{\boldsymbol{x}}_i \in \mathcal{B}(\boldsymbol{x}_i, \delta)} \ell\big(y_i f_{\boldsymbol{W}}(\widehat{\boldsymbol{x}}_i)\big)\Big],$$
where we use $\ell(\cdot)$ to denote the single-point loss and $\mathcal{B}(\boldsymbol{x}_i, \delta)$ denotes the perturbation set of the data point $\boldsymbol{x}_i$ with radius $\delta$.
Remark 5.5.
Here, we use the logistic loss defined as $\ell(z) = \log(1 + \exp(-z))$. We apply a perturbation set $\mathcal{B}(\boldsymbol{x}_i, \delta)$ in which each perturbation is a multiple of the meaningful signal vector $\boldsymbol{u}$ with magnitude at most the robust radius $\delta$. This perturbation set ensures that the adversarial perturbations used to generate adversarial examples are always aligned with the meaningful signal vector during adversarial training, which simplifies the optimization analysis.
Definition 5.6.
(Adversarial Training Algorithm) We run the standard gradient descent method to update the network parameters for $T$ iterations w.r.t. the adversarial training loss, where $\widehat{\boldsymbol{x}}_i^{(t)}$ denotes the adversarial example generated from the $i$-th training data point at the $t$-th iteration. Concretely, the adversarial examples and the network parameters are updated alternately: at each iteration, the adversarial examples are generated by maximizing the loss over the perturbation set, and the network weights are then updated by a gradient descent step on the resulting loss with learning rate $\eta$.
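The alternating update of Definition 5.6 can be sketched as follows; here the inner maximization is approximated by a few signed-gradient steps over a generic $\ell_\infty$ ball (whereas the paper restricts perturbations to the signal direction), so this is an illustrative variant rather than the exact algorithm.

```python
import torch

def adversarial_training(net, X, y, delta=0.1, lr=0.05, T=200, inner_steps=5):
    """Alternate between generating adversarial examples and taking a gradient
    step on the resulting loss (logistic loss, l_inf ball; illustrative surrogate).
    y: labels in {-1., +1.} (float tensor)."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(T):
        # Inner loop: approximately maximize the loss within the delta-ball.
        x_adv = X.clone().detach().requires_grad_(True)
        for _ in range(inner_steps):
            loss = torch.nn.functional.soft_margin_loss(net(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv += (delta / inner_steps) * grad.sign()
                x_adv.copy_(X + (x_adv - X).clamp(-delta, delta))
            x_adv.requires_grad_(True)
        # Outer step: gradient descent on the adversarial training loss.
        opt.zero_grad()
        torch.nn.functional.soft_margin_loss(net(x_adv.detach()), y).backward()
        opt.step()
    return net
```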
Next, we make the following assumptions about the hyperparameters introduced in our structured-data setting.
Assumption 5.7.
(Choice of Hyperparameters) We assume that:
where are constants satisfying .
Discussion of Hyperparameter Choices. The choices of these hyperparameters are not unique. We make concrete choices above for the sake of the calculations in our proofs, but only the relationships between them are important. Namely, since the norm of the signal patch is $\beta$ and the norm of a noise patch is w.h.p. of order $\sigma \sqrt{d}$, our choices ensure that the meaningful signal is stronger than the noise. Without this assumption, in other words, if the signal-to-noise ratio is very low, there is not even clean generalization, which has been theoretically shown under similar patch-structured data settings (Cao et al., 2021; Chen et al., 2021; Frei et al., 2022). The width of the network learner is chosen to achieve the mild over-parameterization mentioned in Theorem 4.4. The separation condition in Assumption 4.2 also holds under these choices.
5.2 Main Results
First, we introduce the concept of feature learning to characterize what the model learns during training.
Definition 5.8.
(Feature Learning) Given a training data point $(\boldsymbol{x}_i, y_i)$ and the network learner $f_{\boldsymbol{W}}$, we focus on the following two types of feature learning.
- True Feature Learning. We project the weights $\boldsymbol{w}_r$ onto the meaningful signal vector $\boldsymbol{u}$ to measure the correlation between the model and the true feature.
- Spurious Feature Learning. We project the weights $\boldsymbol{w}_r$ onto the random noise $\boldsymbol{\xi}_i$ of the training instance to measure the correlation between the model and the spurious feature.
Thus, the model correctly classifies a clean data point in two cases. One is that the model learns the true feature and ignores the spurious features. The other is that the model does not learn the true feature but memorizes the spurious features. The perturbed data can be analyzed similarly.
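These correlations are straightforward to track numerically; the sketch below computes the signal and noise projections for a filter matrix, assuming the tensor shapes of the illustrative network and data generator sketched earlier.

```python
import torch

def feature_correlations(W, u, noise_patches):
    """W: (m, d) filter matrix; u: (d,) unit signal vector;
    noise_patches: (n, d), one representative noise patch per training point.
    Returns the per-filter signal projections <w_r, u> and the per-sample
    maximum noise projections max_r <w_r, xi_i>, i.e. the two quantities
    tracked in Definition 5.8."""
    signal_proj = W @ u                  # <w_r, u> for each filter r
    noise_proj = noise_patches @ W.T     # <w_r, xi_i> for each (i, r)
    return signal_proj, noise_proj.max(dim=1).values

# Example with illustrative tensors:
W = 0.01 * torch.randn(20, 100)
u = torch.randn(100); u = u / u.norm()
xi = 0.05 * torch.randn(8, 100)
sig, noi = feature_correlations(W, u, xi)
print(sig.shape, noi.shape)  # torch.Size([20]) torch.Size([8])
```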
Now, we give our main result on the feature learning process.
Theorem 5.9.
Under Assumption 5.7, we run the adversarial training algorithm in Definition 5.6 to update the weights of the network learner for $T$ iterations. Then, with high probability, the network learner
1. partially learns the true feature, and
2. exactly memorizes the spurious feature of each training instance,
where the signal and noise correlations are defined for the $i$-th instance and the $t$-th iteration as in Definition 5.8. Consequently, the clean test error and robust training error are both vanishingly small, but the robust test error is bounded away from zero, which means that the learned network is a CGRO classifier.
Remark 5.10.
Theorem 5.9 states that, during adversarial training, the neural network partially learns the true feature of the object classes and exactly memorizes the spurious features of specific training data. As a result, the network learner is able to correctly classify clean data through the partially learned meaningful signal (clean generalization), but fails to classify unseen perturbed data since it relies only on data-wise random noise to memorize training adversarial examples (robust overfitting). We also conduct numerical simulation experiments to confirm our results in Section 6.
5.3 Analysis of Learning Process
Next, we provide a proof sketch of Theorem 5.9. To obtain a detailed analysis of the learning process, we consider the following objects, which can be viewed as weight-wise versions of the feature learning quantities in Definition 5.8: for each filter $r \in [m]$, training instance $i \in [n]$ and iteration $t$, we track the signal component $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{u} \rangle$ and the noise component $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{\xi}_i \rangle$.
We then propose a three-stage analysis technique to decouple the complicated feature learning process as follows.
Phase I: At the beginning, the signal component of the lottery-ticket winner (the filter with the largest initial signal correlation) increases quadratically (Lemma 5.11).
Lemma 5.11.
Lemma 5.11 shows that the signal component increases quadratically near initialization. Therefore, after sufficiently many iterations, the maximum signal component attains the target order, which implies that the model has learned a partial true feature at this point.
Phase II: Once the signal component attains this order, its growth nearly stops, since the increment of the signal component is now mostly dominated by the noise component (Lemma 5.12).
Lemma 5.12.
(Upper Bound of Signal Component Growth) For and any , the signal component is upper bounded as
Lemma 5.12 shows that, after partial true feature learning, the increment of the signal component is mostly dominated by the noise component, which implies that the growth of the signal component stalls once this domination occurs.
Phase III: After that, by the quadratic increment of the noise component (Lemma 5.13), the total noise eventually attains the target order, which implies that the model finally memorizes the spurious feature (data-wise noise).
Lemma 5.13.
(Lower Bound of Noise Component Growth) For each , and and any , the noise component grows as
The implication of Lemma 5.13 is two-fold. First, by the quadratic increment of the noise component, we derive that, after sufficiently many iterations, the total noise memorization attains the target order, which suggests that the model is able to robustly classify the training adversarial examples by memorizing the data-wise noise. Second, combined with Lemma 5.12, the signal component maintains its order, which immediately implies the main conclusion of Theorem 5.9. The full detailed proof of Theorem 5.9 can be found in Appendix E.
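The qualitative shape of this three-phase behavior can be reproduced with a toy simulation: a signal coordinate and several noise coordinates both follow damped quadratic growth, the signal saturates first, and the noise coordinates then take over. The recursion below is a deliberately simplified caricature of Lemmas 5.11–5.13, not the exact update derived in the proofs.

```python
import numpy as np

def simulate_phases(T=4000, eta=0.05, init_signal=0.02, init_noise=0.01,
                    cap=1.0, n=20):
    """Toy caricature of the three phases: damped quadratic growth for one
    signal coordinate and n noise coordinates; the signal plateaus first."""
    signal, noise = init_signal, init_noise * np.ones(n)
    hist_sig, hist_noise = [], []
    for _ in range(T):
        fit = signal + noise.mean()
        damp = 1.0 / (1.0 + np.exp(4.0 * (fit - cap)))       # slows growth near cap
        signal += eta * damp * signal ** 2 * (signal < cap)  # Phase I, then plateau
        noise += eta * damp * noise ** 2                     # Phase III take-over
        hist_sig.append(signal)
        hist_noise.append(noise.mean())
    return np.array(hist_sig), np.array(hist_noise)

sig, noi = simulate_phases()
print(f"final signal {sig[-1]:.2f}, final mean noise {noi[-1]:.2f}")
```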
6 Experiments
Table 1: Clean test, robust test, and robust training accuracy (%) of adversarially trained models with different model sizes on MNIST and CIFAR10.

| Dataset | MNIST | MNIST | MNIST | MNIST | MNIST | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 |
|---|---|---|---|---|---|---|---|---|---|
| Model Size | 1 | 4 | 8 | 12 | 16 | 1 | 2 | 5 | 10 |
| Clean Test | 11.35 | 11.35 | 11.35 | 95.06 | 94.85 | 82.56 | 84.92 | 85.83 | 86.05 |
| Robust Test | 11.35 | 11.35 | 11.35 | 77.96 | 83.43 | 43.39 | 43.74 | 46.25 | 50.08 |
| Robust Train | 11.70 | 11.70 | 11.70 | 99.30 | 99.50 | 64.19 | 79.82 | 97.37 | 99.57 |
In this section, we empirically verify our theoretical results from Section 4 and Section 5 through numerical experiments on real-world image-recognition datasets and on synthetic structured data, respectively.
6.1 Effect of Different Model Sizes
Experiment Settings. For the MNIST dataset, we consider a simple convolutional network, LeNet-5 (LeCun et al., 1998), and study how its performance changes as we increase the size of the network (i.e., we scale the number of convolutional filters and the size of the fully connected layer by a size number, where the size numbers are 1, 4, 8, 12 and 16). The original convolutional network has two convolutional layers followed by a fully connected hidden layer with 32 units; each convolutional layer is followed by a max-pooling layer. For the CIFAR10 dataset, we apply WideResNet-34 (Zagoruyko & Komodakis, 2016) with widen factors 1, 2, 5 and 10. We use standard projected gradient descent (PGD) (Madry et al., 2017) to train the networks by adversarial training, with the classical $\ell_\infty$-perturbation and standard radii for MNIST and CIFAR10. All models are trained for 100 epochs on MNIST and 200 epochs on CIFAR10 using a single NVIDIA RTX 4090 GPU.
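For reproducibility of the model-size sweep, a widened LeNet-style network can be constructed as below; the base channel counts are assumed (only the 32-unit fully connected layer at size 1 is stated above), so the sketch is a plausible stand-in rather than the exact architecture.

```python
import torch.nn as nn

def widened_lenet(size: int = 1, base_channels: int = 4, fc_units: int = 32):
    """LeNet-style CNN for 28x28 MNIST whose conv filters and hidden FC layer
    are scaled by `size` (base channel counts are assumed, not from the paper)."""
    c1, c2, fc = base_channels * size, 2 * base_channels * size, fc_units * size
    return nn.Sequential(
        nn.Conv2d(1, c1, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(c1, c2, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(c2 * 7 * 7, fc), nn.ReLU(),
        nn.Linear(fc, 10),
    )

for s in (1, 4, 8, 12, 16):
    n_params = sum(p.numel() for p in widened_lenet(s).parameters())
    print(f"size {s:2d}: {n_params:,} parameters")
```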
Experiment Results. We report the performance (clean test accuracy, robust test accuracy, and robust training accuracy) of models with different sizes in Table 1 and Figure 2 (a)(b). When the model size becomes larger, the robust training loss first decreases while the robust generalization gap remains large; when the model gets even larger, the robust generalization gap gradually decreases. We also find that, in the small-size cases (LeNet with size numbers 1, 4 and 8), adversarial training converges to a trivial solution that always predicts a fixed class, while an accurate clean classifier can still be learned through standard training. These observations correspond to the theoretical results in Theorem 4.4 and Theorem 4.7.
6.2 Synthetic Structured Data
Experiment Settings. Here we generate synthetic data exactly following Definition 5.1 and apply the two-layer convolutional network of Definition 5.3. We train the network with the adversarial training algorithm described in Definition 5.6, choosing a set of hyperparameters that is feasible under Assumption 5.7. We characterize true feature learning and noise memorization by computing the signal correlation and the smallest/largest/average noise correlations over training instances. We compute the robust training and test accuracy of the model using the standard PGD attack.
Table 2: Clean and robust accuracy (%) of the two-layer network on synthetic structured data.

| Accuracy | Train | Test |
|---|---|---|
| Clean | 100.0 | 98.5 |
| Robust | 100.0 | 17.5 |
Experiment Results. We plot the dynamics of true feature learning and noise memorization in Figure 2 (c). A three-stage phase transition clearly happens during adversarial training, which is consistent with our theoretical analysis of the learning process (Lemma 5.11, Lemma 5.12 and Lemma 5.13); finally, the model partially learns the true feature but exactly memorizes all training data (Theorem 5.9). The performance of the model is presented in Table 2, which shows that the network eventually converges to a CGRO classifier.
7 Conclusion
To summarize, we study the puzzling clean generalization and robust overfitting (CGRO) phenomenon in adversarial training and present theoretical explanations. From the perspective of representation complexity, we prove that a CGRO classifier is efficient to achieve by leveraging robust memorization of the training data, while robust generalization requires excessively large model capacity in the worst case, which may lead adversarial training to the CGRO regime. From the perspective of training dynamics, we propose a three-stage analysis technique to analyze the feature learning process of adversarial training under our structured-data framework, and it shows that a two-layer neural network trained by adversarial training provably learns the partial true feature but memorizes the random noise from the training data, which thereby causes the CGRO regime. On the empirical side, we confirm our theoretical findings on real-world vision datasets and through synthetic data simulation.
References
- Allen-Zhu & Li (2020a) Allen-Zhu, Z. and Li, Y. Backward feature correction: How deep learning performs deep learning. arXiv preprint arXiv:2001.04413, 2020a.
- Allen-Zhu & Li (2020b) Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020b.
- Allen-Zhu & Li (2022) Allen-Zhu, Z. and Li, Y. Feature purification: How adversarial training performs robust deep learning. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 977–988. IEEE, 2022.
- Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
- Anthony et al. (1999) Anthony, M., Bartlett, P. L., Bartlett, P. L., et al. Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, Cambridge, 1999.
- Arora et al. (2019) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. PMLR, 2019.
- Balaji et al. (2019) Balaji, Y., Goldstein, T., and Hoffman, J. Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets. arXiv preprint arXiv:1910.08051, 2019.
- Bartlett et al. (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
- Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
- Biggio et al. (2013) Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp. 387–402. Springer, 2013.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Bubeck & Sellke (2023) Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. Journal of the ACM, 70(2):1–18, 2023.
- Cao et al. (2021) Cao, Y., Gu, Q., and Belkin, M. Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures. Advances in Neural Information Processing Systems, 34:8407–8418, 2021.
- Chen et al. (2021) Chen, J., Cao, Y., and Gu, Q. Benign overfitting in adversarially robust linear classification. arXiv preprint arXiv:2112.15250, 2021.
- Chen et al. (2022) Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813, 2022.
- Chidambaram et al. (2023) Chidambaram, M., Wang, X., Wu, C., and Ge, R. Provably learning diverse features in multi-view data with midpoint mixup. In International Conference on Machine Learning, pp. 5563–5599. PMLR, 2023.
- Dan et al. (2020) Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guarantees for adversarially robust gaussian classification. In International Conference on Machine Learning, pp. 2345–2355. PMLR, 2020.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Frei et al. (2022) Frei, S., Chatterji, N. S., and Bartlett, P. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pp. 2668–2703. PMLR, 2022.
- Gao et al. (2019) Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and Lee, J. D. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Gowal et al. (2021) Gowal, S., Rebuffi, S.-A., Wiles, O., Stimberg, F., Calian, D. A., and Mann, T. A. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.
- Hassani & Javanmard (2022) Hassani, H. and Javanmard, A. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. arXiv preprint arXiv:2201.05149, 2022.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Javanmard & Soltanolkotabi (2022) Javanmard, A. and Soltanolkotabi, M. Precise statistical analysis of classification accuracies for adversarial training. The Annals of Statistics, 50(4):2127–2156, 2022.
- Jelassi & Li (2022) Jelassi, S. and Li, Y. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning, pp. 9965–10040. PMLR, 2022.
- Jelassi et al. (2022) Jelassi, S., Sander, M., and Li, Y. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.
- Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-lojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795–811. Springer, 2016.
- Kileel et al. (2019) Kileel, J., Trager, M., and Bruna, J. On the expressive power of deep polynomial neural networks. Advances in neural information processing systems, 32, 2019.
- Kirillov et al. (2023) Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Kou et al. (2023) Kou, Y., Chen, Z., Chen, Y., and Gu, Q. Benign overfitting for two-layer relu networks. arXiv preprint arXiv:2303.04145, 2023.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li et al. (2022) Li, B., Jin, J., Zhong, H., Hopcroft, J. E., and Wang, L. Why robust generalization in deep learning is difficult: Perspective of expressive power. arXiv preprint arXiv:2205.13863, 2022.
- Li et al. (2019) Li, Y., Fang, E. X., Xu, H., and Zhao, T. Inductive bias of gradient descent based adversarial training on separable data. arXiv preprint arXiv:1906.02931, 2019.
- Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pang et al. (2022) Pang, T., Lin, M., Yang, X., Zhu, J., and Yan, S. Robustness and accuracy could be reconcilable by (proper) definition. In International Conference on Machine Learning, pp. 17258–17277. PMLR, 2022.
- Petzka et al. (2020) Petzka, H., Kamp, M., Adilova, L., Boley, M., and Sminchisescu, C. Relative flatness and generalization in the interpolation regime. arXiv preprint arXiv:2001.00939, 2020.
- Raghunathan et al. (2019) Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and Liang, P. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.
- Rice et al. (2020) Rice, L., Wong, E., and Kolter, Z. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093–8104. PMLR, 2020.
- Schmidt et al. (2018) Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. Advances in neural information processing systems, 31, 2018.
- Shafahi et al. (2019) Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
- Shen et al. (2022) Shen, R., Bubeck, S., and Gunasekar, S. Data augmentation as feature manipulation: a story of desert cows and grass cows. arXiv preprint arXiv:2203.01572, 2022.
- Silverman (2018) Silverman, B. W. Density estimation for statistics and data analysis. Routledge, 2018.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tsipras et al. (2018) Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
- Wong et al. (2020) Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.
- Xiao et al. (2022) Xiao, J., Fan, Y., Sun, R., Wang, J., and Luo, Z.-Q. Stability analysis and generalization bounds of adversarial training. arXiv preprint arXiv:2210.00960, 2022.
- Xing et al. (2021) Xing, Y., Song, Q., and Cheng, G. On the algorithmic stability of adversarial training. Advances in neural information processing systems, 34:26523–26535, 2021.
- Yang et al. (2020) Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R., and Chaudhuri, K. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588–8601, 2020.
- Yarotsky (2017) Yarotsky, D. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
- Yu et al. (2022) Yu, C., Han, B., Shen, L., Yu, J., Gong, C., Gong, M., and Liu, T. Understanding robust overfitting of adversarial training and beyond. In International Conference on Machine Learning, pp. 25595–25610. PMLR, 2022.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2017.
- Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472–7482. PMLR, 2019.
Appendix A Additional Experiments Regarding Robust Memorization
In this section, we demonstrate that, on real image datasets, adversarial training converges to classifiers similar to the construction in Lemma 4.5, which results in CGRO. In fact, we need to verify that models trained by adversarial training tend to memorize data by approximating local robust indicators around the training data.
Concretely, for a given loss $\ell$, instance $(\boldsymbol{x}, y)$ and model $f$, we use two measurements: the maximum gradient norm within a neighborhood of a training point,
$$\max_{\boldsymbol{x}' \in \mathbb{B}(\boldsymbol{x}, \rho)} \big\|\nabla_{\boldsymbol{x}'}\, \ell\big(f(\boldsymbol{x}'), y\big)\big\|,$$
and the maximum loss function value change,
$$\max_{\boldsymbol{x}' \in \mathbb{B}(\boldsymbol{x}, \rho)} \big|\ell\big(f(\boldsymbol{x}'), y\big) - \ell\big(f(\boldsymbol{x}), y\big)\big|.$$
The former measures the local flatness of the loss landscape around a training point, and the latter measures the local adversarial robustness at that point; both describe key information about the loss landscape over the input.
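Both measurements can be approximated by projected gradient ascent over the perturbation ball; the PyTorch sketch below is one such estimator (the gradient-ascent search and step sizes are our choices, not the exact procedure used in the experiments).

```python
import torch
import torch.nn.functional as F

def local_landscape_stats(model, x, y, rho, steps=10):
    """Estimate the max gradient norm and max loss change within an l_inf ball
    of radius rho around (x, y), via projected gradient ascent on the loss."""
    base_loss = F.cross_entropy(model(x), y).item()
    x_adv = x.clone().detach().requires_grad_(True)
    max_grad_norm, max_loss_change = 0.0, 0.0
    for _ in range(steps):
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        max_grad_norm = max(max_grad_norm, grad.norm().item())
        max_loss_change = max(max_loss_change, abs(loss.item() - base_loss))
        with torch.no_grad():
            x_adv += (rho / steps) * grad.sign()            # ascent step
            x_adv.copy_(x + (x_adv - x).clamp(-rho, rho))   # project to the ball
        x_adv.requires_grad_(True)
    return max_grad_norm, max_loss_change
```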
Experiment Settings. In the numerical experiments, we mainly focus on two common real-image datasets, MNIST and CIFAR10. During adversarial training, we use cyclical learning rates and the mixed-precision technique (Wong et al., 2020). For the MNIST dataset, we use a LeNet architecture and train for 20 epochs in total. For the CIFAR10 dataset, we use a ResNet architecture and train for 15 epochs in total.
Numerical Results. First, we apply the adversarial training method to train models with a fixed perturbation radius, and then we compute the empirical average of the maximum gradient norm and the maximum loss change on training data within different perturbation radii. The numerical results in Figure 3 (a–d) show that the loss landscape is flat within the training radius but very sharp outside it, which practically demonstrates our conjecture on real image datasets.
Learning Process. We also study the dynamics of the loss landscape over the input during the adversarial learning process. Thus, we compute the empirical average of the maximum gradient norm within different perturbation radii and at different training epochs. The numerical results are plotted in Figure 3 (e–h). On both MNIST and CIFAR10, as the number of epochs increases, the training curve descends within the training perturbation radius, which implies that the models learn local robust indicators to robustly memorize training data. However, the test curve of CIFAR10 instead ascends within the training radius, which is consistent with our theoretical analysis in Section 5.
Robust Generalization Bound. Moreover, we prove a robust generalization bound based on the global flatness of the loss landscape (see Appendix B). We show that, while adversarial training achieves local flatness by robust memorization, the model lacks global flatness, which causes robust overfitting.
Appendix B Robust Generalization Bound Based on Global Flatness
In this section, we prove a novel robust generalization bound that mainly depends on the global flatness of the loss landscape. We consider adversarial robustness with perturbation radius $\delta$, and we use $\mathcal{R}(f)$, $\mathcal{R}_{\mathrm{adv}}(f)$ and $\widehat{\mathcal{R}}_{\mathrm{adv}}(f)$ to denote the clean test risk, the adversarial test risk and the adversarial empirical risk w.r.t. the model $f$, respectively. We also impose an additional assumption for the next results.
Theorem B.1.
(Robust Generalization Bound) Let be the underlying distribution with a smooth density function, and sample training dataset is i.i.d. drawn from . Then, with high probability, it holds that,
This generalization bound shows that the robust generalization gap can be dominated by the global flatness of the loss landscape. We also have a lower bound on the robust generalization gap, stated as follows.
Proposition B.2.
Let be the underlying distribution with a smooth density function, then we have
Theorem B.1 and Proposition B.2 show that the robust generalization gap is closely related to global flatness. However, although adversarial training achieves good local flatness by robust memorization of the training data, the model lacks global flatness, which leads to robust overfitting.
This point is also verified by numerical experiments on CIFAR10 (see Figure 4). First, global flatness grows much faster than local flatness in practice. Second, as global flatness increases during the training process, the robust generalization gap increases as well.
Appendix C Preliminary Lemmas
Lemma C.1.
Let be a positive sequence defined by the following recursions
where is the initialization and . Let such that . Then, the time such that for all is:
Lemma C.2.
Let be a positive sequence defined by the following recursion
where and is the initialization. Assume that . Let such that . Then, the time t such that is upper bounded as:
Lemma C.3.
Let . Let be a non-negative sequence that satisfies the recursion: , for . Then, it is bounded at a time as
Then, we provide a probability inequality proved by Jelassi & Li (2022).
Lemma C.4.
Let be vectors in such that there exist a unit norm vector that satisfies . Then, for i.i.d., we have:
Next, we introduce some concepts about learning theory.
Definition C.5 (growth function).
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{-1, +1\}$. For any integer $m$, we define the growth function of $\mathcal{H}$ to be
$$\Pi_{\mathcal{H}}(m) := \max_{x_1, \dots, x_m \in \mathcal{X}} \big|\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}\big|.$$
In particular, if $\Pi_{\mathcal{H}}(m) = 2^m$, then the set $\{x_1, \dots, x_m\}$ is said to be shattered by $\mathcal{H}$.
Definition C.6 (Vapnik-Chervonenkis dimension).
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{-1, +1\}$. The VC-dimension of $\mathcal{H}$, denoted by $\operatorname{VCdim}(\mathcal{H})$, is defined as the largest integer $m$ such that $\Pi_{\mathcal{H}}(m) = 2^m$. For a real-valued function class $\mathcal{F}$, we define $\operatorname{VCdim}(\mathcal{F}) := \operatorname{VCdim}(\operatorname{sgn}(\mathcal{F}))$.
The following result gives a nearly-tight upper bound on the VC-dimension of neural networks.
Lemma C.7.
(Bartlett et al., 2019) Consider a ReLU network with $L$ layers and $W$ total parameters. Let $\mathcal{F}$ be the set of (real-valued) functions computed by this network. Then we have $\operatorname{VCdim}(\mathcal{F}) = O(W L \log W)$.
The growth function is connected to the VC-dimension via the following lemma; see e.g. Anthony et al. (1999).
Lemma C.8.
Suppose that $\operatorname{VCdim}(\mathcal{H}) = k$; then $\Pi_{\mathcal{H}}(m) \le \sum_{i=0}^{k} \binom{m}{i}$. In particular, we have $\Pi_{\mathcal{H}}(m) \le (em/k)^{k}$ for all $m > k + 1$.
Appendix D Feature Learning
In this section, we provide a fuller introduction to feature learning, which is widely applied in theoretical works (Allen-Zhu & Li, 2020a, b, 2022; Shen et al., 2022; Jelassi & Li, 2022; Jelassi et al., 2022; Chen et al., 2022) to explore what and how neural networks learn in different tasks. In this work, we take a first step in leveraging feature learning theory to explain the CGRO phenomenon in adversarial training. Specifically, for an arbitrary clean training data point and a given model $f_{\boldsymbol{W}}$, we focus on
- True Feature Learning. We project the weights $\boldsymbol{w}_r$ onto the meaningful signal vector $\boldsymbol{u}$ to measure the correlation between the model and the true feature.
- Spurious Feature Learning. We project the weights $\boldsymbol{w}_r$ onto the random noise $\boldsymbol{\xi}_i$ to measure the correlation between the model and the spurious feature.
We then consider the model's classification correctness on a given clean data point, measured by the margin $y f_{\boldsymbol{W}}(\boldsymbol{x})$. Thus, the model correctly classifies the data point if and only if $y f_{\boldsymbol{W}}(\boldsymbol{x}) > 0$, which holds in at least two cases. One is that the model learns the true feature and ignores the spurious features. The other is that the model does not learn the true feature but memorizes the spurious features.
Therefore, this analysis tells us that the model will generalize well to unseen data if it learns the true feature. However, the model will overfit the training data if it only memorizes spurious features, since the data-specific random noises are independent across distinct instances, which means that, with high probability, the noise memorized on training data is nearly uncorrelated with the noise of unseen data.
We also consider the model's classification correctness on a perturbed data point, where we use the attack proposed in Definition 5.6 to generate the adversarial example. The model correctly classifies the perturbed data point if and only if the margin on the adversarial example is positive, and the analysis then proceeds in the same way as for clean data.
Appendix E Proof for Section 5
In this section, we present the full proof for Section 5. To simplify the proof, we focus on a special case, which can be easily extended to the general setting. We also need the following re-defined notation.
For each $r \in [m]$, $i \in [n]$ and iteration $t$, we define the signal component and the noise component of the filter $\boldsymbol{w}_r^{(t)}$ as follows.
Signal Component: the projection of $\boldsymbol{w}_r^{(t)}$ onto the signal direction $\boldsymbol{u}$, i.e., $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{u} \rangle$.
Noise Component: the projection of $\boldsymbol{w}_r^{(t)}$ onto the noise vector $\boldsymbol{\xi}_i$, i.e., $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{\xi}_i \rangle$.
First, we give detailed proofs of Lemma 5.11, Lemma 5.12 and Lemma 5.13. Then, we prove Theorem 5.9 based on the above lemmas.
We prove our main results by induction. More specifically, we make the following assumptions for each iteration.
Hypothesis E.1.
Throughout the learning process using the adversarial training update for , we maintain that:
- (Uniform Bound for Signal Component) For each $r \in [m]$, we assume that the signal component $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{u} \rangle$ remains uniformly bounded.
- (Uniform Bound for Noise Component) For each $r \in [m]$ and $i \in [n]$, we assume that the noise component $\langle \boldsymbol{w}_r^{(t)}, \boldsymbol{\xi}_i \rangle$ remains uniformly bounded.
In what follows, we assume that these induction hypotheses hold up to the current iteration in order to prove our main results; we then verify the hypotheses for the next iteration in Lemma E.11.
Now, we first give proof details about Lemma 5.11.
Theorem E.2.
Proof.
First, we calculate the gradient of adversarial loss with respect to as
Then, we project the gradient descent update onto the signal vector $\boldsymbol{u}$ and derive the following result.
where the last inequality follows from a bound obtained via Hypothesis E.1.
∎
Consequently, we have the following result that shows the order of maximum signal component.
Lemma E.3.
During adversarial training, with high probability, it holds that, after iterations, for all , we have .
Proof.
From the proof of Theorem E.2, we know that
By applying Hypothesis E.1, we can simplify the above equation to the following inequalities.
where and are respectively defined as:
At initialization, since the weights are sampled from a zero-mean Gaussian, the initial signal components are i.i.d. zero-mean Gaussian random variables, which implies that, with high probability, at least one of them is non-negative.
Thus, with high probability, there exists a filter with a non-negative initial signal component. By using the tensor power method (Lemma C.1), we obtain the threshold iteration.
∎
Next, we prove Lemma 5.12 to give an upper bound of signal components’ growth.
Theorem E.4.
(Restatement of Lemma 5.12) For and any , the signal component is upper bounded as
Proof.
First, we analyze the upper bound of derivative generated by clean data. By following the proof of Theorem E.2, we know that, for each ,
where we obtain the first inequality by definition, and the last inequality uses the bound derived in Lemma E.3. Thus, we then have
Now, we focus on . By the non-decreasing property of , we have
where we use the logistic function defined as $\varsigma(z) = 1/(1 + \exp(-z))$, and we derive the last inequality by Hypothesis E.1. Then, we know
Then, we derive the following result by Hypothesis E.1 and the above inequality.
By summing over the iterations, we obtain the following result.
Therefore, we derive the conclusion of Theorem E.4. ∎
Next, we prove the following theorem about the update of noise components.
Lemma E.5.
For each , and , any iteration such that , with high probability, it holds that
where we use the notation to denote .
Proof.
To obtain Lemma E.5, we prove the following stronger result by induction w.r.t. iteration .
First, we project the training update on noise patch to verify the above inequality when as
where we apply and to derive the last inequality.
Next, we assume that the stronger result holds for the current iteration, and we then prove it for the next iteration as follows.
Then, we bound the first term on the right-hand side of the above inequality by our induction hypothesis, and we can derive
By summing up the terms, we prove the stronger result for the next iteration.
Finally, we simplify the form of the stronger result, which implies the conclusion of Lemma E.5. ∎
Theorem E.6.
(Restatement of Lemma 5.13) For each , and and any , the noise component grows as
Proof.
Consequently, we derive the upper bound on the total noise components as follows.
Lemma E.7.
During adversarial training, with high probability, it holds that, after iterations, for all and each , we have .
Proof.
By applying Lemma E.5 in the same way as in the proof of Theorem E.6, we know that
which implies that, for any iteration , we have
where are constants defined as
At initialization, since the weights are sampled from a zero-mean Gaussian, the initial noise components are i.i.d. zero-mean Gaussian random variables, which implies that, with high probability, there exists at least one index with a non-negative noise component. By using the tensor power method (Lemma C.2), we obtain the threshold iteration.
Therefore, we get , and we use to derive . ∎
Indeed, our objective function is non-convex due to the non-linearity of our CNN model. To analyze the convergence of the gradient algorithm, we need to prove the following condition, which is used to establish global convergence in non-convex optimization (Karimi et al., 2016; Li et al., 2019).
Lemma E.8.
(Lojasiewicz Inequality for Non-convex Optimization) During adversarial training, with high probability, it holds that, after iterations, for all , we have
Proof.
To prove Lojasiewicz Inequality, we first recall the gradient w.r.t. as
Then, we project the gradient on the signal direction and total noise, respectively.
For the signal component, we have
For the total noise component, we have
For the first term, with high probability, it holds that
where we use that is a sub-Gaussian random variable of parameter , which implies w.h.p. .
For the second term, with high probability, it holds that
where we use w.h.p. for .
Now, combining the above bounds, we derive
∎
Consequently, we derive the following sub-linear convergence result by applying Lojasiewicz Inequality.
Lemma E.9.
(Sub-linear Convergence for Adversarial Training) During adversarial training, with high probability, it holds that, after iterations, the adversarial training loss sub-linearly converges to zero as
Proof.
Due to the smoothness of the loss function and the choice of the learning rate, we have
where we use Lojasiewicz Inequality in the last inequality. Then, by applying Tensor Power Method (Lemma C.3), we obtain the sub-linear convergence rate. ∎
Now, we present the following result to bound the derivative generated by training-adversarial examples.
Lemma E.10.
During adversarial training, with high probability, it holds that, after iterations, we have
Proof.
First, we bound the total derivative during iteration . By applying the conclusion of Lemma E.5, we have
Due to , we know
Thus, we obtain ∎
Consequently, we have the following lemma that verifies Hypothesis E.1 for .
Lemma E.11.
During adversarial training, with high probability, it holds that, for any , we have and for each .
Proof.
Finally, we prove our main result as follows.
Theorem E.12.
(Restatement of Theorem 5.9) Under Assumption 5.7, we run the adversarial training algorithm to update the weights of the simplified CNN model for $T$ iterations. Then, with high probability, the CNN model
1. partially learns the true feature, and
2. exactly memorizes the spurious feature of each training instance,
where the signal and noise correlations are defined for the $i$-th instance and the $t$-th iteration as in Definition 5.8. Consequently, the clean test error and robust training error are both vanishingly small, but the robust test error is bounded away from zero.
Proof.
First, by applying Lemma E.3, Lemma E.7 and Lemma E.11, we know for any
Then, since the adversarial loss converges sub-linearly to zero, the robust training error is also small.
To analyze test errors, we decompose into for each , where . Due to , we know .
For the clean test error, we have
where we use the fact that is a sub-Gaussian random variable with parameter .
Appendix F Proof for Section 4
Theorem F.1.
(Restatement of Theorem 4.4) Under Assumptions 4.1, 4.2 and 4.3, for a training dataset $\mathcal{S}$ sampled from the data distribution $\mathcal{D}$, there exists a CGRO classifier that can be represented as a ReLU network with only a linear number of extra parameters beyond the clean classifier of Assumption 4.3, which means that, under the distribution $\mathcal{D}$ and the dataset $\mathcal{S}$, the network achieves zero clean test and robust training errors but its robust test error is bounded away from zero.
Proof.
First, we give the following useful results about function approximation by ReLU nets.
Lemma F.2.
(Yarotsky, 2017) The function $f(x) = x^2$ on the segment $[0, 1]$ can be approximated with any error $\varepsilon > 0$ by a ReLU network having depth and number of weights and computation units $O(\log(1/\varepsilon))$.
Lemma F.3.
(Yarotsky, 2017) Let $M > 0$ and $\varepsilon \in (0, 1)$ be given. There exists a function $\widetilde{\times}(x, y)$ computed by a ReLU network with $O(\log(1/\varepsilon) + \log M)$ parameters such that
$$\sup_{|x|, |y| \le M} \big|\widetilde{\times}(x, y) - x y\big| \le \varepsilon,$$
and $\widetilde{\times}(x, y) = 0$ if $x y = 0$.
Since the squared distance to a training point is a sum of squares of coordinates, by using Lemma F.2 there exists a function computed by a ReLU network with a moderate number of parameters that approximates the distance function up to the required accuracy.
Returning to our main proof, functions computed by ReLU networks are piecewise linear, whereas indicator functions are not continuous, so we relax the indicator to a soft indicator that equals $1$ inside the target region, equals $0$ outside a slightly enlarged region, and is linear in between, using only two ReLU neurons, where the transition width is sufficiently small for the approximation.
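Concretely, one standard way to realize such a soft indicator of the event $\{t \le \rho\}$ with two ReLU neurons is the following, where the threshold $\rho$ and transition width $\varepsilon$ are illustrative symbols:
\[
\psi_{\rho,\varepsilon}(t)
  \;=\; \frac{1}{\varepsilon}\Big(\operatorname{ReLU}(\rho+\varepsilon-t)
        \;-\;\operatorname{ReLU}(\rho-t)\Big)
  \;=\;
  \begin{cases}
    1, & t \le \rho,\\
    \dfrac{\rho+\varepsilon-t}{\varepsilon}, & \rho < t < \rho+\varepsilon,\\
    0, & t \ge \rho+\varepsilon.
  \end{cases}
\]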
Next, we prove Theorem 4.7 by using the VC-dimension theory.
Theorem F.4.
Proof.
Now, we note that ReLU networks compute piecewise linear functions. Montufar et al. (2014) study the number of local linear regions, which provides the following result.
Proposition F.5.
The maximal number of linear regions of the functions computed by any ReLU network with a total of $N$ hidden units is bounded from above by $2^{N}$.
Thus, for a given clean classifier represented by a ReLU net with the stated number of parameters, there exists at least one local region in which the decision boundary of the classifier is a linear hyperplane, and we fix such a hyperplane.
Then, let be the projection of on the decision boundary of , and be an -packing of . Since the packing number , where is the -covering number of a set . For any , we can consider the construction
where is an arbitrary mapping. It’s easy to see that all points in with first components satisfying are in , so that by choosing sufficiently small, we can guarantee that . For convenience we just replace with from now on.
Let , . It’s easy to see that for arbitrary , the construction is linear-separable and satisfies -separability.
Assume that for any choices of , the induced sets and can always be robustly classified with -accuracy by a ReLU network with at most parameters. Then, we can construct an enveloping network with hidden layers, neurons per layer and at most parameters such that any network with size can be embedded into this envelope network. As a result, is capable of -robustly classify any sets induced by arbitrary choices of . We use to denote the subset of satisfying such that can be -robustly classified.
Next, we estimate lower and upper bounds for the cardinality of the vector set
Let denote , then we have
where .
On one hand, we know that for any , there exists a such that , where denotes the Hamming distance, then we have
On the other hand, by applying Lemma C.8, we have
where is the VC-dimension of . In fact, we can derive when is a small constant. Assume that , then we have and , so
We define a function as , then we derive
When is sufficient small, that is a constant only depending on , which implies . Finally, by using Lemma C.7 and , we know . ∎
Appendix G Proof for Section B
Theorem G.1.
(Restatement of Theorem B.1) Let be the underlying distribution with a smooth density function, and sample training dataset is i.i.d. drawn from . Then, with high probability, it holds that,
Proof.
Indeed, we notice the following loss decomposition,
To bound the first term, we apply the kernel density estimation (KDE) technique proposed in Petzka et al. (2020) and derive
where is the density function of the distribution , and is the KDE of point .
By the smoothness of the density function of the distribution and standard results on kernel density estimation (Silverman, 2018), we can bound this term.
For (II), by using Chebyshev's inequality and Silverman (2018), with high probability, we have
On the other hand, by Taylor expansion, we know
Combining the bounds for (I) and (II), we can derive Theorem B.1. ∎